Stale pickled PunktSentenceTokenizer in nltk_data/ #123

Open
advgiarc opened this issue Oct 2, 2018 · 0 comments
advgiarc commented Oct 2, 2018

It appears the pickled tokenizers are stale and do not reflect the current code.

https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip

The .zip that is downloaded is older than the source code:

https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py

There are a few changes in punkt.py since the .zip was created that seem to improve the tokenization of sentences around abbreviations.

```python
import os
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

s = "Alabama Gov. Kay Ivey was asked this morning if she supported the confirmation of U.S. Circuit Judge Brett Kavanaugh to the Supreme Court. Ivey spoke with reporters this morning after a press conference about the state's new Security Operations Center and cybersecurity website."
abbrevs = ['u.s', 'gov']

# Download the distributed punkt models into a local nltk_data/ directory.
nltk.data.path.append(f'{os.getcwd()}/nltk_data')
nltk.download('punkt', 'nltk_data')

# Tokenizer loaded from the distributed (stale) pickle.
pickled_tokenizer = nltk.data.load('nltk_data/tokenizers/punkt/PY3/english.pickle')
pickled_tokenizer._params.abbrev_types.update(abbrevs)
print(pickled_tokenizer.sentences_from_text(s))

# Tokenizer constructed directly from the current punkt.py source.
fresh_tokenizer = PunktSentenceTokenizer()
fresh_tokenizer._params.abbrev_types.update(abbrevs)
print(fresh_tokenizer.sentences_from_text(s))
```
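
One way to sidestep the stale pickle entirely is to retrain Punkt with the current code via `PunktTrainer`. This is a hedged sketch, not part of the report above; the Gutenberg corpus and trainer settings are illustrative assumptions, and it continues from the snippet above (reusing `s`, `abbrevs`, and the local `nltk_data/` path):

```python
# Hedged sketch: retrain Punkt with the current punkt.py instead of loading
# the shipped pickle. Corpus choice and trainer settings are illustrative only.
from nltk.tokenize.punkt import PunktTrainer

nltk.download('gutenberg', 'nltk_data')
raw = nltk.corpus.gutenberg.raw('austen-emma.txt')

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True      # also learn collocations, not just abbreviations
trainer.train(raw, finalize=True)

retrained = PunktSentenceTokenizer(trainer.get_params())
retrained._params.abbrev_types.update(abbrevs)
print(retrained.sentences_from_text(s))
```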
