Stale pickled PunktSentenceTokenizer in nltk_data/ #123

Open
advgiarc opened this issue Oct 2, 2018 · 0 comments
advgiarc commented Oct 2, 2018

It appears the pickled tokenizers are stale and do not reflect the current code.

https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip

The .zip that is downloaded is older than the source code:

https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py

There are a few changes in punkt.py since the .zip was created that seem to improve the tokenization of sentences around abbreviations.

```python
import os
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

s = "Alabama Gov. Kay Ivey was asked this morning if she supported the confirmation of U.S. Circuit Judge Brett Kavanaugh to the Supreme Court. Ivey spoke with reporters this morning after a press conference about the state's new Security Operations Center and cybersecurity website."
abbrevs = ['u.s', 'gov']

# Download the distributed punkt models into a local nltk_data/ directory.
nltk.data.path.append(f'{os.getcwd()}/nltk_data')
nltk.download('punkt', 'nltk_data')

# Tokenizer loaded from the distributed (stale) pickle.
pickled_tokenizer = nltk.data.load('nltk_data/tokenizers/punkt/PY3/english.pickle')
pickled_tokenizer._params.abbrev_types.update(abbrevs)
print(pickled_tokenizer.sentences_from_text(s))

# Tokenizer constructed directly from the current punkt.py source.
fresh_tokenizer = PunktSentenceTokenizer()
fresh_tokenizer._params.abbrev_types.update(abbrevs)
print(fresh_tokenizer.sentences_from_text(s))
```
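
One way to sidestep the stale pickle entirely is to retrain Punkt with the current code via `PunktTrainer`. This is a hedged sketch, not part of the report above; the Gutenberg corpus and trainer settings are illustrative assumptions, and it continues from the snippet above (reusing `s`, `abbrevs`, and the local `nltk_data/` path):

```python
# Hedged sketch: retrain Punkt with the current punkt.py instead of loading
# the shipped pickle. Corpus choice and trainer settings are illustrative only.
from nltk.tokenize.punkt import PunktTrainer

nltk.download('gutenberg', 'nltk_data')
raw = nltk.corpus.gutenberg.raw('austen-emma.txt')

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True      # also learn collocations, not just abbreviations
trainer.train(raw, finalize=True)

retrained = PunktSentenceTokenizer(trainer.get_params())
retrained._params.abbrev_types.update(abbrevs)
print(retrained.sentences_from_text(s))
```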
