It appears the pickled tokenizers are old and do not contain the current code:
https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip
The downloaded .zip is older than the source code:
https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py
There have been a few changes in punkt.py since the .zip was created that seem to improve sentence tokenization around abbreviations. The snippet below compares the downloaded pickled tokenizer with one built directly from the current source:
```python
import os
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

s = "Alabama Gov. Kay Ivey was asked this morning if she supported the confirmation of U.S. Circuit Judge Brett Kavanaugh to the Supreme Court. Ivey spoke with reporters this morning after a press conference about the state's new Security Operations Center and cybersecurity website."
abbrevs = ['u.s', 'gov']

# Download the pickled punkt models into a local nltk_data directory.
nltk.data.path.append(f'{os.getcwd()}/nltk_data')
nltk.download('punkt', 'nltk_data')

# Tokenizer loaded from the downloaded pickle.
pickled_tokenizer = nltk.data.load('nltk_data/tokenizers/punkt/PY3/english.pickle')
pickled_tokenizer._params.abbrev_types.update(abbrevs)
print(pickled_tokenizer.sentences_from_text(s))

# Tokenizer constructed directly from the current punkt.py source.
fresh_tokenizer = PunktSentenceTokenizer()
fresh_tokenizer._params.abbrev_types.update(abbrevs)
print(fresh_tokenizer.sentences_from_text(s))
```
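
If the trained model itself is still usable, one possible workaround is to copy the learned parameters from the pickled object into a tokenizer instantiated from the current source. This is only a minimal sketch, not an official fix: it relies on the private `_params` attribute and reuses the `pickled_tokenizer`, `abbrevs`, and `s` variables from the snippet above.

```python
# Sketch of a possible workaround: keep the learned PunktParameters from the
# stale pickle, but run them through a PunktSentenceTokenizer constructed from
# the currently installed punkt.py.
workaround_tokenizer = PunktSentenceTokenizer()
workaround_tokenizer._params = pickled_tokenizer._params   # learned abbreviations, collocations, sentence starters
workaround_tokenizer._params.abbrev_types.update(abbrevs)  # re-apply the custom abbreviations
print(workaround_tokenizer.sentences_from_text(s))
```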