KMiNT21/html2sent: HTML2SENT modifies HTML to improve sentence tokenizer quality

This library preprocesses HTML content, modifying the text inside certain tags to improve the quality of sentence tokenization.

Install the NLTK Python package

pip install nltk

Download the Punkt data

import nltk
nltk.download('punkt')

Download this library

git clone https://github.com/KMiNT21/html2sent.git

Usage

import html2sent
sentences = html2sent.tokenize(html, language='english')

If you don't want to use NLTK, you can call only the preprocessing functions:

import html2sent
text = html2sent.html2text(html)
text = html2sent.preprocess_text(text)

Demos: demo_simple.py and demo_folder_multiprocessing.py
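To illustrate why HTML-aware preprocessing helps a sentence tokenizer, here is a minimal standard-library sketch. It is not html2sent's implementation: the class `BoundaryMarker`, the function `mark_block_boundaries`, and the tag lists are assumptions made for this example. The idea is that headings and list items usually lack terminal punctuation, so once the tags are stripped a tokenizer tends to glue them to the following text; appending a period at the end of each block avoids that.

```python
from html.parser import HTMLParser

# Tags whose boundaries count as sentence boundaries in this sketch.
# This list is an assumption, not the one html2sent uses.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "td", "th", "div"}
# Tags whose text content is dropped entirely.
SKIP_TAGS = {"script", "style"}
TERMINATORS = (".", "!", "?", ":", ";")


class BoundaryMarker(HTMLParser):
    """Collect text per block element and end each block with punctuation."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []  # finished text blocks
        self._buf = []    # raw text of the block currently being read
        self._skip = 0    # nesting depth inside script/style

    def flush(self):
        # Collapse whitespace, then close the block with a period if needed.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if not text:
            return
        if not text.endswith(TERMINATORS):
            text += "."
        self.chunks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip:
                self._skip -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def mark_block_boundaries(html: str) -> str:
    """Return plain text in which every block element ends a sentence."""
    parser = BoundaryMarker()
    parser.feed(html)
    parser.close()
    parser.flush()  # text after the last block tag, if any
    return " ".join(parser.chunks)
```

For example, `mark_block_boundaries("<h1>Install</h1><p>Run the setup script. Then restart</p>")` returns `"Install. Run the setup script. Then restart."`, which a sentence tokenizer can split cleanly into three sentences.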

For the Russian language

If you use NLTK to split the resulting text into sentences, Russian additionally requires downloading a trained ru_punkt tokenizer.

Options:

git clone https://github.com/mhq/train_punkt.git
git clone https://github.com/Mottl/ru_punkt.git

Copy the russian.pickle file into the nltk_data folder (alongside the other language .pickle files).
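The copy step can be scripted. The sketch below uses only the standard library; the function name `install_punkt_model` is hypothetical, and it assumes the legacy Punkt layout in which language models live under `tokenizers/punkt` inside the nltk_data folder.

```python
import shutil
from pathlib import Path


def install_punkt_model(pickle_path, nltk_data_dir):
    """Copy a Punkt .pickle model next to the other language models."""
    source = Path(pickle_path)
    if not source.is_file():
        raise FileNotFoundError(f"model file not found: {source}")
    # Assumed legacy layout: <nltk_data>/tokenizers/punkt/<language>.pickle
    target_dir = Path(nltk_data_dir) / "tokenizers" / "punkt"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)
    return target
```

The directories NLTK searches are listed in `nltk.data.path`, so one of those is a reasonable value for `nltk_data_dir`. This applies to NLTK versions that still load pickled Punkt models.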

A more accurate alternative is the razdel library.

Usage details: https://github.com/natasha/razdel
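A minimal sketch of sentence splitting with razdel, assuming it has been installed with `pip install razdel`. The wrapper `split_sentences_ru` and its `sentenize` parameter are hypothetical names introduced here; the parameter only exists so that a different splitter can be swapped in.

```python
from typing import Callable, Iterable, List, Optional


def split_sentences_ru(
    text: str,
    sentenize: Optional[Callable[[str], Iterable]] = None,
) -> List[str]:
    """Return the sentences of `text` as plain strings."""
    if sentenize is None:
        # Imported lazily so this module loads even without razdel installed.
        from razdel import sentenize
    # razdel yields substrings carrying start, stop and text; keep the text.
    return [sentence.text for sentence in sentenize(text)]
```

The input would typically be the output of `html2sent.preprocess_text(html2sent.html2text(html))`, so that NLTK is not needed at all.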

