Standalone Nori (Korean Morphological Analyzer in Apache Lucene) written in C++.
Elasticsearch provides nori, a high-quality, high-performance Korean morphological analyzer. However, nori's code is tightly coupled with the Lucene codebase, and it is written in Java, the main language of the Lucene project. That makes it hard to use nori standalone from Python or Golang with the same performance. I therefore re-implemented nearly the same algorithms as Lucene's nori in C++ for portability and usability.
This project is written in C++, and it also provides Python and Golang bindings.
The dictionary/ directory holds the pre-built dictionary files used for distribution and test cases. For now, there are two pre-built dictionaries, legacy and latest.
The legacy dictionary does not normalize inputs and is built with mecab-ko-dic-2.0.3-20170922, the same version the original nori uses.
The latest dictionary normalizes inputs to the NFKC form and is built with mecab-ko-dic-2.1.1-20180720.
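
To illustrate what NFKC normalization means for input text, here is a minimal sketch using only Python's standard `unicodedata` module. It is not this project's code and does not show how the dictionary builder or analyzer applies normalization internally; `normalize_nfkc` is a hypothetical helper name chosen for the example.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Return text in Unicode Normalization Form KC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


# Decomposed Hangul jamo (U+1112 U+1161 U+11AB) compose into one syllable.
decomposed = "\u1112\u1161\u11ab"
composed = normalize_nfkc(decomposed)
print(len(decomposed), len(composed))  # 3 code points before, 1 after

# Full-width Latin letters fold to their ASCII counterparts.
print(normalize_nfkc("\uff2e\uff4f\uff52\uff49"))
```

With a normalizing dictionary, visually identical inputs such as the decomposed and precomposed forms above would presumably be treated as the same text, whereas a dictionary that does not normalize sees them as different code point sequences.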
For more details, check out tools/benchmark.
Check out tools/comparison.
Check out CONTRIBUTING.md.