This is a pure Java port of taku's crfpp (also known as crf++), based on the code of crfpp-0.58.
Credit to komiya for his Java double-array trie implementation.
- pure Java with minimal dependencies (only commons-cli at runtime)
- command-line options and template/input formats compatible with crfpp
- load model from classpath
- text model format compatible with crfpp
- conversion from text model to (our) binary model and from (our) binary model to text model
- multi-threading support
- support for the CRF-L1, CRF-L2 and MIRA algorithms
- n-best output
- CRF model wrapper for API calls
- tests and a demo that show usage
Build:
mvn clean package
Run tests:
mvn test
To train a model:
java -cp crf4j-<version>-jar-with-dependencies.jar com.github.zhifac.crf4j.CrfLearn <template file> <train datafile> <model path>
For more options, run:
java -cp crf4j-<version>-jar-with-dependencies.jar com.github.zhifac.crf4j.CrfLearn -h
For details on the format of the template file and the training file, please refer to the original crfpp page.
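As a rough illustration of the two input files, the sketch below writes a tiny template and training file in the crfpp format (one token per line, whitespace-separated columns with the label last, an empty line between sentences) and assembles the training command. The class name `TrainingFilesSketch`, the file names, the toy data and the jar name are hypothetical; only `com.github.zhifac.crf4j.CrfLearn` and its argument order are taken from this README.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class TrainingFilesSketch {

    // crfpp-style template: U lines are unigram features, B is the bigram
    // feature over adjacent labels. %x[row,col] selects column `col` of the
    // token `row` lines away from the current token.
    static final List<String> TEMPLATE = Arrays.asList(
            "# Unigram",
            "U00:%x[-1,0]",
            "U01:%x[0,0]",
            "U02:%x[1,0]",
            "U03:%x[0,1]",
            "",
            "# Bigram",
            "B");

    // Toy training data (made up): word, POS tag, label.
    // An empty line ends each sentence.
    static final List<String> TRAIN = Arrays.asList(
            "John NNP B-PER",
            "lives VBZ O",
            "in IN O",
            "Paris NNP B-LOC",
            "",
            "Mary NNP B-PER",
            "sleeps VBZ O",
            "");

    // Builds the CrfLearn command line in the argument order shown in this README.
    static List<String> learnCommand(String jar, Path template, Path train, Path model) {
        return Arrays.asList("java", "-cp", jar, "com.github.zhifac.crf4j.CrfLearn",
                template.toString(), train.toString(), model.toString());
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crf4j-sketch");
        Path template = Files.write(dir.resolve("template"), TEMPLATE);
        Path train = Files.write(dir.resolve("train.data"), TRAIN);
        Path model = dir.resolve("model");
        // Hypothetical jar name; substitute the jar produced by your build.
        System.out.println(String.join(" ",
                learnCommand("crf4j-jar-with-dependencies.jar", template, train, model)));
    }
}
```

The printed command can be pasted into a shell. Real training data needs far more sentences than this toy file.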
To run a trained model on test data and print the output to the console:
java -cp crf4j-<version>-jar-with-dependencies.jar com.github.zhifac.crf4j.CrfTest -m <model path> <test datafile>
To write the output to a file:
java -cp crf4j-<version>-jar-with-dependencies.jar com.github.zhifac.crf4j.CrfTest -m <model path> <test datafile> -o <outputfile>
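Once an output file exists, a quick token-level accuracy check can be useful. The sketch below assumes the output follows the crfpp layout, where each token line repeats the input columns and appends the predicted label, so the gold label is the second-to-last column and the prediction is the last. It also assumes plain one-best output without extra header lines. `TokenAccuracySketch` is a hypothetical helper, not part of crf4j.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TokenAccuracySketch {

    // Fraction of token lines whose last column (predicted label)
    // equals the second-to-last column (gold label).
    static double accuracy(List<String> lines) {
        int total = 0;
        int correct = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // sentence boundary
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 2) {
                continue; // no gold/predicted pair on this line
            }
            total++;
            if (cols[cols.length - 1].equals(cols[cols.length - 2])) {
                correct++;
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) throws Exception {
        // With no argument, fall back to a made-up sample.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList(
                        "John NNP B-PER B-PER",
                        "lives VBZ O O",
                        "",
                        "Paris NNP B-LOC O");
        System.out.printf("token accuracy: %.4f%n", accuracy(lines));
    }
}
```

Token accuracy is only a coarse measure. For tasks such as named-entity recognition, entity-level precision and recall are usually more informative.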
For an example of calling the model through the API, please refer to CrfDemo.java.
In an example that uses a crf4j model for named-entity recognition, we used JMeter to test 400 concurrent requests against the same HTTP interface. The results are:
#Samples | Average (ms) | Median (ms) | 90% Line (ms) | Min (ms) | Max (ms) | Throughput |
---|---|---|---|---|---|---|
4000 | 41 | 4 | 60 | 0 | 746 | 1250/sec |
The test environment is:
OS | CPU | MEM |
---|---|---|
Windows 7 x64 | Intel Core i5-4200U @ 1.60GHz | 8GB |
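For readers who want to reproduce such a summary from their own measurements, here is a minimal sketch of how average, median and 90% figures can be derived from raw latencies. It uses the nearest-rank percentile, which is one common definition; JMeter's own calculation may differ in detail. The class name and the sample values are made up and are not the data behind the table above.

```java
import java.util.Arrays;

public class LatencySummarySketch {

    // Nearest-rank percentile over a sorted, non-empty array: the smallest
    // sample such that at least p percent of all samples are <= it.
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double average(long[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }

    public static void main(String[] args) {
        long[] latenciesMs = {3, 4, 4, 5, 2, 60, 4, 3, 250, 4}; // made-up values
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        System.out.println("average = " + average(sorted));
        System.out.println("median  = " + percentile(sorted, 50));
        System.out.println("90%     = " + percentile(sorted, 90));
        System.out.println("min/max = " + sorted[0] + "/" + sorted[sorted.length - 1]);
    }
}
```

A few slow requests can pull the average far above the median, which is the same pattern the results table shows.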
The binary model generated by CrfLearn is incompatible with crfpp, but the text model is compatible. If you want to reuse a crfpp model with crf4j, generate a text model when you train with crfpp (add the -t option), and then run
java -cp crf4j.jar com.github.zhifac.crf4j.EncoderFeatureIndex <crfpp_text_model> <output_crf4j_binarymodel>
to convert the crfpp text model to a crf4j binary model. If you cannot retrain to obtain the text model (e.g. the training data is missing), you can still convert an existing crfpp binary model to a text model with the modified version of crfpp from here.
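Before converting, it can help to confirm that a file really is a crfpp text model and not a binary one. The sketch below reads the leading `key: value` lines up to the first empty line. It assumes the text model starts with a header of that shape containing fields such as `version`, `cost-factor`, `maxid` and `xsize`; that layout is our understanding of crfpp 0.58 output, not something this README specifies. `TextModelHeaderSketch` is a hypothetical helper, not part of crf4j.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class TextModelHeaderSketch {

    // Reads "key: value" lines until the first empty line or end of input.
    // Throws if a line in that region has no colon, which suggests the file
    // is not a text model (assumed header layout, see the note above).
    static Map<String, String> readHeader(Reader in) throws IOException {
        Map<String, String> header = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null && !line.trim().isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                throw new IOException("not a header line: " + line);
            }
            header.put(line.substring(0, colon).trim(),
                       line.substring(colon + 1).trim());
        }
        return header;
    }

    public static void main(String[] args) throws IOException {
        // With no argument, parse a made-up sample header.
        String sample = "version: 100\ncost-factor: 1\nmaxid: 42\nxsize: 2\n\nB\nO\n";
        try (Reader in = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]))
                : new StringReader(sample)) {
            System.out.println(readHeader(in));
        }
    }
}
```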
- Optimize memory usage during training (it currently consumes about 8GB of heap memory for 24224128 features, whereas crfpp uses 2GB)
LGPL & Modified BSD
Chinese version:
crf4j: a Java implementation of crfpp (crf++)
(based on crfpp 0.58)