A general data mining C++ library
-
Keyphrase Extraction. We've implemented two keyphrase extraction approaches. One follows the translation model from the thesis work of Zhiyuan Liu; the other is our own innovation, which uses Wiki data as the semantic knowledge base.
-
Taxonomy Generation.
-
Duplicate Detection. Read the paper "Detecting Near-Duplicates for Web Crawling" first to understand the algorithm. We use Charikar's simhash fingerprint generation approach and set the fingerprint dimension (f) to 64.
-
CTR Prediction. We've implemented both AdPredictor and FTRL.
-
Chinese Query Correction.
-
Collaborative Filtering. This is an item-based incremental collaborative filtering implementation.
-
Others.
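To make the Duplicate Detection item more concrete, here is a minimal sketch of Charikar-style simhash with f = 64. It is not the code from this library: `hashToken`, `simhash64` and `hammingDistance` are hypothetical names, FNV-1a is an assumed stand-in for whatever token hash the library actually uses, and the integer feature weights are an assumption as well.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: 64-bit FNV-1a hash of a token.
// Any well-mixed 64-bit hash could be used instead.
inline std::uint64_t hashToken(const std::string& token) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a offset basis
    for (unsigned char c : token) {
        h ^= c;
        h *= 0x100000001b3ULL;                // FNV-1a prime
    }
    return h;
}

// Simhash over (token, weight) features with f = 64.
// Each feature votes +weight on the bit positions set in its hash
// and -weight on the others; the sign of each tally gives one
// bit of the fingerprint.
inline std::uint64_t simhash64(
    const std::vector<std::pair<std::string, int>>& features) {
    long long tally[64] = {0};
    for (const auto& feature : features) {
        const std::uint64_t h = hashToken(feature.first);
        for (int bit = 0; bit < 64; ++bit) {
            if ((h >> bit) & 1ULL) {
                tally[bit] += feature.second;
            } else {
                tally[bit] -= feature.second;
            }
        }
    }
    std::uint64_t fingerprint = 0;
    for (int bit = 0; bit < 64; ++bit) {
        if (tally[bit] > 0) {
            fingerprint |= (1ULL << bit);
        }
    }
    return fingerprint;
}

// Number of bit positions in which two fingerprints differ.
inline int hammingDistance(std::uint64_t a, std::uint64_t b) {
    std::uint64_t x = a ^ b;
    int count = 0;
    while (x != 0) {
        x &= x - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}
```

Two documents would then be treated as near-duplicates when the Hamming distance between their fingerprints is at most some small threshold. The threshold is a tuning parameter and is not specified in this README.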
We recently switched SF1R to C++11, so GCC 4.8 is required to build it. We do not recommend building the project on Ubuntu because of the nested references among many of the libraries; CentOS / Redhat / Gentoo / CoreOS are the preferred platforms. You also need CMake and Boost 1.56 to build the repository. It depends on the following repositories:
-
cmake: The cmake modules required to build all iZENECloud C++ projects.
-
izenelib: The general purpose C++ libraries.
-
icma: The Chinese morphological analyzer library.
-
ijma: The Japanese morphological analyzer library.
-
ilplib: The language processing libraries.
The project is published under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0