HOWTO
- Quick Introduction to the Text Classification API
The following steps are typical when using the API:
- generate a raw file + lexicon
- generate a vector file
- evaluate the performance of the model
- generate a model
- apply the model to unclassified documents
The generation of the raw file is optional: you can generate the vector file directly. The advantage of using a raw file is that it makes it quicker to test different weighting schemes, as you can do so straight from the raw file instead of having to reparse the input data.
- Step by Step
Assuming that the input directory contains a number of XML documents, each with elements holding a category and the document text, such as:

```
category
this is the document I am using for training
```
Running:

```
java -cp textclassification-1.4.jar com.digitalpebble.classification.util.XMLCorpusReader input output
```

will create the corresponding raw file and lexicon in the output directory.
You can then generate a vector file with the command:

```
java -cp textclassification-1.4.jar com.digitalpebble.classification.util.CorpusUtils -generateVector output/raw output/lexicon output/params.ini
```
using a simple params file with content such as:

```
vector_location=vector
new_lexicon_file=lexicon.new
classification_weight_scheme=frequency
keepNBestAttributes=-1
classification_minFreq=1
```
This will generate a vector file at the location specified and a new (possibly modified) lexicon file. We’ll come back to the details of the params later.
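As noted earlier, having a raw file makes it cheap to try a different weighting scheme: simply rerun the -generateVector step with a modified params file. As a sketch, the params below change only the weighting scheme and the output location; the `tfidf` value is an assumed alternative to `frequency`, so check which schemes your version actually supports:

```
vector_location=vector.tfidf
new_lexicon_file=lexicon.new
classification_weight_scheme=tfidf
keepNBestAttributes=-1
classification_minFreq=1
```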
The vector files are in the libsvm format, simply because the libsvm engine is used by default. This means that we can use the standard libsvm commands to operate on the data, e.g. `svm-train vector model`.
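For reference, each line of a libsvm-format vector file encodes one document as a class label followed by sparse `index:value` pairs in ascending index order. The lines below are invented for illustration; with the `frequency` scheme the values would be term counts:

```
1 3:2 12:1 45:1
2 1:1 12:3 78:2
```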
Of course all the operations above can be done in a single step using the API directly.
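Putting it together, the command-line steps above can be chained in sequence. This is only a sketch, assuming the jar name and directory layout used in the examples and the libsvm tools on the PATH:

```
# 1. parse the XML corpus into a raw file + lexicon under output/
java -cp textclassification-1.4.jar com.digitalpebble.classification.util.XMLCorpusReader input output

# 2. turn the raw file into a libsvm vector file using output/params.ini
java -cp textclassification-1.4.jar com.digitalpebble.classification.util.CorpusUtils -generateVector output/raw output/lexicon output/params.ini

# 3. train a model with the standard libsvm tools
svm-train vector model
```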