Carmen-digitalPebble edited this page Jul 11, 2012 · 1 revision
  1. Quick Introduction to the Text Classification API

The following steps are typical of how the API is used:

  • generate a raw file + lexicon
  • generate a vector file
  • evaluate the performance of the model
  • generate a model
  • apply model to unclassified documents

The generation of the raw file is optional: you can generate the vector file directly. The advantage of using a raw file is that it makes testing different weighting schemes quicker, as you can do that straight from the raw file instead of having to reparse the input data.

  2. Step by Step

Assuming that the input directory contains a number of XML documents, each holding a category and the text of a document, for instance (the element names below are illustrative, as the original markup was lost in rendering):

  <document>
    <category>category</category>
    <text>this is the document I am using for training</text>
  </document>

java -cp textclassification-1.4.jar com.digitalpebble.classification.util.XMLCorpusReader input output

will create the corresponding raw file and lexicon in the output directory.
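To make the expected input concrete, here is a minimal sketch of reading the category and text out of one such corpus document with the standard JAXP DOM parser. The element names "category" and "text" are assumptions for illustration, not the exact schema required by XMLCorpusReader.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CorpusDocSketch {
    // Extract (category, text) from one corpus document. The element names
    // "category" and "text" are illustrative assumptions, not the exact
    // schema required by XMLCorpusReader.
    static String[] parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        String label = doc.getElementsByTagName("category").item(0).getTextContent();
        String text = doc.getElementsByTagName("text").item(0).getTextContent();
        return new String[] { label, text };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<document><category>category</category>"
                + "<text>this is the document I am using for training</text></document>";
        String[] lc = parse(xml);
        System.out.println(lc[0] + " -> " + lc[1]);
    }
}
```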

You can then generate a vector file with the command:

java -cp textclassification-1.4.jar com.digitalpebble.classification.util.CorpusUtils -generateVector output/raw output/lexicon output/params.ini

using a simple params file with content such as:

vector_location=vector
new_lexicon_file=lexicon.new
classification_weight_scheme=frequency
keepNBestAttributes=-1
classification_minFreq=1

This will generate a vector file at the location specified and a new (possibly modified) lexicon file. We’ll come back to the details of the params later.
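With classification_weight_scheme=frequency, each attribute is weighted by its raw term count in the document. As an illustrative sketch of that scheme (not the library's own code), building a lexicon on the fly and emitting a libsvm-style line:

```java
import java.util.*;

public class FrequencyVectorSketch {
    // Build a libsvm-style line "label index:count ..." from a token list,
    // assuming the "frequency" scheme simply counts term occurrences.
    static String toVectorLine(int label, List<String> tokens, Map<String, Integer> lexicon) {
        // attribute index -> frequency, kept in ascending index order as libsvm requires
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            // grow the lexicon on the fly; indices are 1-based
            Integer idx = lexicon.computeIfAbsent(t, k -> lexicon.size() + 1);
            counts.merge(idx, 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder(String.valueOf(label));
        for (Map.Entry<Integer, Integer> e : counts.entrySet())
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> lexicon = new LinkedHashMap<>();
        List<String> tokens = Arrays.asList("this", "is", "the", "document",
                "i", "am", "using", "for", "training", "the");
        // "the" occurs twice, so its attribute gets weight 2
        System.out.println(toVectorLine(1, tokens, lexicon));
    }
}
```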

The vector files are in the libsvm format, because the libsvm engine is used by default. This means that the standard libsvm commands can be used to operate on the data, e.g. svm-train vector model
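Each line of a libsvm-format file is a label followed by ascending index:value pairs. A minimal parser for such a line (an illustrative sketch, not the library's own reader):

```java
import java.util.*;

public class LibsvmLineSketch {
    // Parse one libsvm-format line, e.g. "1 3:0.5 7:2.0", into a label
    // and a sparse map of attribute index -> value.
    static Map.Entry<Integer, SortedMap<Integer, Double>> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        int label = Integer.parseInt(parts[0]);
        SortedMap<Integer, Double> features = new TreeMap<>();
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split(":");
            features.put(Integer.parseInt(kv[0]), Double.parseDouble(kv[1]));
        }
        return new AbstractMap.SimpleEntry<>(label, features);
    }

    public static void main(String[] args) {
        System.out.println(parse("1 3:0.5 7:2.0")); // prints: 1={3=0.5, 7=2.0}
    }
}
```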

Of course all the operations above can be done in a single step using the API directly.
