- Learning document representations using the subspace multinomial model. See paper
- This version of the code implements the same model, but with Adagrad optimization, which results in slightly faster convergence with relatively lower memory requirements.
- Requirements: `python3.6`, `pytorch`, `numpy`, `scipy`, `scikit-learn`
python TwentyNewsDataset.py
- This downloads the data from the web and converts it into a `scipy.sparse` matrix.
- Input data: `scipy.sparse` matrix of shape `n_words x n_docs`
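
As a rough illustration of the expected input layout, the sketch below builds a small word-by-document count matrix with `scipy.sparse`. The toy corpus, the whitespace tokenization, and the choice of CSC format and `float32` dtype are assumptions made for this example; the format actually written by `TwentyNewsDataset.py` is not stated here.

```python
import numpy as np
from scipy import sparse

# Toy corpus (hypothetical data, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cat and dog",
]

# Map each distinct word to a row index.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

rows, cols = [], []
for doc_ix, doc in enumerate(docs):
    for word in doc.split():
        rows.append(vocab[word])  # row = word index
        cols.append(doc_ix)       # column = document index

# Duplicate (row, col) entries are summed on conversion, giving word counts.
counts = sparse.coo_matrix(
    (np.ones(len(rows), dtype=np.float32), (rows, cols)),
    shape=(len(vocab), len(docs)),
).tocsc()

print(counts.shape)  # (n_words, n_docs)
```

Rows index words and columns index documents, so a matrix produced in the more common `n_docs x n_words` layout would need a transpose before use.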
- `python run_smm_20news.py train -o exp/ -trn 100 -lw 1e-04 -rt l1 -lt 1e-4 -k 100`
- The trained model is saved as `exp/lw_1e-04_l1_1e-04_100/model_T100.pt`
- `phase`: `train` or `extract`
- `-lw`: `l2` regularization const for i-vectors
- `-rt`: type of regularization for bases (`l1` or `l2`)
- `-lt`: regularization const for bases
- `-k`: i-vector dimension
- `-o`: path to output directory
- `-trn`: training iterations
- `--ovr`: over-write existing experiment directory
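
To make the relationship between the flags and the output location concrete, here is a minimal sketch of a parser that accepts the training flags listed above. It is not the parser used by `run_smm_20news.py`; `build_parser` and `expected_model_path` are hypothetical helpers, and the directory naming pattern is inferred from the model path used in the extraction command rather than confirmed by the code.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented training flags."""
    p = argparse.ArgumentParser(description="SMM flags (illustrative sketch)")
    p.add_argument("phase", choices=["train", "extract"])
    p.add_argument("-o", help="path to output directory")
    p.add_argument("-trn", type=int, help="training iterations")
    p.add_argument("-lw", type=float, help="l2 regularization const for i-vectors")
    p.add_argument("-rt", choices=["l1", "l2"], help="regularization type for bases")
    p.add_argument("-lt", type=float, help="regularization const for bases")
    p.add_argument("-k", type=int, help="i-vector dimension")
    p.add_argument("--ovr", action="store_true", help="over-write experiment dir")
    return p


def expected_model_path(args):
    """Guess the model path from the flags (naming pattern is an assumption)."""
    exp_dir = f"lw_{args.lw:.0e}_{args.rt}_{args.lt:.0e}_{args.k}"
    return f"{args.o.rstrip('/')}/{exp_dir}/model_T{args.trn}.pt"


argv = "train -o exp/ -trn 100 -lw 1e-04 -rt l1 -lt 1e-4 -k 100".split()
args = build_parser().parse_args(argv)
print(expected_model_path(args))
```

Under this assumed pattern, changing any of `-lw`, `-rt`, `-lt` or `-k` would place the run in a different experiment directory, which is presumably why `--ovr` is needed to reuse an existing one.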
- `python run_smm_20news.py extract -m exp/lw_1e-04_l1_1e-04_100/model_T100.pt -xtr 30 --nth 2`
- The document i-vectors are saved in `exp/lw_1e-04_l1_1e-04_100/ivecs/`
- `-xtr`: extraction iterations
- `--nth`: save every `n`-th i-vector during extraction
python train_and_clf.py exp/lw_1e-04_l1_1e-04_100/train_model_T100_e30.npy
- Test data and labels are automatically read.
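
The classifier used inside `train_and_clf.py` is not described here. As a sketch of the general idea only, the example below trains a scikit-learn logistic regression on stand-in i-vectors and scores it on held-out vectors. The synthetic data, the `n_docs x k` array orientation, and the choice of `LogisticRegression` are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for extracted i-vectors: two well-separated classes in k dimensions.
# Real i-vectors would be loaded with np.load(<path to .npy file>) instead.
k, n_per_class = 100, 50


def make_split():
    """Return (features, labels) with rows as documents (assumed orientation)."""
    x = np.vstack([
        rng.normal(-1.0, 1.0, (n_per_class, k)),
        rng.normal(+1.0, 1.0, (n_per_class, k)),
    ])
    y = np.repeat([0, 1], n_per_class)
    return x, y


train_x, train_y = make_split()
test_x, test_y = make_split()

# Fit on the training i-vectors, then report accuracy on the test i-vectors.
clf = LogisticRegression(max_iter=1000).fit(train_x, train_y)
accuracy = clf.score(test_x, test_y)
print(f"test accuracy: {accuracy:.3f}")
```

If the saved array turns out to be stored as `k x n_docs`, it would need to be transposed before being passed to `fit`.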
- To select a GPU, prefix the command with `CUDA_VISIBLE_DEVICES=<device_id>`, followed by `python run_smm_20news.py`
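
When launching runs from another Python script rather than a shell, the same effect can be obtained by setting the variable in the child process environment. The sketch below only prepares the environment and argument list; `gpu_command` is a hypothetical helper, and device id `0` is an arbitrary example value.

```python
import os
import shlex


def gpu_command(device_id, command):
    """Return (env, argv) for running `command` with one visible CUDA device."""
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["CUDA_VISIBLE_DEVICES"] = str(device_id)
    return env, shlex.split(command)


env, argv = gpu_command(
    0,
    "python run_smm_20news.py train -o exp/ -trn 100 -lw 1e-04 -rt l1 -lt 1e-4 -k 100",
)

# To actually launch the run: subprocess.run(argv, env=env, check=True)
print(env["CUDA_VISIBLE_DEVICES"], argv[:3])
```

The variable has to be in place before the process initializes CUDA, which is why it is passed through the environment rather than set after start-up.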