
Word2Vec in Torch, based on Yoon Kim's code: https://github.com/yoonkim/word2vec_torch
Junwei Pan
pandevirus@gmail.com

Includes both the skip-gram and continuous bag-of-words architectures; see https://code.google.com/p/word2vec/ for more details.
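
To make the skip-gram mode concrete, here is a minimal sketch of how skip-gram with negative sampling is typically scored in Torch: the target word's embedding is dotted against one positive and several negative context embeddings, and each dot product goes through a sigmoid trained with binary cross-entropy. This is illustrative, not the exact module layout of main.lua; vocab_size, dim, and the example indices are made up.

require 'nn'

-- Illustrative sizes; the real values come from -dim and the corpus vocabulary.
local vocab_size, dim = 10000, 100

local word_vecs    = nn.LookupTable(vocab_size, dim) -- target-word embeddings
local context_vecs = nn.LookupTable(vocab_size, dim) -- context-word embeddings

-- Score one target word against (1 + neg_samples) context words:
-- sigmoid(v_target . v_context) for each pair.
local model = nn.Sequential()
local lookups = nn.ParallelTable()
lookups:add(word_vecs)     -- input 1: target word index (LongTensor of size 1)
lookups:add(context_vecs)  -- input 2: context indices (LongTensor of size 1 + k)
model:add(lookups)
model:add(nn.MM(false, true)) -- (1 x dim) * ((1+k) x dim)^T -> 1 x (1+k) scores
model:add(nn.Sigmoid())

local criterion = nn.BCECriterion()

-- One hypothetical training pair: word 42 with 1 true context and 2 negatives.
local target   = torch.LongTensor{42}
local contexts = torch.LongTensor{7, 913, 2048}
local labels   = torch.Tensor{{1, 0, 0}}     -- 1 = real pair, 0 = negative
local probs = model:forward{target, contexts}
local loss  = criterion:forward(probs, labels)

CBOW differs only in that the context embeddings are averaged into a single vector, which is then scored against the target word the same way.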

Note: This is considerably slower than the word2vec toolkit and gensim implementations.

The input file is a plain-text file in which each line is one sentence (see corpus.txt for an example).
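
For instance, a corpus file could look like this (hypothetical lines, not the contents of the bundled corpus.txt):

the quick brown fox jumps over the lazy dog
anarchism originated as a term of abuse first used against early working class radicals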

Arguments are mostly self-explanatory (see main.lua for the default values):

-corpus : text file containing the corpus
-window : maximum window size
-dim : dimensionality of the word embeddings
-alpha : exponent used to smooth the unigram distribution for negative sampling (see the sketch after this list)
-table_size : unigram table size; if you have plenty of RAM, bring this up to 10^8
-neg_samples : number of negative samples for each valid word-context pair
-minfreq : minimum frequency for a word to be included in the vocabulary
-lr : starting learning rate
-min_lr : minimum learning rate; the learning rate decays linearly from -lr to this value
-epochs : number of epochs to run
-stream : whether to stream text data from disk or store it all in memory (1 = stream, 0 = in memory)
-gpu : whether to use the GPU (1 = use GPU, 0 = CPU only)
-mode : which architecture to use (sg = skip-gram, cw = continuous bag-of-words)
-suffix : suffix of the stored model file
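
The -alpha and -table_size options control the negative-sampling table. A minimal sketch of the standard word2vec construction (not necessarily the exact code in main.lua; counts, alpha, and table_size below are illustrative): counts are raised to the power alpha and the table is filled proportionally, so a negative sample can be drawn in O(1) by uniform indexing.

require 'torch'

-- counts: array of word frequencies indexed by word id
local function build_unigram_table(counts, alpha, table_size)
  local total = 0
  for _, c in ipairs(counts) do total = total + c^alpha end
  local t = torch.LongTensor(table_size)
  local word = 1
  local cum = counts[1]^alpha / total      -- cumulative smoothed probability
  for i = 1, table_size do
    t[i] = word
    if i / table_size > cum and word < #counts then
      word = word + 1
      cum = cum + counts[word]^alpha / total
    end
  end
  return t
end

-- Hypothetical usage: frequent words occupy more slots, so a uniform
-- index into the table samples them more often.
local tbl = build_unigram_table({50, 20, 10, 5}, 0.75, 10^6)
local neg = tbl[torch.random(10^6)]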

For example:

Train

CPU:
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 1 -gpu 0 -mode sg

GPU:
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 0 -gpu 1 -mode cw

Evaluation

th eval.th -model $path_model
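
For context, evaluation of word2vec-style models usually means loading the saved embeddings and ranking words by cosine similarity. A hypothetical sketch of that step (the saved object's layout, such as a word_vecs LookupTable field, is an assumption here, not the repo's documented format):

require 'torch'

local m = torch.load('model.t7')            -- hypothetical path; pass your saved model
local W = m.word_vecs.weight                -- assumed: vocab_size x dim embedding matrix
local Wn = torch.cdiv(W, W:norm(2, 2):expandAs(W))  -- row-normalize

-- Return the k nearest neighbors of the word with index idx.
local function neighbors(idx, k)
  local sims = Wn * Wn[idx]                 -- cosine similarity to every word
  local _, order = sims:sort(1, true)       -- sort descending
  return order:narrow(1, 2, k)              -- top k, skipping the word itself
end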

