
Word2Vec in Torch, based on Yoon Kim's code: https://github.com/yoonkim/word2vec_torch
Junwei Pan
pandevirus@gmail.com

Includes both the skip-gram and continuous bag-of-words architectures; see https://code.google.com/p/word2vec/ for more details.
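
To make the skip-gram mode concrete, here is a minimal sketch of how skip-gram with negative sampling is typically scored in Torch: the target word's embedding is dotted against one positive and several negative context embeddings, and each dot product goes through a sigmoid trained with binary cross-entropy. This is illustrative, not the exact module layout of main.lua; vocab_size, dim, and the example indices are made up.

require 'nn'

-- Illustrative sizes; the real values come from -dim and the corpus vocabulary.
local vocab_size, dim = 10000, 100

local word_vecs    = nn.LookupTable(vocab_size, dim) -- target-word embeddings
local context_vecs = nn.LookupTable(vocab_size, dim) -- context-word embeddings

-- Score one target word against (1 + neg_samples) context words:
-- sigmoid(v_target . v_context) for each pair.
local model = nn.Sequential()
local lookups = nn.ParallelTable()
lookups:add(word_vecs)     -- input 1: target word index (LongTensor of size 1)
lookups:add(context_vecs)  -- input 2: context indices (LongTensor of size 1 + k)
model:add(lookups)
model:add(nn.MM(false, true)) -- (1 x dim) * ((1+k) x dim)^T -> 1 x (1+k) scores
model:add(nn.Sigmoid())

local criterion = nn.BCECriterion()

-- One hypothetical training pair: word 42 with 1 true context and 2 negatives.
local target   = torch.LongTensor{42}
local contexts = torch.LongTensor{7, 913, 2048}
local labels   = torch.Tensor{{1, 0, 0}}     -- 1 = real pair, 0 = negative
local probs = model:forward{target, contexts}
local loss  = criterion:forward(probs, labels)

CBOW differs only in that the context embeddings are averaged into a single vector, which is then scored against the target word the same way.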

Note: This is considerably slower than the word2vec toolkit and gensim implementations.

The input file is a plain-text file in which each line is one sentence (see corpus.txt for an example).
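
For instance, a corpus file could look like this (hypothetical lines, not the contents of the bundled corpus.txt):

the quick brown fox jumps over the lazy dog
anarchism originated as a term of abuse first used against early working class radicals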

Arguments are mostly self-explanatory (see main.lua for the default values):

-corpus : text file containing the corpus
-window : maximum window size
-dim : dimensionality of the word embeddings
-alpha : exponent used to smooth the unigram distribution for negative sampling (see the sketch after this list)
-table_size : unigram table size; if you have plenty of RAM, bring this up to 10^8
-neg_samples : number of negative samples for each valid word-context pair
-minfreq : minimum frequency for a word to be included in the vocabulary
-lr : starting learning rate
-min_lr : minimum learning rate; the learning rate decays linearly from -lr to this value
-epochs : number of epochs to run
-stream : whether to stream text data from disk or store it all in memory (1 = stream, 0 = in memory)
-gpu : whether to use the GPU (1 = use GPU, 0 = CPU only)
-mode : which architecture to use (sg = skip-gram, cw = continuous bag-of-words)
-suffix : suffix of the stored model file
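
The -alpha and -table_size options control the negative-sampling table. A minimal sketch of the standard word2vec construction (not necessarily the exact code in main.lua; counts, alpha, and table_size below are illustrative): counts are raised to the power alpha and the table is filled proportionally, so a negative sample can be drawn in O(1) by uniform indexing.

require 'torch'

-- counts: array of word frequencies indexed by word id
local function build_unigram_table(counts, alpha, table_size)
  local total = 0
  for _, c in ipairs(counts) do total = total + c^alpha end
  local t = torch.LongTensor(table_size)
  local word = 1
  local cum = counts[1]^alpha / total      -- cumulative smoothed probability
  for i = 1, table_size do
    t[i] = word
    if i / table_size > cum and word < #counts then
      word = word + 1
      cum = cum + counts[word]^alpha / total
    end
  end
  return t
end

-- Hypothetical usage: frequent words occupy more slots, so a uniform
-- index into the table samples them more often.
local tbl = build_unigram_table({50, 20, 10, 5}, 0.75, 10^6)
local neg = tbl[torch.random(10^6)]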

For example:

Train

CPU:
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 1 -gpu 0 -mode sg

GPU:
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 0 -gpu 1 -mode cw

Evaluation

th eval.th -model $path_model
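
For context, evaluation of word2vec-style models usually means loading the saved embeddings and ranking words by cosine similarity. A hypothetical sketch of that step (the saved object's layout, such as a word_vecs LookupTable field, is an assumption here, not the repo's documented format):

require 'torch'

local m = torch.load('model.t7')            -- hypothetical path; pass your saved model
local W = m.word_vecs.weight                -- assumed: vocab_size x dim embedding matrix
local Wn = torch.cdiv(W, W:norm(2, 2):expandAs(W))  -- row-normalize

-- Return the k nearest neighbors of the word with index idx.
local function neighbors(idx, k)
  local sims = Wn * Wn[idx]                 -- cosine similarity to every word
  local _, order = sims:sort(1, true)       -- sort descending
  return order:narrow(1, 2, k)              -- top k, skipping the word itself
end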

