
University of Thessaly

# word2vec-thesis

Thesis as published in UTH Institutional Repository

Thesis · Thesis Presentation

Developed as part of the thesis "Big Data Analytics using Machine Learning Algorithms" - A Word2vec comparative study of CBOW and Skipgram.

Dataset · Releases · Pre-trained models

## Run

```bash
# Download the latest available release
wget https://github.com/estamos/word2vec-thesis/releases/download/final/word2vec-thesis-final.tar.gz
tar -xvf word2vec-thesis-final.tar.gz
cd word2vec-thesis-final
cp test/test.py models
cd models
python test.py
```
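The shipped `test/test.py` is the authoritative check; for reference, a minimal sketch of a comparable smoke test, assuming the gensim model filenames from the tree below (the query word is a hypothetical example):

```python
from gensim.models import Word2Vec

# Load both released models; the .npy sidecar files must sit in the same
# directory as the .model files for gensim to locate the vectors.
cbow = Word2Vec.load("word2vec-cbow-trained.model")
skipgram = Word2Vec.load("word2vec-skipgram-trained.model")

# Query both architectures with the same word and compare the neighbours.
for name, model in (("CBOW", cbow), ("Skipgram", skipgram)):
    print(name, model.wv.most_similar("computer", topn=5))
```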

## Word2vec - CBOW & Skipgram Comparative Tool

## Word2vec Architectures Performance Comparison Graphs

*Figure: Effective words per epoch*

*Figure: Training time per epoch*

## Word2vec Parameterization

| Gensim parameter | TensorFlow parameter | Type | Details |
| --- | --- | --- | --- |
| alpha | learning_rate | float | The initial learning rate |
| cbow_mean | - | boolean | 0: use the sum of the context word vectors; 1: use the mean (only applies when CBOW is used) |
| epochs | epochs | int | Number of iterations (epochs) over the corpus |
| hs | - | boolean | 1: hierarchical softmax will be used for model training; 0: if negative is non-zero, negative sampling will be used |
| min_count | min_count | int | Ignores all words with total frequency lower than this |
| negative | num_neg_samples | int | How many "noise words" should be drawn |
| sample | subsample | float | The threshold for configuring which higher-frequency words are randomly downsampled |
| sg | - | boolean | 0: CBOW; 1: Skipgram |
| vector_size | embedding_dim | int | Dimensionality of the word vectors |
| window | window_size | int | Maximum distance between the current and predicted word within a sentence |
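As an illustration of how the gensim-side names in this table fit together, a minimal sketch of a `Word2Vec` construction; the values below are placeholders chosen for the example, not the thesis settings:

```python
from gensim.models import Word2Vec

# Placeholder values purely to illustrate the parameter names above.
model = Word2Vec(
    corpus_file="dataset/wiki_en_corpus.txt",
    alpha=0.025,        # initial learning rate (TensorFlow: learning_rate)
    cbow_mean=1,        # 1: mean of context vectors, 0: sum (CBOW only)
    epochs=5,           # iterations over the corpus
    hs=0,               # with negative > 0, negative sampling is used
    min_count=5,        # ignore words with total frequency below this
    negative=5,         # number of "noise words" (TensorFlow: num_neg_samples)
    sample=1e-3,        # downsampling threshold for frequent words
    sg=0,               # 0: CBOW, 1: Skipgram
    vector_size=100,    # embedding dimensionality (TensorFlow: embedding_dim)
    window=5,           # context window (TensorFlow: window_size)
)
```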

## Statistics

### Trained with parameters

| Gensim parameter | Value |
| --- | --- |
| window | 10 |
| min_count | 2 |
| workers | 10 |
| total_examples | len(documents) |
| epochs | 10 |
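A minimal sketch of a training run with exactly these values, assuming `documents` is a list of tokenized sentences from the corpus; the authoritative scripts are `train/cbow/cbow.py` and `train/skipgram/skipgram.py`:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Assumed corpus loading; materialized as a list so len(documents) is defined.
documents = list(LineSentence("dataset/wiki_en_corpus.txt"))

model = Word2Vec(sg=0, window=10, min_count=2, workers=10)  # sg=1 for Skipgram
model.build_vocab(documents)
model.train(documents, total_examples=len(documents), epochs=10)
model.save("models/word2vec-cbow-trained.model")
```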

### Total training time (seconds)

| CBOW | Skipgram |
| --- | --- |
| 956.5 | 3768.5 |

### Total effective words

| CBOW | Skipgram |
| --- | --- |
| 1327456338 | 1327454735 |
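From the two totals above, throughput and relative cost follow directly (assuming the reported times are in seconds, as gensim logs them):

```python
# Totals taken from the tables above.
cbow_words, cbow_time = 1_327_456_338, 956.5
sg_words, sg_time = 1_327_454_735, 3768.5

print(f"CBOW:     {cbow_words / cbow_time:,.0f} effective words/s")  # ~1,387,827
print(f"Skipgram: {sg_words / sg_time:,.0f} effective words/s")      # ~352,250
print(f"Skipgram / CBOW time ratio: {sg_time / cbow_time:.2f}x")     # ~3.94x
```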

### Training time per epoch (seconds)

| Epoch | CBOW | Skipgram |
| --- | --- | --- |
| Average | 95.65 | 376.85 |
| 1 | 95.9 | 338.3 |
| 2 | 95.3 | 340.0 |
| 3 | 96.7 | 339.9 |
| 4 | 96.1 | 448.0 |
| 5 | 95.4 | 339.3 |
| 6 | 95.3 | 339.8 |
| 7 | 95.6 | 339.9 |
| 8 | 95.3 | 599.3 |
| 9 | 95.3 | 342.8 |
| 10 | 95.6 | 341.2 |

### Effective words per epoch

| Epoch | CBOW | Skipgram |
| --- | --- | --- |
| Average | 132745634 | 132745474 |
| 1 | 132750757 | 132744876 |
| 2 | 132744712 | 132741580 |
| 3 | 132743879 | 132750658 |
| 4 | 132748376 | 132743435 |
| 5 | 132747942 | 132749631 |
| 6 | 132746112 | 132744974 |
| 7 | 132744511 | 132745877 |
| 8 | 132742194 | 132744706 |
| 9 | 132740767 | 132745693 |
| 10 | 132747088 | 132743305 |

## Tree

```
.
β”œβ”€β”€ LICENSE
β”œβ”€β”€ README.md
β”œβ”€β”€ dataset
│   └── wiki_en_corpus.txt
β”œβ”€β”€ logs
│   β”œβ”€β”€ cbow-log.rtf
│   └── skipgram-log.rtf
β”œβ”€β”€ models
│   β”œβ”€β”€ word2vec-cbow-trained.model
│   β”œβ”€β”€ word2vec-cbow-trained.model.syn1neg.npy
│   β”œβ”€β”€ word2vec-cbow-trained.model.wv.vectors.npy
│   β”œβ”€β”€ word2vec-skipgram-trained.model
│   β”œβ”€β”€ word2vec-skipgram-trained.model.syn1neg.npy
│   └── word2vec-skipgram-trained.model.wv.vectors.npy
β”œβ”€β”€ test
│   └── test.py
└── train
    β”œβ”€β”€ cbow
    │   └── cbow.py
    └── skipgram
        └── skipgram.py
```