Simple web service providing a word embedding API. The methods are based on the Gensim Word2Vec implementation. The model is passed as a command-line parameter and must be in the Word2Vec text or binary format.
- Installing dependencies
pip2 install -r requirements.txt
- Launching the service
python word2vec-api.py --model path/to/the/model [--host host --port 1234]
or
python word2vec-api.py --model /path/to/GoogleNews-vectors-negative300.bin --binary BINARY --path /word2vec --host 0.0.0.0 --port 5000
- Example calls
curl "http://127.0.0.1:5000/word2vec/n_similarity?ws1=Sushi&ws1=Shop&ws2=Japanese&ws2=Restaurant"
curl "http://127.0.0.1:5000/word2vec/similarity?w1=Sushi&w2=Japanese"
curl "http://127.0.0.1:5000/word2vec/most_similar?positive=indian&positive=food[&negative=][&topn=]"
curl "http://127.0.0.1:5000/word2vec/model?word=restaurant"
curl "http://127.0.0.1:5000/word2vec/model_word_set"
Note: Square brackets mark optional parameters and are not part of the URL. The "model" method returns a base64 encoding of the word's vector. "model_word_set" returns a base64-encoded pickle of the model's vocabulary.
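The same calls can be prepared from Python. The sketch below is a minimal client-side helper, not part of the service: `build_url` and `decode_vector` are hypothetical names, the base URL is copied from the example calls, and the decoder assumes the vector bytes are little-endian float32 values (an assumption; check the dtype of your model). It also tolerates a payload wrapped in JSON string quotes, in case the service returns it that way.

```python
import base64
import struct
from urllib.parse import urlencode

# Base URL copied from the example calls; adjust host, port, and path to your setup.
BASE_URL = "http://127.0.0.1:5000/word2vec"


def build_url(method, **params):
    """Build a request URL; list values become repeated query parameters."""
    # doseq=True turns {"ws1": ["Sushi", "Shop"]} into "ws1=Sushi&ws1=Shop".
    query = urlencode(params, doseq=True)
    return f"{BASE_URL}/{method}?{query}" if query else f"{BASE_URL}/{method}"


def decode_vector(payload):
    """Decode a base64 payload into a list of floats.

    Assumption: the decoded bytes are little-endian float32 values.
    """
    # Strip surrounding whitespace and optional JSON string quotes.
    text = payload.strip().strip('"')
    raw = base64.b64decode(text)
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", raw[:count * 4]))


if __name__ == "__main__":
    print(build_url("n_similarity", ws1=["Sushi", "Shop"], ws2=["Japanese", "Restaurant"]))
    print(build_url("most_similar", positive=["indian", "food"], topn=5))
```

Fetch the resulting URL with any HTTP client and pass the body of a "model" response to `decode_vector`. The "model_word_set" payload is a pickle, so unpickle it only when the service is one you trust.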
If you have no domain-specific data to train on, a pretrained model can be a convenient alternative. Additions to this list are welcome via pull request.
Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture | Training Algorithm | Context window - size | Web page |
---|---|---|---|---|---|---|---|---|
Google News | 300 | Google News (100B) | 3M | Google | word2vec | negative sampling | BoW - ~5 | link |
Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW - ~10 | link |
Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW - ~10 | link |
Wikipedia+Gigaword 5 | 50 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
Wikipedia+Gigaword 5 | 100 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
Wikipedia+Gigaword 5 | 200 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
Wikipedia+Gigaword 5 | 300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
Common Crawl 42B | 300 | Common Crawl (42B) | 1.9M | GloVe | GloVe | AdaGrad | ? | link |
Common Crawl 840B | 300 | Common Crawl (840B) | 2.2M | GloVe | GloVe | AdaGrad | ? | link |
Twitter (2B Tweets) | 25 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
Twitter (2B Tweets) | 50 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
Twitter (2B Tweets) | 100 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
Twitter (2B Tweets) | 200 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified | word2vec | syntactic dependencies | link |
DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec | word2vec, skip-gram | BoW, 10 | link |
60 Wikipedia embeddings with 4 kinds of context | 25,50,100,250,500 | Wikipedia | varies | Li, Liu et al. | Skip-Gram, CBOW, GloVe | original and modified | 2 | link |
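
The GloVe files listed above are distributed as plain text without the `<vocabulary size> <dimensions>` header line that the Word2Vec text format starts with, so they usually need converting before the service can load them. A minimal standard-library sketch follows; `glove_to_word2vec_text` is a hypothetical helper name, and it assumes one token per line followed by space-separated values, with no spaces inside tokens.

```python
def glove_to_word2vec_text(src_path, dst_path, encoding="utf-8"):
    """Copy a GloVe text file to dst_path, prepending a Word2Vec-style header.

    Returns (vocab_size, dimensions).
    """
    vocab_size = 0
    dimensions = 0
    # First pass: count vectors and read the dimensionality from the first line.
    with open(src_path, encoding=encoding) as src:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            if vocab_size == 0:
                # The first field is the token; the rest are vector components.
                dimensions = len(line.rstrip("\n").split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        dst.write(f"{vocab_size} {dimensions}\n")
        for line in src:
            if line.strip():
                dst.write(line if line.endswith("\n") else line + "\n")
    return vocab_size, dimensions
```

The converted file is a Word2Vec text file and can be passed to `--model`. Gensim also provides its own converter, `gensim.scripts.glove2word2vec`, in the versions that include it.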