Russian word embedding models from RusVectores project #3
Comments
OK, let's try :-) |
@akutuzov No updates, only adding a new model: that is the best scheme for supporting backward compatibility :) Thanks for the detailed info. Only one thing: as I remember, RusVectores used |
Well, it can be any tagger supporting Russian and Universal Tags, do we really need to clutter the issue with the preprocessing details? |
@akutuzov This would be very desirable, because the process is not obvious (as things stand, it is impossible to apply this model without pre-processing). Your code example will be linked to this model and will make life simpler for users :) |
OK. It will look somewhat like this with UDPipe. Models for various languages can be downloaded here.
This produces Universal PoS tags straight away.
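A rough sketch of what such a tagging step could look like, assuming the `ufal.udpipe` Python bindings and a Russian UDPipe model file downloaded separately. The model file name is a placeholder, and the `lemma_UPOS` token format, the lowercasing, and the dropping of punctuation are assumptions based on the preprocessing described in this issue, not a confirmed recipe.

```python
def conllu_to_tagged_lemmas(conllu_text):
    """Turn CoNLL-U text into a list of 'lemma_UPOS' tokens."""
    tagged = []
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            continue
        tok_id, lemma, upos = cols[0], cols[2], cols[3]
        # Skip multiword ranges ("1-2") and empty nodes ("1.1").
        if not tok_id.isdigit():
            continue
        # Assumption: punctuation is not useful as a query token.
        if upos == "PUNCT":
            continue
        # Assumption: vocabulary entries are lowercased lemmas.
        tagged.append(f"{lemma.lower()}_{upos}")
    return tagged


def tag_with_udpipe(text, model_path="russian.udpipe"):
    """Tokenize, tag and lemmatize raw text with UDPipe (placeholder path)."""
    # Imported here so the parser above works without UDPipe installed.
    from ufal.udpipe import Model, Pipeline

    model = Model.load(model_path)
    if model is None:
        raise IOError(f"Cannot load UDPipe model from {model_path}")
    pipeline = Pipeline(
        model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu"
    )
    return conllu_to_tagged_lemmas(pipeline.process(text))
```

The CoNLL-U parsing is kept separate from the UDPipe call so that the output of any other tagger producing CoNLL-U with Universal PoS tags can be fed through the same function.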
With Mystem output, one will have to convert RNC tags to UPOS, using this conversion table. |
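For the Mystem route, the conversion could be sketched as a simple lookup. The mapping below is only an illustrative subset written from memory, not the conversion table itself, so check it against that table before relying on it. The helper names are hypothetical, and the sketch assumes Mystem's grammar string puts the part of speech first, before any `,` or `=`.

```python
# Illustrative subset of an RNC -> Universal PoS mapping (assumption:
# verify every entry against the actual conversion table).
RNC_TO_UPOS = {
    "S": "NOUN",
    "V": "VERB",
    "A": "ADJ",
    "ADV": "ADV",
    "PR": "ADP",
    "NUM": "NUM",
    "PART": "PART",
    "INTJ": "INTJ",
    "SPRO": "PRON",
}


def rnc_to_upos(rnc_tag, default="X"):
    """Map a single RNC tag to UPOS, falling back to 'X' when unknown."""
    return RNC_TO_UPOS.get(rnc_tag, default)


def mystem_gr_to_upos(gr):
    """Extract the PoS from a Mystem grammar string and convert it."""
    # The PoS is the part before the first ',' or '='.
    pos = gr.split("=")[0].split(",")[0]
    return rnc_to_upos(pos)
```

Tags missing from the subset fall back to `X`, which makes gaps in the mapping easy to spot in the output.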
Thanks @akutuzov, sorry for the wait. This repo is now released, and the model can be loaded with:
import gensim.downloader as api
model = api.load("word2vec-ruscorpora-300") |
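Since the corpus was lemmatized and PoS-tagged, queries against the loaded model presumably need the same shape. A minimal sketch, assuming the vocabulary keys look like `lemma_UPOS` (not confirmed in this thread) and that the loaded object supports `in` and `most_similar` as gensim's `KeyedVectors` does. The helper names are hypothetical.

```python
def make_key(lemma, upos):
    """Build a vocabulary key in the assumed 'lemma_UPOS' format."""
    return f"{lemma}_{upos}"


def similar_words(model, lemma, upos, topn=5):
    """Return nearest neighbours, or an empty list if the key is unknown."""
    key = make_key(lemma, upos)
    if key not in model:
        return []
    return model.most_similar(key, topn=topn)


# Possible usage (downloads the model on first call):
#   import gensim.downloader as api
#   model = api.load("word2vec-ruscorpora-300")
#   similar_words(model, "дом", "NOUN")
```

Returning an empty list for unknown keys avoids the `KeyError` that a direct `most_similar` call raises when a word is queried without its tag.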
Thanks @menshikh-iv! One small fix: in the table, I see "License not found" for this model. However, we do have a license: Creative Commons Attribution 4.0 International :-). |
Sorry, I can't download the file. Could you fix the download link above? |
Name: word2vec-ruscorpora-300
Link: http://rusvectores.org/static/models/ruscorpora_1_300_10.bin.gz
Description: Word2vec Continuous Skipgram vectors trained on the full Russian National Corpus (about 250M words). The model contains 185K words.
Related papers: https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
Preprocessing: The corpus was lemmatized and tagged with Universal PoS.
Parameters: vector size 300, window size 10
Code example: