Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Respect encoding when reading binary keyed vectors #3309

Conversation

alhoo
Copy link
Contributor

@alhoo alhoo commented Mar 20, 2022

Current implementation fails to read keyed vectors that have iso-8859-1
encoding in the words when encoded in binary format. An example of this
type of a file can be seen in the turkuNLP finnish embeddings:

http://dl.turkunlp.org/finnish-embeddings/finnish_s24_skgram.bin

This file is quite trivial to load by passing the encoding to the vector
loading function. It is also logical that when user asks

KeyedVectors.load_word2vec_format(filename, binary=True, encoding='iso-8859-1')

The library would try to load the file assuming that the matrix is in
binary format and the words are encoded using iso-8859-1 encoding.

Current implementation fails to read keyed vectors that have iso-8859-1
encoding in the words when encoded in binary format. An example of this
type of a file can be seen in the turkuNLP finnish embeddings:

http://dl.turkunlp.org/finnish-embeddings/finnish_s24_skgram.bin

This file is quite trivial to load by passing the encoding to the vector
loading function. It is also logical that when user asks

KeyedVectors.load_word2vec_format(filename, binary=True, encoding='iso-8859-1')

The library would try to load the file assuming that the matrix is in
binary format and the words are encoded using iso-8859-1 encoding.
@piskvorky
Copy link
Owner

piskvorky commented Mar 20, 2022

Makes sense, thanks.

If this works and all tests pass, let's include this PR in the upcoming release @mpenkov .

@piskvorky piskvorky added this to the Next release milestone Mar 20, 2022
@piskvorky piskvorky self-assigned this Mar 20, 2022
@codecov
Copy link

codecov bot commented Mar 20, 2022

Codecov Report

Merging #3309 (c59a412) into develop (ac3bbcd) will decrease coverage by 1.89%.
The diff coverage is 100.00%.

❗ Current head c59a412 differs from pull request most recent head 17c6cf0. Consider uploading reports for the commit 17c6cf0 to get more accurate results

@@             Coverage Diff             @@
##           develop    #3309      +/-   ##
===========================================
- Coverage    81.43%   79.53%   -1.90%     
===========================================
  Files          122       68      -54     
  Lines        21052    11783    -9269     
===========================================
- Hits         17144     9372    -7772     
+ Misses        3908     2411    -1497     
Impacted Files Coverage Δ
gensim/models/keyedvectors.py 82.68% <100.00%> (+0.44%) ⬆️
gensim/scripts/glove2word2vec.py 76.19% <0.00%> (-7.15%) ⬇️
gensim/corpora/wikicorpus.py 93.75% <0.00%> (-1.04%) ⬇️
gensim/matutils.py 77.23% <0.00%> (-0.90%) ⬇️
gensim/similarities/docsim.py 23.95% <0.00%> (-0.76%) ⬇️
gensim/models/rpmodel.py 89.47% <0.00%> (-0.53%) ⬇️
gensim/models/ldamulticore.py 90.58% <0.00%> (-0.33%) ⬇️
gensim/utils.py 71.86% <0.00%> (-0.12%) ⬇️
gensim/corpora/dictionary.py 94.17% <0.00%> (-0.09%) ⬇️
gensim/models/hdpmodel.py 71.27% <0.00%> (-0.08%) ⬇️
... and 91 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de9ee81...17c6cf0. Read the comment docs.

@piskvorky
Copy link
Owner

One test failed, can you check please @alhoo?

@piskvorky piskvorky merged commit 91175dd into piskvorky:develop Apr 15, 2022
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants