Bug for subwords related to utf-8 #8
Comments
Hey linetor. Did you manage to fix that issue? I'm facing the same issue, but I don't know Korean, so I'm finding it difficult to debug. Please let me know the fix if possible. Thanks.
Hi, devanandj.
This library is buggy for UTF-8 strings: the port from C++ to Java does not account for the fact that the C++ type 'string' is byte-oriented, whereas the Java type 'String' is character-oriented, with characters encoded internally as UTF-16. Removing charMatches() does not solve the problem with UTF-8 strings, because the root cause is the faulty port itself; fixing it requires rewriting part of the code, starting from computeSubwords().
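To make the byte-versus-character point concrete, here is a minimal sketch of how character n-grams could be enumerated in Java by first encoding the word as UTF-8 and then walking the bytes the way byte-oriented C++ code would. This is not the library's code: the class name `Utf8Ngrams` and the methods `isContinuation` and `ngrams` are hypothetical, the skipping of single-character n-grams at the word edges is an assumption about the intended behaviour, and hashing of the n-grams is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of fastText4j.
class Utf8Ngrams {

    // True when b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Enumerates character n-grams with minn..maxn characters by walking
    // the UTF-8 bytes of the word, so multi-byte characters stay intact.
    static List<String> ngrams(String word, int minn, int maxn) {
        byte[] bytes = word.getBytes(StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < bytes.length; i++) {
            if (isContinuation(bytes[i])) {
                continue; // an n-gram may only start on a lead byte
            }
            int j = i;
            for (int n = 1; j < bytes.length && n <= maxn; n++) {
                j++; // consume the lead byte of the next character
                while (j < bytes.length && isContinuation(bytes[j])) {
                    j++; // consume its continuation bytes
                }
                // Assumption: single characters at either edge are skipped.
                boolean singleAtEdge = n == 1 && (i == 0 || j == bytes.length);
                if (n >= minn && !singleAtEdge) {
                    out.add(new String(bytes, i, j - i, StandardCharsets.UTF_8));
                }
            }
        }
        return out;
    }
}
```

Under these assumptions, `Utf8Ngrams.ngrams("<한글>", 2, 3)` yields the five n-grams `<한`, `<한글`, `한글`, `한글>` and `글>`. Working on the UTF-8 bytes keeps the continuation-byte test meaningful, which it is not when applied to Java chars.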
Hi, I'm a Korean developer and I have been using your library with good results.
But when I trained a fastText model with subword n-grams (minn 3 and maxn 6), the model's predictions differed between the original library (via Python) and your library (Java).
I debugged this and found that the cause is charMatches. I see that you rewrote the C++ code in Java.
In the original code ( https://github.com/facebookresearch/fastText/blob/0c6db7c2d6ba9e0ff81713ed9f50c3142e4ba700/src/dictionary.cc#L172-L195 ), a char obtained by indexing a string is a byte, so the code has to work out whether a character takes 3 bytes (like Korean or Japanese) or 1 byte (like digits or English letters).
But in Java a char is not a byte. In Java we can get a whole character (such as a Korean syllable) directly, so there is no need to inspect bytes, for example by masking with 0xC0 and comparing against 0x80, to find where a character starts. Applied to the chars of a word in Java, that check causes the bug.
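The difference can be checked with a small sketch. The class `CharVsByte` and its methods are hypothetical names used only for illustration, and the two Hangul syllables are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of fastText4j.
class CharVsByte {

    // Continuation-byte test applied to one UTF-8 byte, as byte-oriented code does.
    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // The same mask applied to a Java char (a UTF-16 code unit).
    // The result only reflects bits 6 and 7 of the code unit, not a byte boundary.
    static boolean maskOnChar(char c) {
        return (c & 0xC0) == 0x80;
    }

    // Counts characters by skipping UTF-8 continuation bytes.
    static int countUtf8Chars(String s) {
        int count = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (!isContinuationByte(b)) {
                count++;
            }
        }
        return count;
    }
}
```

For the two-syllable string `"\uAC80\uC0C9"`, `length()` is 2 while the UTF-8 encoding has 6 bytes, and `countUtf8Chars` recovers 2 characters from those bytes. By contrast, `maskOnChar('\uAC80')` is true and `maskOnChar('\uD55C')` is false even though both are complete characters, so the mask would wrongly treat some Hangul chars as continuation bytes.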
So I think you need to change your code, for example by removing the lines that use the charMatches function (at fastText4j/src/main/java/fasttext/BaseDictionary.java, lines 321 to 339 in c3eb898).
I hope you can take care of this. Tell me if you need anything.
Thanks