Bug for subwords related to utf-8 #8

Open
linetor opened this issue Oct 21, 2019 · 3 comments

@linetor

linetor commented Oct 21, 2019

Hi, I'm a Korean developer and I have been using your library with good results.
However, when I train a fastText model with subword n-grams (minn=3, maxn=6), the model's predictions differ between the original library (via Python) and your library (Java).
I debugged this and found that the cause is charMatches.
I see that you ported the C++ code to Java.
In the original code ( https://github.com/facebookresearch/fastText/blob/0c6db7c2d6ba9e0ff81713ed9f50c3142e4ba700/src/dictionary.cc#L172-L195 ), the string is indexed byte by byte, so the code has to work out whether a character takes 3 bytes (like Korean or Japanese) or 1 byte (like digits or English letters).
In Java, however, a char is not a byte. We can get a whole character (such as a Korean syllable) directly, so there is no need for a byte-level test like ( (c & 0xC0) == 0x80 ) to find character boundaries. Applied to a word, that test causes the bug.
So I think you need to change the code, for example by removing the lines that call charMatches in the function below (or maybe more?):

protected void computeSubwords(String word, List<Integer> ngrams, List<String> substrings) {
  for(int i = 0; i < word.length(); i++) {
    StringBuilder ngram = new StringBuilder();
    if (!charMatches(word.charAt(i))) {
      for (int j = i, n = 1; j < word.length() && n <= args.getMaxn(); n++) {
        ngram.append(word.charAt(j++));
        while (j < word.length() && charMatches(word.charAt(j))) {
          ngram.append(word.charAt(j++));
        }
        if (n >= args.getMinn() && !(n == 1 && (i == 0 || j == word.length()))) {
          UnsignedLong h = UnsignedLong.valueOf(hash(ngram.toString()));
          h = h.mod(UnsignedLong.valueOf(args.getBucketNumber()));
          ngrams.add(nWords + h.intValue());
          substrings.add(ngram.toString());
        }
      }
    }
  }
}

I hope you can look into this. Let me know if you need anything.
Thanks
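
To make the byte-versus-char mismatch described above concrete, here is a minimal sketch. It assumes the ported charMatches applies the C++ continuation-byte test directly to a Java char; that form is inferred from this thread, not copied from the library. The class name CharVsByte is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class CharVsByte {
    // ASSUMPTION: the Java port tests a UTF-16 char with the same mask the
    // C++ code uses on UTF-8 bytes. This is a guess based on the issue text.
    static boolean charMatches(char c) {
        return (c & 0xC0) == 0x80;
    }

    // The test as it is meant to be used: on a single UTF-8 byte.
    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        String first = "\uAC80";  // Hangul syllable U+AC80
        String second = "\uD55C"; // Hangul syllable U+D55C
        byte[] utf8 = first.getBytes(StandardCharsets.UTF_8);

        // One Java char, but three UTF-8 bytes.
        System.out.println(first.length()); // 1
        System.out.println(utf8.length);    // 3

        // On bytes, the test separates the lead byte from continuation bytes.
        System.out.println(isContinuationByte(utf8[0])); // false
        System.out.println(isContinuationByte(utf8[1])); // true

        // On chars, the result depends only on the low bits of the code unit,
        // so two ordinary syllables are classified differently.
        System.out.println(charMatches(first.charAt(0)));  // true
        System.out.println(charMatches(second.charAt(0))); // false
    }
}
```

Under that assumption, some characters would be treated as "continuations" of the previous one and others would not, which would change the n-grams compared with the C++ library.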

@devanandj

Hey linetor. Did you manage to fix that issue? I'm facing the same issue but I don't know Korean so I am finding it difficult to debug. Do let me know the fix if possible. Thanks.

@linetor
Author

linetor commented Oct 15, 2020

Hi, devanandj.
I changed the code as shown below, but I can't guarantee that it works correctly in every case (I have not run into any problems myself).
I hope this helps.

  protected boolean charMatches(char c) {
    return false;
  }
  protected void computeSubwords(String word, List<Integer> ngrams, List<String> substrings) {
    for(int i = 0; i < word.length(); i++) {
      StringBuilder ngram = new StringBuilder();
      //if (!charMatches(word.charAt(i))) {
        for (int j = i, n = 1; j < word.length() && n <= args.getMaxn(); n++) {
          ngram.append(word.charAt(j++));
          while (j < word.length() && charMatches(word.charAt(j))) {
            ngram.append(word.charAt(j++));
          }
          if (n >= args.getMinn() && !(n == 1 && (i == 0 || j == word.length()))) {
            UnsignedLong h = UnsignedLong.valueOf(hash(ngram.toString()));
            h = h.mod(UnsignedLong.valueOf(args.getBucketNumber()));
            ngrams.add(nWords + h.intValue());
            substrings.add(ngram.toString());
          }
        }
      //}
    }
  }
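
One caveat with treating every Java char as one character, as the workaround above does: a char is a UTF-16 code unit, so characters outside the Basic Multilingual Plane occupy two chars. The sketch below only illustrates that standard Java behavior; the class and method names are hypothetical and not part of the library.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateCheck {
    // Number of UTF-16 code units, which is what charAt() indexes.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, i.e. characters as the C++ loop groups them.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String korean = "\uD55C\uAD6D"; // two Hangul syllables
        String emoji = "\uD83D\uDE00";  // one code point, U+1F600, as a surrogate pair

        System.out.println(charUnits(korean) + " " + codePoints(korean)); // 2 2
        System.out.println(charUnits(emoji) + " " + codePoints(emoji));   // 2 1
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

For Korean text the two counts agree, which may be why the workaround appears to behave. For supplementary characters a char-by-char loop would split a single character into two n-gram units.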

@mondoblu

mondoblu commented Dec 16, 2022

This library is buggy for UTF-8 strings because the port from C++ to Java did not take into account that the type 'string' in C++ is byte-oriented, whereas the type 'String' in Java is character-oriented, with characters encoded internally as UTF-16.

Removing charMatches() does not solve the issue with UTF-8 strings. The underlying problem is the faulty port from C++ to Java, which requires rewriting part of the code, starting with computeSubwords().
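
As a rough illustration of what such a rewrite could look like, the sketch below runs the same loop as the linked C++ computeSubwords, but over the UTF-8 bytes of the word rather than over Java chars. It is not the library's code: the class Utf8Subwords, its method signature, and the choice to return plain strings are assumptions made for this example, and hashing is left out entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Utf8Subwords {
    // True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Hypothetical helper: returns the character n-grams of `word` with
    // minn <= n <= maxn, walking UTF-8 bytes the way the C++ loop does.
    static List<String> computeSubwords(String word, int minn, int maxn) {
        byte[] bytes = word.getBytes(StandardCharsets.UTF_8);
        List<String> substrings = new ArrayList<>();
        for (int i = 0; i < bytes.length; i++) {
            // An n-gram may only start on the first byte of a character.
            if (isContinuation(bytes[i])) {
                continue;
            }
            int j = i;
            for (int n = 1; j < bytes.length && n <= maxn; n++) {
                j++; // consume the lead byte of the next character
                while (j < bytes.length && isContinuation(bytes[j])) {
                    j++; // consume its continuation bytes
                }
                // Same condition as the C++ code: skip 1-grams at either end.
                if (n >= minn && !(n == 1 && (i == 0 || j == bytes.length))) {
                    substrings.add(new String(bytes, i, j - i, StandardCharsets.UTF_8));
                }
            }
        }
        return substrings;
    }

    public static void main(String[] args) {
        // "<" + two Hangul syllables + ">", using fastText-style word boundaries.
        System.out.println(computeSubwords("<\uD55C\uAD6D>", 3, 6));
    }
}
```

To reproduce the C++ bucket ids, each n-gram would presumably also have to be hashed over these same UTF-8 bytes. Whether the Java port's hash() already does that is not established in this thread.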
