Bug for subwords related to utf-8 #8

linetor · 2019-10-21T07:37:58Z

Hi, I'm a Korean developer and I have been using your library successfully.
But when I trained a fastText model with subword n-grams (e.g. minn=3 and maxn=6), the model's predictions differed between the original library (via Python) and your library (Java).
I debugged this and found that the cause is charMatches.
I see that you ported the C++ code to Java.
The original code ( https://github.com/facebookresearch/fastText/blob/0c6db7c2d6ba9e0ff81713ed9f50c3142e4ba700/src/dictionary.cc#L172-L195 ) indexes the string by byte, so it has to work out whether a character takes 3 bytes (like Korean or Japanese) or 1 byte (like digits or English letters).
But a Java char is not a byte. In Java we get a whole character (such as a Korean one) directly, so there is no need for the byte comparison that finds character boundaries (like (c & 0xC0) == 0x80). When that check is applied to the chars of a word, it causes the bug.
So I think the code needs to change, for example by removing the lines that call the charMatches function in fastText4j/src/main/java/fasttext/BaseDictionary.java (lines 321 to 339 in c3eb898), or maybe more:

    protected void computeSubwords(String word, List<Integer> ngrams, List<String> substrings) {
      for(int i = 0; i < word.length(); i++) {
        StringBuilder ngram = new StringBuilder();
        if (!charMatches(word.charAt(i))) {
          for (int j = i, n = 1; j < word.length() && n <= args.getMaxn(); n++) {
            ngram.append(word.charAt(j++));
            while (j < word.length() && charMatches(word.charAt(j))) {
              ngram.append(word.charAt(j++));
            }
            if (n >= args.getMinn() && !(n == 1 && (i == 0 || j == word.length()))) {
              UnsignedLong h = UnsignedLong.valueOf(hash(ngram.toString()));
              h = h.mod(UnsignedLong.valueOf(args.getBucketNumber()));
              ngrams.add(nWords + h.intValue());
              substrings.add(ngram.toString());
            }
          }
        }
      }
    }

I hope you can look into this. Let me know if you need anything.
Thanks
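To illustrate the byte/char mismatch described above, here is a minimal sketch. The class and method names (Utf8VsChar, isContinuationByte, utf8Length) are hypothetical and not part of fastText4j; the check itself is assumed to be the (c & 0xC0) == 0x80 test quoted above. It shows that a three-syllable Korean word is 3 Java chars but 9 UTF-8 bytes, and that running the continuation-byte test on a Java char can flag a whole character by accident.

```java
import java.nio.charset.StandardCharsets;

public class Utf8VsChar {
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    // This test is only meaningful for bytes, not for UTF-16 chars.
    static boolean isContinuationByte(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+D55C U+AD6D U+C5B4, the Korean word for "Korean language".
        String word = "\uD55C\uAD6D\uC5B4";

        System.out.println(word.length());     // 3 UTF-16 code units
        System.out.println(utf8Length(word));  // 9 UTF-8 bytes

        // Applying the byte test to each Java char looks only at the low bits
        // of the code unit, so the result is unrelated to character boundaries.
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            System.out.printf("U+%04X treated as continuation: %b%n",
                    (int) c, isContinuationByte(c));
        }
    }
}
```

Here U+C5B4 passes the test (its low byte is 0xB4) while U+D55C and U+AD6D do not, so a char-based port would skip or merge some characters depending on their code point.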


devanandj · 2020-10-12T03:38:08Z

Hey linetor. Did you manage to fix that issue? I'm facing the same issue but I don't know Korean so I am finding it difficult to debug. Do let me know the fix if possible. Thanks.

linetor · 2020-10-15T08:01:30Z

Hi, devanandj.
I changed it as shown below, but I can't guarantee that it works correctly (in my case, I have not run into any exceptions).
I hope this helps.

    // Workaround: never treat a Java char as a UTF-8 continuation byte.
    protected boolean charMatches(char c) {
      return false;
    }

    protected void computeSubwords(String word, List<Integer> ngrams, List<String> substrings) {
      for(int i = 0; i < word.length(); i++) {
        StringBuilder ngram = new StringBuilder();
        // The outer charMatches check is disabled, so every char starts an n-gram.
        //if (!charMatches(word.charAt(i))) {
          for (int j = i, n = 1; j < word.length() && n <= args.getMaxn(); n++) {
            ngram.append(word.charAt(j++));
            // With charMatches always returning false, this loop never runs.
            while (j < word.length() && charMatches(word.charAt(j))) {
              ngram.append(word.charAt(j++));
            }
            if (n >= args.getMinn() && !(n == 1 && (i == 0 || j == word.length()))) {
              UnsignedLong h = UnsignedLong.valueOf(hash(ngram.toString()));
              h = h.mod(UnsignedLong.valueOf(args.getBucketNumber()));
              ngrams.add(nWords + h.intValue());
              substrings.add(ngram.toString());
            }
          }
        //}
      }
    }

mondoblu · 2022-12-16T11:42:00Z

This library is buggy for UTF-8 strings: the port from C++ to Java did not take into account that the type 'string' in C++ is byte-oriented, whereas the type 'String' in Java is character-oriented, with characters encoded internally as UTF-16.

Removing charMatches() does not solve the issue with UTF-8 strings, because the underlying problem is the faulty port from C++ to Java. Fixing it requires rewriting part of the code, starting with computeSubwords().
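As a rough sketch of what a byte-oriented rewrite could look like, the following walks the UTF-8 bytes of the word with the same loop structure as the Java method quoted earlier in this thread. It is an illustration under assumptions, not the library's fix: the class ByteSubwords and its methods are hypothetical, the hashing and bucket steps are left out, and minn/maxn are passed in directly instead of being read from args.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteSubwords {
    // True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Collects character n-grams with minn <= n <= maxn by scanning UTF-8 bytes,
    // so that n counts encoded characters rather than UTF-16 code units.
    static List<String> subwords(String word, int minn, int maxn) {
        byte[] bytes = word.getBytes(StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < bytes.length; i++) {
            if (isContinuationByte(bytes[i])) {
                continue; // n-grams may only start on a character boundary
            }
            int j = i;
            for (int n = 1; j < bytes.length && n <= maxn; n++) {
                j++; // consume the lead byte of the next character
                while (j < bytes.length && isContinuationByte(bytes[j])) {
                    j++; // consume its continuation bytes
                }
                // Same filter as the quoted code: skip single characters
                // that sit at the very start or very end of the word.
                if (n >= minn && !(n == 1 && (i == 0 || j == bytes.length))) {
                    out.add(new String(bytes, i, j - i, StandardCharsets.UTF_8));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumes the word is already wrapped in '<' and '>' markers.
        for (String s : subwords("<\uD55C\uAD6D\uC5B4>", 3, 6)) {
            System.out.println(s);
        }
    }
}
```

For the five-character input in main, this yields six n-grams of lengths 3 to 5. A full fix would presumably also need the hash to be computed over the same UTF-8 bytes so that bucket indices match the C++ side; that part is not shown here.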
