jflex warnings in multiple lex files about too long Unicode escape sequences #4734

Open
vladak opened this issue Feb 26, 2025 · 3 comments

vladak commented Feb 26, 2025

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpXref.lex" (line 139): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
Identifier = [a-zA-Z_\u007F-\u10FFFF] [a-zA-Z0-9_\u007F-\u10FFFF]*
                                  ^

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpXref.lex" (line 139): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
Identifier = [a-zA-Z_\u007F-\u10FFFF] [a-zA-Z0-9_\u007F-\u10FFFF]*
                                                              ^

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpXref.lex" (line 175): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
HtmlNameStart = [a-zA-Z_\u00C0-\u10FFFFFF]
                                     ^
[INFO]   generated /home/vkotal/opengrok-vladak-scratch/opengrok-indexer/target/generated-sources/jflex/org/opengrok/indexer/analysis/php/PhpXref.java

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpSymbolTokenizer.lex" (line 75): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
Identifier = [a-zA-Z_\u007F-\u10FFFF] [a-zA-Z0-9_\u007F-\u10FFFF]*
                                  ^

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpSymbolTokenizer.lex" (line 75): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
Identifier = [a-zA-Z_\u007F-\u10FFFF] [a-zA-Z0-9_\u007F-\u10FFFF]*
                                                              ^

vladak commented Feb 26, 2025

JFlex limits a Unicode escape sequence to either 4 or 6 hex digits (see LexScan.flex):

  {Unicode4}  { maybeWarnUnicodeMatch(4);
                string.append( (char) Integer.parseInt(yytext().substring(2,6), 16));
              }
  {Unicode6}  { maybeWarnUnicodeMatch(6);
                int codePoint = Integer.parseInt(yytext().substring(2,8), 16);
                if (codePoint <= getMaximumCodePoint()) {
                  string.append(Character.toChars(codePoint));
                } else {
                  throw new ScannerException(file,ErrorMessages.CODEPOINT_OUT_OF_RANGE, yyline, yycolumn+2);
                }
              }

  "\\u{"      { yybegin(STRING_CODEPOINT_SEQUENCE); }

The maybeWarnUnicodeMatch() call is what emits the warning:

  /**
   * Warn if the matched length of a Unicode escape sequence is longer than expected. Push back the
   * extra characters to be matched again.
   *
   * @param len expected Unicode escape sequence length
   */
  public void maybeWarnUnicodeMatch(int len) {
    // 2 for "\"" followed by "u" or "U" at start of match
    len += 2;
    if (lexLength() > len) {
      Out.warning(file, ErrorMessages.UNICODE_TOO_LONG, lexLine(), lexColumn() + len);
      lexPushback(lexLength() - len);
    }
  }

So a \u escape basically takes up to 4 hex digits, not 6. Any extra digits trigger the warning and are pushed back to be matched again as ordinary characters.
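
To make the effect concrete, here is a minimal Java sketch of what the 4-digit limit does to `\u10FFFF`. It is not JFlex code: the class `UnicodeEscapeSketch` and the helper `splitEscape` are hypothetical names that only imitate the "consume 4 digits, push back the rest" behaviour described in the Javadoc above.

```java
public class UnicodeEscapeSketch {
    /**
     * Hypothetical helper (not part of JFlex): splits the hex digits of an
     * escape into the part that is consumed and the part that is pushed back.
     */
    static String[] splitEscape(String escape, int maxDigits) {
        // Skip the 2-character prefix (backslash plus the letter u).
        String digits = escape.substring(2);
        int used = Math.min(maxDigits, digits.length());
        String consumed = digits.substring(0, used);
        String pushedBack = digits.substring(used);
        return new String[] { consumed, pushedBack };
    }

    public static void main(String[] args) {
        String[] parts = splitEscape("\\u10FFFF", 4);
        int codePoint = Integer.parseInt(parts[0], 16);
        // prints: code point: U+10FF, pushed back: "FF"
        System.out.printf("code point: U+%04X, pushed back: \"%s\"%n", codePoint, parts[1]);
    }
}
```

Under that assumption the upper bound of the range ends up as U+10FF rather than U+10FFFF, which would match the column the warning points at.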


vladak commented Feb 26, 2025

The Java SE 17 documentation says in section 3.3 Unicode Escapes:

One Unicode escape can represent characters in the range U+0000 to U+FFFF; representing supplementary characters in the range U+010000 to U+10FFFF requires two consecutive Unicode escapes.
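
A small Java sketch of what that sentence means in practice, using only the standard `Character` and `String` APIs. The class name `SupplementarySketch` and the helper `charCount` are made up for illustration.

```java
public class SupplementarySketch {
    /** Number of UTF-16 chars needed to represent the given code point. */
    static int charCount(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        // Two consecutive escapes (a surrogate pair) spell the single code point U+10FFFF.
        String last = "\uDBFF\uDFFF";
        System.out.println(last.length());                 // 2
        System.out.printf("U+%X%n", last.codePointAt(0));  // U+10FFFF
        System.out.println(charCount(0xFFFF));             // 1
        System.out.println(charCount(0x10FFFF));           // 2
    }
}
```

So in Java source the highest code point cannot be written as a single 4-digit escape; it takes a high and a low surrogate.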


vladak commented Feb 26, 2025

It looks like JFlex does not support using 2 consecutive Unicode escapes to specify a character range in a regexp. Anyhow, the lexers have most likely been using \u007F-\u10FF (sans the trailing FF) up till now. The set of characters from U+0000 to U+FFFF, sometimes referred to as the Basic Multilingual Plane (BMP), is most likely sufficient for identifiers and such, so it could be fine to use the BMP upper bound.
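
For comparison, a rough Java sketch of which code points each upper bound would accept. The predicates only model the non-ASCII part of the `Identifier` character class, and the "truncated" one rests on the assumption that JFlex has been reading the range as ending at U+10FF. All names here are hypothetical.

```java
public class IdentifierRangeSketch {
    /** Assumed current behaviour: the range effectively stops at U+10FF. */
    static boolean inTruncatedRange(int cp) {
        return cp >= 0x007F && cp <= 0x10FF;
    }

    /** Proposed behaviour: the range covers the rest of the BMP. */
    static boolean inBmpRange(int cp) {
        return cp >= 0x007F && cp <= 0xFFFF;
    }

    public static void main(String[] args) {
        int eAcute = 0x00E9;   // Latin small letter e with acute
        int cjk = 0x4E2D;      // a CJK ideograph inside the BMP
        int emoji = 0x1F600;   // a supplementary code point outside the BMP

        System.out.println(inTruncatedRange(eAcute) + " " + inBmpRange(eAcute)); // true true
        System.out.println(inTruncatedRange(cjk) + " " + inBmpRange(cjk));       // false true
        System.out.println(inTruncatedRange(emoji) + " " + inBmpRange(emoji));   // false false
    }
}
```

If that assumption holds, moving the bound to the end of the BMP would widen what the lexers accept (e.g. CJK identifiers), while supplementary characters would stay excluded as before.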

vladak added a commit to vladak/OpenGrok that referenced this issue Feb 26, 2025