jflex warnings in multiple lex files about too long Unicode escape sequences #4734

vladak · 2025-02-26T09:32:18Z

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpXref.lex" (line 139): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
Identifier = [a-zA-Z_\u007F-\u10FFFF] [a-zA-Z0-9_\u007F-\u10FFFF]*
                                  ^

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpXref.lex" (line 139): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
Identifier = [a-zA-Z_\u007F-\u10FFFF] [a-zA-Z0-9_\u007F-\u10FFFF]*
                                                              ^

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpXref.lex" (line 175): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
HtmlNameStart = [a-zA-Z_\u00C0-\u10FFFFFF]
                                     ^
[INFO]   generated /home/vkotal/opengrok-vladak-scratch/opengrok-indexer/target/generated-sources/jflex/org/opengrok/indexer/analysis/php/PhpXref.java

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpSymbolTokenizer.lex" (line 75): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
Identifier = [a-zA-Z_\u007F-\u10FFFF] [a-zA-Z0-9_\u007F-\u10FFFF]*
                                  ^

Warning in file "/home/vkotal/opengrok-vladak-scratch/opengrok-indexer/src/main/jflex/analysis/php/PhpSymbolTokenizer.lex" (line 75): 
Unicode escape sequence is too long. Use \u{...} to disambiguate.
Identifier = [a-zA-Z_\u007F-\u10FFFF] [a-zA-Z0-9_\u007F-\u10FFFF]*
                                                              ^

vladak · 2025-02-26T09:47:29Z

JFlex limits a Unicode escape to either 4 or 6 hex digits, depending on which rule matches (see LexScan.flex):

  {Unicode4}  { maybeWarnUnicodeMatch(4);
                string.append( (char) Integer.parseInt(yytext().substring(2,6), 16));
              }
  {Unicode6}  { maybeWarnUnicodeMatch(6);
                int codePoint = Integer.parseInt(yytext().substring(2,8), 16);
                if (codePoint <= getMaximumCodePoint()) {
                  string.append(Character.toChars(codePoint));
                } else {
                  throw new ScannerException(file,ErrorMessages.CODEPOINT_OUT_OF_RANGE, yyline, yycolumn+2);
                }
              }

  "\\u{"      { yybegin(STRING_CODEPOINT_SEQUENCE); }

maybeWarnUnicodeMatch() is the call that emits the warning:

  /**
   * Warn if the matched length of a Unicode escape sequence is longer than expected. Push back the
   * extra characters to be matched again.
   *
   * @param len expected Unicode escape sequence length
   */
  public void maybeWarnUnicodeMatch(int len) {
    // 2 for "\"" followed by "u" or "U" at start of match
    len += 2;
    if (lexLength() > len) {
      Out.warning(file, ErrorMessages.UNICODE_TOO_LONG, lexLine(), lexColumn() + len);
      lexPushback(lexLength() - len);
    }
  }

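So with \u it effectively accepts exactly 4 hex digits, not 6 (the 6-digit form is handled by the separate {Unicode6} rule). Anything beyond the expected length triggers the warning and is pushed back to be matched again.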
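To make the effect on the PHP lexers concrete, here is a minimal Java sketch of what the excerpt above implies for an escape such as \u10FFFF. It is not JFlex code: the class and the split() helper are hypothetical and only mimic the substring/parseInt arithmetic and the pushback of the extra characters.

```java
public class UnicodeEscapeSplit {

    /**
     * Hypothetical helper mimicking the excerpt above: consume the given number
     * of hex digits after the two-character prefix and report what is left over.
     *
     * @return { code point as "U+XXXX", text that would be pushed back }
     */
    static String[] split(String escape, int digits) {
        int end = 2 + digits; // 2 for the backslash and the letter that follows it
        int codePoint = Integer.parseInt(escape.substring(2, end), 16);
        String pushedBack = escape.substring(end);
        return new String[] { String.format("U+%04X", codePoint), pushedBack };
    }

    public static void main(String[] args) {
        // The escape from PhpXref.lex, consumed by a 4-digit rule.
        String[] r = split("\\u10FFFF", 4);
        System.out.println(r[0] + " followed by pushed-back text \"" + r[1] + "\"");
    }
}
```

Under these assumptions the upper bound of the range ends up as U+10FF, and the trailing FF is scanned again as ordinary characters of the character class.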

vladak · 2025-02-26T10:38:49Z

Section 3.3 (Unicode Escapes) of the Java Language Specification for Java SE 17 says:

One Unicode escape can represent characters in the range U+0000 to U+FFFF; representing supplementary characters in the range U+010000 to U+10FFFF requires two consecutive Unicode escapes.
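As an illustration of the quoted rule (in plain Java, not in JFlex syntax), the sketch below prints the escapes needed for a BMP character and for a supplementary one. The class and method names are made up for the example; Character.toChars() is the standard JDK call.

```java
public class SupplementaryEscape {

    /** Render a code point as the Java Unicode escape(s) needed to write it. */
    static String asJavaEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // toChars() yields one char for the BMP, a surrogate pair above it.
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asJavaEscapes(0xFFFF));   // one escape
        System.out.println(asJavaEscapes(0x10FFFF)); // two escapes (surrogate pair)
    }
}
```

U+10FFFF comes out as the high surrogate DBFF followed by the low surrogate DFFF, which is the "two consecutive Unicode escapes" form the specification refers to.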

vladak · 2025-02-26T13:38:19Z

It looks like JFlex does not support using 2 consecutive Unicode escapes to specify a character range in a regexp. Anyhow, it has most likely been using \u007F-\u10FF (sans the trailing FF) as the range up till now, and the Basic Multilingual Plane (BMP), i.e. the set of characters from U+0000 to U+FFFF, is most likely sufficient for identifiers and such, so it should be fine to use \uFFFF as the upper bound.
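A rough sketch of what changing the upper bound means for a few sample code points. This is plain range arithmetic in Java, not JFlex matching; the "old" bound of U+10FF is only the guess made above, and the class, helper and sample values are chosen for illustration.

```java
public class IdentifierRange {

    /** True if the code point lies within [lo, hi], inclusive. */
    static boolean inRange(int codePoint, int lo, int hi) {
        return codePoint >= lo && codePoint <= hi;
    }

    public static void main(String[] args) {
        // Latin small e with acute, a CJK ideograph, and an emoji outside the BMP.
        int[] samples = { 0x00E9, 0x65E5, 0x1F600 };
        for (int cp : samples) {
            boolean oldRange = inRange(cp, 0x7F, 0x10FF); // assumed effective range so far
            boolean newRange = inRange(cp, 0x7F, 0xFFFF); // range capped at the end of the BMP
            System.out.printf("U+%04X old=%b new=%b%n", cp, oldRange, newRange);
        }
    }
}
```

With the BMP bound, U+65E5 becomes part of the range while U+1F600 stays outside it, as it would also have been under the assumed old bound.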

fixes oracle#4734

vladak added a commit to vladak/OpenGrok that referenced this issue Feb 26, 2025

use \uFFFF as upper range for PHP identifiers

c9aaf57

fixes oracle#4734

vladak added a commit to vladak/OpenGrok that referenced this issue Feb 26, 2025

use \uFFFF as upper range for PHP identifiers

cd1d203

fixes oracle#4734
