Replace some manual script detection with regular expressions #4541

1ec5 · 2024-08-12T01:13:45Z

For several years, I’ve been manually maintaining a relevant subset of the Unicode Character Database that GL JS uses to detect scripts for vertical layout, ideographic line breaking, complex text layout, and right-to-left layout. This solution works reasonably efficiently, but it’s difficult to keep it up-to-date as Unicode evolves, because relatively few contributors are familiar with these parts of the Unicode standard or the associated text layout conventions.

I think the custom script detection logic should be replaced by Unicode-aware regular expressions where possible. I’ve identified two such opportunities; there may be others.

It should be possible to replace the following function:

maplibre-gl-js/src/util/script_detection.ts

Lines 268 to 275 in 007ad7b

    
           export function charInComplexShapingScript(char: number) { 
        
               return isChar['Arabic'](char) || 
        
                      isChar['Arabic Supplement'](char) || 
        
                      isChar['Arabic Extended-A'](char) || 
        
                      isChar['Arabic Extended-B'](char) || 
        
                      isChar['Arabic Presentation Forms-A'](char) || 
        
                      isChar['Arabic Presentation Forms-B'](char); 
        
           }

with the following code:

export function charInComplexShapingScript(char: number) {
    return /\p{Sc=Arab}/u.test(String.fromCodePoint(char));
}

and replace the following function:

maplibre-gl-js/src/util/script_detection.ts

Lines 37 to 63 in 007ad7b

    
           export function charAllowsIdeographicBreaking(char: number) { 
        
               // Return early for characters outside all ideographic ranges. 
        
               if (char < 0x2E80) return false; 
        
               if (isChar['Bopomofo Extended'](char)) return true; 
        
               if (isChar['Bopomofo'](char)) return true; 
        
               if (isChar['CJK Compatibility Forms'](char)) return true; 
        
               if (isChar['CJK Compatibility Ideographs'](char)) return true; 
        
               if (isChar['CJK Compatibility'](char)) return true; 
        
               if (isChar['CJK Radicals Supplement'](char)) return true; 
        
               if (isChar['CJK Strokes'](char)) return true; 
        
               if (isChar['CJK Symbols and Punctuation'](char)) return true; 
        
               if (isChar['CJK Unified Ideographs Extension A'](char)) return true; 
        
               if (isChar['CJK Unified Ideographs'](char)) return true; 
        
               if (isChar['Enclosed CJK Letters and Months'](char)) return true; 
        
               if (isChar['Halfwidth and Fullwidth Forms'](char)) return true; 
        
               if (isChar['Hiragana'](char)) return true; 
        
               if (isChar['Ideographic Description Characters'](char)) return true; 
        
               if (isChar['Kangxi Radicals'](char)) return true; 
        
               if (isChar['Katakana Phonetic Extensions'](char)) return true; 
        
               if (isChar['Katakana'](char)) return true; 
        
               if (isChar['Vertical Forms'](char)) return true; 
        
               if (isChar['Yi Radicals'](char)) return true; 
        
               if (isChar['Yi Syllables'](char)) return true; 
        
               return false; 
        
           }

with the following code:

export function charAllowsIdeographicBreaking(char: number) {
    return /\p{Ideo}/u.test(String.fromCodePoint(char));
}

This will make the functions more future-proof, so that we don’t need to notice that a new CJK-related block has been added to Unicode etc. The downside is that this requires a relatively recent version of ECMAScript. This codebase has never used a Unicode regular expression so far, let alone a Unicode character escape. Does GL JS still maintain capability with Chrome 63, Firefox 77, or Safari 11.0 or earlier versions of these browsers?

Unfortunately, it isn’t currently possible to replace charHasUprightVerticalOrientation(), stringContainsRTLText(), or unicodeBlockLookup with something more robust, because the ECMAScript standard doesn’t yet allow a regular expression to match on the Vertical_Orientation, Bidi_Class, or Block property, respectively.

The text was updated successfully, but these errors were encountered:

HarelM · 2024-08-12T05:34:48Z

There's no need to support old browsers. Newer version of maplibre will support new browsers while old versions can support old browsers (given that the old version does not have a "big" market share).
I think the suggestion above is a good simplification and moves the responsibility to regexp, which is better than hardcoding stuff like it is right now.

1ec5 mentioned this issue Aug 12, 2024

Implement Unicode combining characters #2388

Closed

This was referenced Aug 13, 2024

Render non-BMP CJKV characters locally #4550

Open

Simplify script detection using regular expressions #4560

Merged

HarelM closed this as completed in #4560 Aug 15, 2024

1ec5 mentioned this issue Aug 27, 2024

Render complex text, variant forms, emoji, etc. 1ec5/maplibre-gl-js#1

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace some manual script detection with regular expressions #4541

Replace some manual script detection with regular expressions #4541

1ec5 commented Aug 12, 2024 •

edited

Loading

HarelM commented Aug 12, 2024

Replace some manual script detection with regular expressions #4541

Replace some manual script detection with regular expressions #4541

Comments

1ec5 commented Aug 12, 2024 • edited Loading

HarelM commented Aug 12, 2024

1ec5 commented Aug 12, 2024 •

edited

Loading