Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Replace some manual script detection with regular expressions #4541

Closed
1ec5 opened this issue Aug 12, 2024 · 1 comment · Fixed by #4560
Closed

Replace some manual script detection with regular expressions #4541

1ec5 opened this issue Aug 12, 2024 · 1 comment · Fixed by #4560

Comments

@1ec5
Copy link
Contributor

1ec5 commented Aug 12, 2024

For several years, I’ve been manually maintaining a relevant subset of the Unicode Character Database that GL JS uses to detect scripts for vertical layout, ideographic line breaking, complex text layout, and right-to-left layout. This solution works reasonably efficiently, but it’s difficult to keep it up-to-date as Unicode evolves, because relatively few contributors are familiar with these parts of the Unicode standard or the associated text layout conventions.

I think the custom script detection logic should be replaced by Unicode-aware regular expressions where possible. I’ve identified two such opportunities; there may be others.

It should be possible to replace the following function:

export function charInComplexShapingScript(char: number) {
return isChar['Arabic'](char) ||
isChar['Arabic Supplement'](char) ||
isChar['Arabic Extended-A'](char) ||
isChar['Arabic Extended-B'](char) ||
isChar['Arabic Presentation Forms-A'](char) ||
isChar['Arabic Presentation Forms-B'](char);
}

with the following code:

export function charInComplexShapingScript(char: number) {
    return /\p{Sc=Arab}/u.test(String.fromCodePoint(char));
}

and replace the following function:

export function charAllowsIdeographicBreaking(char: number) {
// Return early for characters outside all ideographic ranges.
if (char < 0x2E80) return false;
if (isChar['Bopomofo Extended'](char)) return true;
if (isChar['Bopomofo'](char)) return true;
if (isChar['CJK Compatibility Forms'](char)) return true;
if (isChar['CJK Compatibility Ideographs'](char)) return true;
if (isChar['CJK Compatibility'](char)) return true;
if (isChar['CJK Radicals Supplement'](char)) return true;
if (isChar['CJK Strokes'](char)) return true;
if (isChar['CJK Symbols and Punctuation'](char)) return true;
if (isChar['CJK Unified Ideographs Extension A'](char)) return true;
if (isChar['CJK Unified Ideographs'](char)) return true;
if (isChar['Enclosed CJK Letters and Months'](char)) return true;
if (isChar['Halfwidth and Fullwidth Forms'](char)) return true;
if (isChar['Hiragana'](char)) return true;
if (isChar['Ideographic Description Characters'](char)) return true;
if (isChar['Kangxi Radicals'](char)) return true;
if (isChar['Katakana Phonetic Extensions'](char)) return true;
if (isChar['Katakana'](char)) return true;
if (isChar['Vertical Forms'](char)) return true;
if (isChar['Yi Radicals'](char)) return true;
if (isChar['Yi Syllables'](char)) return true;
return false;
}

with the following code:

export function charAllowsIdeographicBreaking(char: number) {
    return /\p{Ideo}/u.test(String.fromCodePoint(char));
}

This will make the functions more future-proof, so that we don’t need to notice that a new CJK-related block has been added to Unicode etc. The downside is that this requires a relatively recent version of ECMAScript. This codebase has never used a Unicode regular expression so far, let alone a Unicode character escape. Does GL JS still maintain capability with Chrome 63, Firefox 77, or Safari 11.0 or earlier versions of these browsers?

Unfortunately, it isn’t currently possible to replace charHasUprightVerticalOrientation(), stringContainsRTLText(), or unicodeBlockLookup with something more robust, because the ECMAScript standard doesn’t yet allow a regular expression to match on the Vertical_Orientation, Bidi_Class, or Block property, respectively.

@HarelM
Copy link
Collaborator

HarelM commented Aug 12, 2024

There's no need to support old browsers. Newer version of maplibre will support new browsers while old versions can support old browsers (given that the old version does not have a "big" market share).
I think the suggestion above is a good simplification and moves the responsibility to regexp, which is better than hardcoding stuff like it is right now.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants