Consideration for Perl-like `(?[])` extended character classes instead of a flag #39
Description
I've been researching regular expression syntax in various languages and engines to inform possible future proposals to expand the ECMAScript regular expression syntax. One of the features I've been reviewing is Perl's Extended Bracketed Character Classes, which support operations such as:
- Intersection (`&`)
- Union (`+` or `|`)
- Subtraction (`-`)
- Symmetric Difference (`^`)
- Complement (`!`)
- Grouping (`(`, `)`)
In this case, such a character class uses the tokens `(?[` and `])`. The contents of the expression can contain the above tokens, whitespace (which is ignored), character classes, metacharacters (such as `\p{..}`, `\s`, etc.), and certain escape sequences (such as `\x0a`, etc.). This allows you to write complex character classes like the following (based on the examples in the explainer):
# non-ASCII digits
(?[ \p{Decimal_Number} - [0-9] ])
# spans of word/identifier letters of specific scripts
(?[ \p{Script=Khmer} & [\p{Letter}\p{Mark}\p{Number}] ])
# breaking spaces
(?[ \p{White_Space} - \p{Line_Break=Glue} ])
# non-ASCII emoji
(?[ \p{Emoji} - \p{ASCII} ])
As well as classes like the following (from the perlre documentation):
# Matches digits in the Thai or Laotian scripts
(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])
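For comparison, here is a rough sketch of how some of these classes can be approximated in ECMAScript today using lookaheads and the `u` flag. This is a workaround, not part of any proposal; the variable names are only illustrative, and `\p{Decimal_Number}` stands in for Perl's `\p{Digit}` on the assumption that the two are equivalent.

```javascript
// Subtraction: \p{Decimal_Number} - [0-9]
// A negative lookahead rejects the subtracted set before matching.
const nonAsciiDigit = /(?![0-9])\p{Decimal_Number}/u;

// Intersection: \p{Script=Khmer} & [\p{Letter}\p{Mark}\p{Number}]
// A positive lookahead requires membership in the first set.
const khmerWordChar = /(?=\p{Script=Khmer})[\p{Letter}\p{Mark}\p{Number}]/u;

// Union plus intersection: ( \p{Thai} + \p{Lao} ) & \p{Digit}
// An ordinary character class already acts as a union.
const thaiOrLaoDigit = /(?=\p{Decimal_Number})[\p{Script=Thai}\p{Script=Lao}]/u;

console.log(nonAsciiDigit.test("7"));       // false (ASCII digit)
console.log(nonAsciiDigit.test("\u0667"));  // true  (ARABIC-INDIC DIGIT SEVEN)
console.log(khmerWordChar.test("\u1780"));  // true  (KHMER LETTER KA)
console.log(khmerWordChar.test("k"));       // false
console.log(thaiOrLaoDigit.test("\u0E53")); // true  (THAI DIGIT THREE)
console.log(thaiOrLaoDigit.test("\u0ED3")); // true  (LAO DIGIT THREE)
console.log(thaiOrLaoDigit.test("3"));      // false
```

The lookahead form works, but it hides the set operation inside assertion syntax and gets harder to read as operations nest, which is part of the motivation for dedicated set notation.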
Currently, `(?[` is not valid RegExp syntax (with or without the `u` flag), so it provides an opportunity to add syntax to cover set notation functionality without needing to introduce a new flag.
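A quick way to check that claim in a current engine is to try constructing such a pattern in both modes. The helper name `isSyntaxError` below is just for illustration:

```javascript
// Returns true if the engine rejects the pattern source with a SyntaxError.
function isSyntaxError(source, flags) {
  try {
    new RegExp(source, flags);
    return false;
  } catch (err) {
    return err instanceof SyntaxError;
  }
}

// One of the Perl-style examples from above, kept as a raw string.
const extendedClassSource = String.raw`(?[ \p{Emoji} - \p{ASCII} ])`;

console.log(isSyntaxError(extendedClassSource, ""));  // true
console.log(isSyntaxError(extendedClassSource, "u")); // true
```

Because the pattern is rejected in both modes, giving `(?[` a meaning should not change the behavior of any existing pattern that currently compiles.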