doc clarification: confusing match behavior for non-existent ASCII character classes #1234

Open
dawnofmidnight opened this issue Oct 27, 2024 · 1 comment

dawnofmidnight commented Oct 27, 2024

Crate version: 1.11.0
Example code: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c4b4cfe18c2e6413444e53315de33b27 (used for snippets below and extra checks)

The behavior of the crate when trying to use the ASCII character class syntax [[:foo:]] with invalid character classes is somewhat confusing. A friend was trying to use [[:XID_Start:]] to check whether _ (underscore/low line) was included in the XID_Start character class (it's not), and was confused when it returned true.

let expr = regex::Regex::new(r"[[:XID_Start:]]").unwrap();
dbg!(expr.is_match("_")); // true

The correct syntax, \p{XID_Start}, does work correctly:

let correct = regex::Regex::new(r"\p{XID_Start}").unwrap();
dbg!(correct.is_match("a")); // true
dbg!(correct.is_match("1")); // false
dbg!(correct.is_match("_")); // false

It seems that when the name is not a valid ASCII character class (regex § ASCII character classes), the pattern falls back to matching any single character that appears between the brackets:

dbg!(expr.is_match(":")); // true
dbg!(expr.is_match("X")); // true
dbg!(expr.is_match("x")); // false
dbg!(expr.is_match("a")); // true
dbg!(expr.is_match("b")); // false
dbg!(expr.is_match("[")); // false
dbg!(expr.is_match("]")); // false

I'm not entirely sure how regex is actually interpreting this sequence but, assuming the behavior is intentional, I think it is worth documenting in the aforementioned section on ASCII character classes in the docs, since it is not immediately intuitive.
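
One reading consistent with the results above is that the inner `[:XID_Start:]` is treated as a nested bracket set of literal characters, so the whole pattern matches any one of `:`, `X`, `I`, `D`, `_`, `S`, `t`, `a`, `r`. The sketch below only models that reading with plain std code. It does not call the regex crate, `fallback_set_matches` is a hypothetical name, and the assumption that the fallback is a literal set is inferred from the observed results rather than confirmed from the parser source.

```rust
/// Hypothetical model of the observed fallback: the text between the outer
/// brackets is treated as a set of literal characters, and (like an
/// unanchored `is_match`) the haystack matches if it contains any of them.
fn fallback_set_matches(inner: &str, haystack: &str) -> bool {
    haystack.chars().any(|c| inner.contains(c))
}

fn main() {
    // The characters between the outer brackets of `[[:XID_Start:]]`,
    // minus the inner brackets themselves.
    let inner = ":XID_Start:";

    // (input, result reported in the issue)
    let observed = [
        ("_", true),
        (":", true),
        ("X", true),
        ("x", false),
        ("a", true),
        ("b", false),
        ("[", false),
        ("]", false),
    ];

    for (input, expected) in observed {
        assert_eq!(fallback_set_matches(inner, input), expected, "input {input:?}");
    }
    println!("model agrees with every result reported above");
}
```

If this reading is right, `x` and `b` fail simply because they are not among the letters of `XID_Start`, and `[` and `]` fail because the inner brackets delimit the nested set instead of being members of it.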

@BurntSushi
Member

Yes, the behavior is unfortunate but intentional, for compatibility with how other regex engines work. In retrospect, I would rather have been a bit more strict here and produced errors for unrecognized classes.

I agree that adding a note to the docs about this would be a good idea.
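
Until the docs carry such a note, callers who want the stricter behavior can approximate it on their side by scanning a pattern for `[:name:]` sequences whose name is not one of the documented ASCII classes. The following is a rough sketch under stated assumptions: `unknown_ascii_classes` and `ASCII_CLASSES` are hypothetical names, not part of regex; the class list is copied from the ASCII character classes section of the docs; and the scan is purely textual, so it ignores escapes and other syntax edge cases.

```rust
/// ASCII class names as listed in the regex docs (assumed complete).
const ASCII_CLASSES: [&str; 14] = [
    "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "word", "xdigit",
];

/// Hypothetical lint: returns every `name` found in a `[:name:]` or
/// `[:^name:]` sequence that is not a documented ASCII class.
/// This is a textual scan, not a regex parser.
fn unknown_ascii_classes(pattern: &str) -> Vec<String> {
    let mut unknown = Vec::new();
    let mut rest = pattern;
    while let Some(start) = rest.find("[:") {
        let after = &rest[start + 2..];
        let Some(end) = after.find(":]") else { break };
        let raw = &after[..end];
        // `[:^alpha:]` is the negated form of `[:alpha:]`.
        let name = raw.strip_prefix('^').unwrap_or(raw);
        if !ASCII_CLASSES.contains(&name) {
            unknown.push(name.to_string());
        }
        rest = &after[end + 2..];
    }
    unknown
}

fn main() {
    for pattern in ["[[:XID_Start:]]", "[[:alpha:]]", "[[:^digit:][:space:]]"] {
        let unknown = unknown_ascii_classes(pattern);
        if unknown.is_empty() {
            println!("{pattern}: ok");
        } else {
            println!("{pattern}: unrecognized class name(s) {unknown:?}");
        }
    }
}
```

A check like this could run before `Regex::new` in tests or at startup, so that a typo such as `[[:XID_Start:]]` is reported instead of silently compiling to a literal set.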

BurntSushi added the doc label Oct 27, 2024