Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Pattern panics non-Unicode bytes::Regex #367

Closed
l4AgXc opened this issue May 20, 2017 · 1 comment · Fixed by #368
Closed

Pattern panics non-Unicode bytes::Regex #367

l4AgXc opened this issue May 20, 2017 · 1 comment · Fixed by #368
Labels

Comments

@l4AgXc
Copy link

l4AgXc commented May 20, 2017

I am using rustc version 1.14.0 and regex version 0.2.1. I found a pattern that panics a non-Unicode bytes::Regex:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/libcore/option.rs:323

This program demonstrates the panic:

extern crate regex;
fn main() {
    let _ = regex::bytes::Regex::new(r"(?-u).[.]\S\w\x00\x02\x03\x05\x06\x08\x0c\x0e\x10\x12\x14\x16\x17\x19\x1a\x1c\x1e\x21\x23\x25\x27\x28\x2a\x2c\x31\x33\x35\x36\x38\x3b\x3d\x3e\x40\x42\x44\x46\x48\x4a\x4c\x4d\x4f\x50\x52\x53\x55\x56\x58\x59\x5b\x5d\x61\x63\x65\x67\x69\x6a\x6c\x6e\x6f\x70\x72\x74\x76\x77\x79\x7c\x7e\x81\x82\x84\x86\x88\x89\x8b\x8c\x8e\x91\x93\x95\x97\x99\x9b\x9d\x9e\xa1\xa3\xa5\xa6\xa8\xaa\xac\xad\xaf\xb0\xb2\xb4\xb5\xb7\xb9\xbb\xbd\xbe\xc0\xc3\xc5\xc7\xc9\xcb\xcc\xce\xcf\xd1\xd3\xd4\xd6\xd7\xd9\xdb\xdc\xde\xe2\xe3\xe5\xe6\xe8\xe9\xeb\xee\xf1\xf3\xf5\xf7\xf9\xfb\xfc\xfe");
}

The pattern consists of 136 unique literal byte values, plus the sub-patterns ., [.], \S, and \w. Here is what I have been able to find out:

  • The pattern is close to minimal. Removing any of the 136 literal bytes, or any of the four sub-patterns, avoids the panic. You can add to the pattern anywhere and it still panics.
  • Order doesn't matter. I constructed the pattern by starting with a much larger bytes::RegexSetBuilder that panicked, removing as much as I could while still having it panic, and sorting.
  • The panic isn't related to backslash escapes. I.e., you can replace \x44 with a literal D and it still panics.
  • Some modifications avoid the panic and some do not. For example, changing \x00 to \x01 still panics, but changing \xde to \xdf does not. Changing \w to \b still panics, but changing \w to \d does not.
  • Non-Unicode is necessary. If you omit the (?-u) (or unicode(false) when using a RegexBuilder or RegexSetBuilder), then it does not panic.

The same thing happens with bytes::RegexBuilder, bytes::RegexSet, and bytes::RegexSetBuilder. The 140 necessary elements can be distributed across multiple patterns when using a builder. Here is an example of a bytes::RegexSetBuilder that panics:

extern crate regex;
fn main() {
    let mut rsb = regex::bytes::RegexSetBuilder::new([
        r"\xa3\xd7\x40\x95\x59\xd4\x2a\x86\x93\xaf",
        r"\x16\xa1\x14\x19\x00\x2c\x27\xcc\x10\xcb\xee\xf5\xeb\xfb\xb5\xd9\x46\x25\x23\x38\x36\x35\x56\x31\x4a\x44\x4c\x99\xc7\x9d\x3d\w\xc0\x9b\x3b\x12\xdb\x89",
        r"\x84",
        r"\xcf\x8c",
        r"\x7c\xbd\x97\xfc\x3e\x6c\x79\x7e\xc3\x9e\x5b\x42\xf3\x17\x06\x08\xc5\xac\x05\x53\xe9",
        r"\xdc\xc9\x8b\x1a\x02\x1c\x76\x6a\xd3\xb4\x91\x0c\x1e\x03\x70\x77\x55\x52\x0e",
        r"\xde\xb2\xad.\x8e\x88\xd6\x81\xf9\xb7\xfe\xce\xf7\xb0\xe6",
        r"\xd1\x4d\x72\x6f\x74\x63\x61\x6e\x67\x65\x50\x4f\x33\xe2\xe5\xf1\xe8[.]\xe3",
        r"\xa6\xbe\xb9\xaa\xbb\x28\x69\xa5\x48\xa8",
        r"\S\x5d\x21\x58\x82",
    ].iter());
    rsb.unicode(false);
    match rsb.build() {
        Ok(_) => println!("ok"),
        Err(e) => println!("error {}", e),
    };
}

Here are all 140 elements of the pattern in order:

.
[.]
\S
\w
\x00
\x02
\x03
\x05
\x06
\x08
\x0c
\x0e
\x10
\x12
\x14
\x16
\x17
\x19
\x1a
\x1c
\x1e
\x21
\x23
\x25
\x27
\x28
\x2a
\x2c
\x31
\x33
\x35
\x36
\x38
\x3b
\x3d
\x3e
\x40
\x42
\x44
\x46
\x48
\x4a
\x4c
\x4d
\x4f
\x50
\x52
\x53
\x55
\x56
\x58
\x59
\x5b
\x5d
\x61
\x63
\x65
\x67
\x69
\x6a
\x6c
\x6e
\x6f
\x70
\x72
\x74
\x76
\x77
\x79
\x7c
\x7e
\x81
\x82
\x84
\x86
\x88
\x89
\x8b
\x8c
\x8e
\x91
\x93
\x95
\x97
\x99
\x9b
\x9d
\x9e
\xa1
\xa3
\xa5
\xa6
\xa8
\xaa
\xac
\xad
\xaf
\xb0
\xb2
\xb4
\xb5
\xb7
\xb9
\xbb
\xbd
\xbe
\xc0
\xc3
\xc5
\xc7
\xc9
\xcb
\xcc
\xce
\xcf
\xd1
\xd3
\xd4
\xd6
\xd7
\xd9
\xdb
\xdc
\xde
\xe2
\xe3
\xe5
\xe6
\xe8
\xe9
\xeb
\xee
\xf1
\xf3
\xf5
\xf7
\xf9
\xfb
\xfc
\xfe
BurntSushi added a commit to BurntSushi/regex that referenced this issue May 20, 2017
The code that computes the byte classes had a bug that was only tripped
when the maximum number of equivalence classes was necessary (256). This
commit fixes that by rewriting the loop such that we never uselessly
increment outside the bounds of `u8`.

Fixes rust-lang#367
@BurntSushi
Copy link
Member

Nice find. A "minimal" (or perhaps, more straight forward) reproduction is the following:

extern crate regex;

fn main() {
    let mut pat = "(?-u)".to_string();
    for i in 0..256 {
        let hexbyte = format!(r"\x{:02x}", i);
        pat.push_str(&hexbyte);
    }
    let _ = regex::bytes::Regex::new(&pat);
}

The underlying cause of this was the "byte class optimization." In particular, the optimization computes equivalence classes of bytes that are used to transition the DFA, since there is typically a much smaller number of equivalence classes than 256, which means the transition table in memory is much smaller. However, the part of the compiler that computed the equivalence classes tripped over a bug when the number of equivalence classes reached its maximum (256):

        let mut byte_classes = vec![0; 256];
        let mut class = 0u8;
        for i in 0..256 {
            byte_classes[i] = class as u8;
            if self.0[i] {
                class = class.checked_add(1).unwrap();
            }
        }
        byte_classes

In this case, on the last iteration of the loop, the class = class.checked_add(1).unwrap() would uselessly execute and overflow class (which is a u8), and is therefore what caused the panic.

I have a fix for this in #368. Thanks for the report!

@BurntSushi BurntSushi added the bug label May 20, 2017
bors added a commit that referenced this issue May 20, 2017
compiler: fix a byte class bug

The code that computes the byte classes had a bug that was only tripped
when the maximum number of equivalence classes was necessary (256). This
commit fixes that by rewriting the loop such that we never uselessly
increment outside the bounds of `u8`.

Fixes #367
@bors bors closed this as completed in #368 May 20, 2017
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants