Pattern panics non-Unicode `bytes::Regex` #367

l4AgXc · 2017-05-20T09:36:14Z

I am using rustc version 1.14.0 and regex version 0.2.1. I found a pattern that panics a non-Unicode bytes::Regex:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/libcore/option.rs:323

This program demonstrates the panic:

extern crate regex;
fn main() {
    let _ = regex::bytes::Regex::new(r"(?-u).[.]\S\w\x00\x02\x03\x05\x06\x08\x0c\x0e\x10\x12\x14\x16\x17\x19\x1a\x1c\x1e\x21\x23\x25\x27\x28\x2a\x2c\x31\x33\x35\x36\x38\x3b\x3d\x3e\x40\x42\x44\x46\x48\x4a\x4c\x4d\x4f\x50\x52\x53\x55\x56\x58\x59\x5b\x5d\x61\x63\x65\x67\x69\x6a\x6c\x6e\x6f\x70\x72\x74\x76\x77\x79\x7c\x7e\x81\x82\x84\x86\x88\x89\x8b\x8c\x8e\x91\x93\x95\x97\x99\x9b\x9d\x9e\xa1\xa3\xa5\xa6\xa8\xaa\xac\xad\xaf\xb0\xb2\xb4\xb5\xb7\xb9\xbb\xbd\xbe\xc0\xc3\xc5\xc7\xc9\xcb\xcc\xce\xcf\xd1\xd3\xd4\xd6\xd7\xd9\xdb\xdc\xde\xe2\xe3\xe5\xe6\xe8\xe9\xeb\xee\xf1\xf3\xf5\xf7\xf9\xfb\xfc\xfe");
}

The pattern consists of 136 unique literal byte values, plus the sub-patterns ., [.], \S, and \w. Here is what I have been able to find out:

The pattern is close to minimal. Removing any of the 136 literal bytes, or any of the four sub-patterns, avoids the panic. You can add to the pattern anywhere and it still panics.
Order doesn't matter. I constructed the pattern by starting with a much larger bytes::RegexSetBuilder that panicked, removing as much as I could while still having it panic, and sorting.
The panic isn't related to backslash escapes. I.e., you can replace \x44 with a literal D and it still panics.
Some modifications avoid the panic and some do not. For example, changing \x00 to \x01 still panics, but changing \xde to \xdf does not. Changing \w to \b still panics, but changing \w to \d does not.
Non-Unicode is necessary. If you omit the (?-u) (or unicode(false) when using a RegexBuilder or RegexSetBuilder), then it does not panic.

The same thing happens with bytes::RegexBuilder, bytes::RegexSet, and bytes::RegexSetBuilder. The 140 necessary elements can be distributed across multiple patterns when using a builder. Here is an example of a bytes::RegexSetBuilder that panics:

extern crate regex;
fn main() {
    let mut rsb = regex::bytes::RegexSetBuilder::new([
        r"\xa3\xd7\x40\x95\x59\xd4\x2a\x86\x93\xaf",
        r"\x16\xa1\x14\x19\x00\x2c\x27\xcc\x10\xcb\xee\xf5\xeb\xfb\xb5\xd9\x46\x25\x23\x38\x36\x35\x56\x31\x4a\x44\x4c\x99\xc7\x9d\x3d\w\xc0\x9b\x3b\x12\xdb\x89",
        r"\x84",
        r"\xcf\x8c",
        r"\x7c\xbd\x97\xfc\x3e\x6c\x79\x7e\xc3\x9e\x5b\x42\xf3\x17\x06\x08\xc5\xac\x05\x53\xe9",
        r"\xdc\xc9\x8b\x1a\x02\x1c\x76\x6a\xd3\xb4\x91\x0c\x1e\x03\x70\x77\x55\x52\x0e",
        r"\xde\xb2\xad.\x8e\x88\xd6\x81\xf9\xb7\xfe\xce\xf7\xb0\xe6",
        r"\xd1\x4d\x72\x6f\x74\x63\x61\x6e\x67\x65\x50\x4f\x33\xe2\xe5\xf1\xe8[.]\xe3",
        r"\xa6\xbe\xb9\xaa\xbb\x28\x69\xa5\x48\xa8",
        r"\S\x5d\x21\x58\x82",
    ].iter());
    rsb.unicode(false);
    match rsb.build() {
        Ok(_) => println!("ok"),
        Err(e) => println!("error {}", e),
    };
}

Here are all 140 elements of the pattern in order:

.
[.]
\S
\w
\x00
\x02
\x03
\x05
\x06
\x08
\x0c
\x0e
\x10
\x12
\x14
\x16
\x17
\x19
\x1a
\x1c
\x1e
\x21
\x23
\x25
\x27
\x28
\x2a
\x2c
\x31
\x33
\x35
\x36
\x38
\x3b
\x3d
\x3e
\x40
\x42
\x44
\x46
\x48
\x4a
\x4c
\x4d
\x4f
\x50
\x52
\x53
\x55
\x56
\x58
\x59
\x5b
\x5d
\x61
\x63
\x65
\x67
\x69
\x6a
\x6c
\x6e
\x6f
\x70
\x72
\x74
\x76
\x77
\x79
\x7c
\x7e
\x81
\x82
\x84
\x86
\x88
\x89
\x8b
\x8c
\x8e
\x91
\x93
\x95
\x97
\x99
\x9b
\x9d
\x9e
\xa1
\xa3
\xa5
\xa6
\xa8
\xaa
\xac
\xad
\xaf
\xb0
\xb2
\xb4
\xb5
\xb7
\xb9
\xbb
\xbd
\xbe
\xc0
\xc3
\xc5
\xc7
\xc9
\xcb
\xcc
\xce
\xcf
\xd1
\xd3
\xd4
\xd6
\xd7
\xd9
\xdb
\xdc
\xde
\xe2
\xe3
\xe5
\xe6
\xe8
\xe9
\xeb
\xee
\xf1
\xf3
\xf5
\xf7
\xf9
\xfb
\xfc
\xfe

The text was updated successfully, but these errors were encountered:

The code that computes the byte classes had a bug that was only tripped when the maximum number of equivalence classes was necessary (256). This commit fixes that by rewriting the loop such that we never uselessly increment outside the bounds of `u8`. Fixes rust-lang#367

BurntSushi · 2017-05-20T14:07:28Z

Nice find. A "minimal" (or perhaps, more straight forward) reproduction is the following:

extern crate regex;

fn main() {
    let mut pat = "(?-u)".to_string();
    for i in 0..256 {
        let hexbyte = format!(r"\x{:02x}", i);
        pat.push_str(&hexbyte);
    }
    let _ = regex::bytes::Regex::new(&pat);
}

The underlying cause of this was the "byte class optimization." In particular, the optimization computes equivalence classes of bytes that are used to transition the DFA, since there is typically a much smaller number of equivalence classes than 256, which means the transition table in memory is much smaller. However, the part of the compiler that computed the equivalence classes tripped over a bug when the number of equivalence classes reached its maximum (256):

        let mut byte_classes = vec![0; 256];
        let mut class = 0u8;
        for i in 0..256 {
            byte_classes[i] = class as u8;
            if self.0[i] {
                class = class.checked_add(1).unwrap();
            }
        }
        byte_classes

In this case, on the last iteration of the loop, the class = class.checked_add(1).unwrap() would uselessly execute and overflow class (which is a u8), and is therefore what caused the panic.

I have a fix for this in #368. Thanks for the report!

compiler: fix a byte class bug The code that computes the byte classes had a bug that was only tripped when the maximum number of equivalence classes was necessary (256). This commit fixes that by rewriting the loop such that we never uselessly increment outside the bounds of `u8`. Fixes #367

BurntSushi mentioned this issue May 20, 2017

compiler: fix a byte class bug #368

Merged

BurntSushi added the bug label May 20, 2017

bors closed this as completed in #368 May 20, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pattern panics non-Unicode `bytes::Regex` #367

Pattern panics non-Unicode `bytes::Regex` #367

l4AgXc commented May 20, 2017

BurntSushi commented May 20, 2017

Pattern panics non-Unicode bytes::Regex #367

Pattern panics non-Unicode bytes::Regex #367

Comments

l4AgXc commented May 20, 2017

BurntSushi commented May 20, 2017

Pattern panics non-Unicode `bytes::Regex` #367

Pattern panics non-Unicode `bytes::Regex` #367