Skip to content

Commit de03399

Browse files
committed
automata: fix incorrect use of Aho-Corasick's "standard" semantics
This fixes a bug in how prefilters were applied for multi-regexes compiled with "all" semantics. It turns out that this corresponds to the regex crate's RegexSet API, but only its `is_match` routine. See the comment on the regression test added in this PR for an explanation of what happened. Basically, it came down to incorrectly using Aho-Corasick's "standard" semantics, which doesn't necessarily report leftmost matches. Since the regex crate is really all about leftmost matching, this can lead to skipping over parts of the haystack and thus lead to missing matches. Fixes #1070
1 parent 7536e05 commit de03399

File tree

4 files changed

+44
-13
lines changed

4 files changed

+44
-13
lines changed

regex-automata/src/util/prefilter/aho_corasick.rs

+11-2
Original file line numberDiff line numberDiff line change
@@ -22,11 +22,20 @@ impl AhoCorasick {
2222
}
2323
#[cfg(feature = "perf-literal-multisubstring")]
2424
{
25+
// We used to use `aho_corasick::MatchKind::Standard` here when
26+
// `kind` was `MatchKind::All`, but this is not correct. The
27+
// "standard" Aho-Corasick match semantics are to report a match
28+
// immediately as soon as it is seen, but `All` isn't like that.
29+
// In particular, with "standard" semantics, given the needles
30+
// "abc" and "b" and the haystack "abc," it would report a match
31+
// at offset 1 before a match at offset 0. This is never what we
32+
// want in the context of the regex engine, regardless of whether
33+
// we have leftmost-first or 'all' semantics. Namely, we always
34+
// want the leftmost match.
2535
let ac_match_kind = match kind {
26-
MatchKind::LeftmostFirst => {
36+
MatchKind::LeftmostFirst | MatchKind::All => {
2737
aho_corasick::MatchKind::LeftmostFirst
2838
}
29-
MatchKind::All => aho_corasick::MatchKind::Standard,
3039
};
3140
// This is kind of just an arbitrary number, but basically, if we
3241
// have a small enough set of literals, then we try to use the VERY

regex-automata/src/util/prefilter/mod.rs

-9
Original file line numberDiff line numberDiff line change
@@ -195,15 +195,6 @@ impl Prefilter {
195195
/// Some(Span::from(6..9)),
196196
/// pre.find(hay.as_bytes(), Span::from(0..hay.len())),
197197
/// );
198-
/// // Now we put 'samwise' back before 'sam', but change the match
199-
/// // semantics to 'All'. In this case, there is no preference
200-
/// // order semantics and the first match detected is returned.
201-
/// let pre = Prefilter::new(MatchKind::All, &["samwise", "sam"])
202-
/// .expect("a prefilter");
203-
/// assert_eq!(
204-
/// Some(Span::from(6..9)),
205-
/// pre.find(hay.as_bytes(), Span::from(0..hay.len())),
206-
/// );
207198
///
208199
/// # Ok::<(), Box<dyn std::error::Error>>(())
209200
/// ```

regex-automata/src/util/prefilter/teddy.rs

+7-2
Original file line numberDiff line numberDiff line change
@@ -50,12 +50,17 @@ impl Teddy {
5050
// theory we could at least support leftmost-longest, as the
5151
// aho-corasick crate does, but regex-automata doesn't know about
5252
// leftmost-longest currently.
53+
//
54+
// And like the aho-corasick prefilter, if we're using `All`
55+
// semantics, then we can still use leftmost semantics for a
56+
// prefilter. (This might be a suspicious choice for the literal
57+
// engine, which uses a prefilter as a regex engine directly, but
58+
// that only happens when using leftmost-first semantics.)
5359
let (packed_match_kind, ac_match_kind) = match kind {
54-
MatchKind::LeftmostFirst => (
60+
MatchKind::LeftmostFirst | MatchKind::All => (
5561
aho_corasick::packed::MatchKind::LeftmostFirst,
5662
aho_corasick::MatchKind::LeftmostFirst,
5763
),
58-
_ => return None,
5964
};
6065
let minimum_len =
6166
needles.iter().map(|n| n.as_ref().len()).min().unwrap_or(0);

testdata/regression.toml

+26
Original file line numberDiff line numberDiff line change
@@ -756,3 +756,29 @@ name = "reverse-inner-short"
756756
regex = '(?:([0-9][0-9][0-9]):)?([0-9][0-9]):([0-9][0-9])'
757757
haystack = '102:12:39'
758758
matches = [[[0, 9], [0, 3], [4, 6], [7, 9]]]
759+
760+
# This regression test was found via the RegexSet APIs. It triggered a
761+
# particular code path where a regex was compiled with 'All' match semantics
762+
# (to support overlapping search), but got funneled down into a standard
763+
# leftmost search when calling 'is_match'. This is fine on its own, but the
764+
# leftmost search will use a prefilter and that's where this went awry.
765+
#
766+
# Namely, since 'All' semantics were used, the aho-corasick prefilter was
767+
# incorrectly compiled with 'Standard' semantics. This was wrong because
768+
# 'Standard' immediately attempts to report a match at every position, even if
769+
# that would mean reporting a match past the leftmost match before reporting
770+
# the leftmost match. This breaks the prefilter contract of never having false
771+
# negatives and leads overall to the engine not finding a match.
772+
#
773+
# See: https://github.com/rust-lang/regex/issues/1070
774+
[[test]]
775+
name = "prefilter-with-aho-corasick-standard-semantics"
776+
regex = '(?m)^ *v [0-9]'
777+
haystack = 'v 0'
778+
matches = [
779+
{ id = 0, spans = [[0, 3]] },
780+
]
781+
match-kind = "all"
782+
search-kind = "overlapping"
783+
unicode = true
784+
utf8 = true

0 commit comments

Comments
 (0)