Resolved ReDoS vulnerability in Corpus Reader #2816
Merged
Hello!
Pull request overview
Background
Wikipedia page for ReDoS.
The ReDoS
The regular expression vulnerable to a ReDoS is compiled here:
nltk/nltk/corpus/reader/comparative_sents.py
Line 48 in 23f4b1c
And only used once, right here:
nltk/nltk/corpus/reader/comparative_sents.py
Line 259 in 23f4b1c
Regex breakdown
In full:
It consists of 4 segments, described here:
1. `\(`: Match the character `(` exactly.
2. `(?!.*\()`: Negative lookahead. Makes sure that `.*\(` cannot be matched. In short, this means that the remainder of the line (after the previous segment) cannot contain another `(`.
3. `(.*)`: Match anything, and capture it as the text to extract.
4. `\)$`: Match the character `)` exactly, and ensure that this is the end of the line.

What does this regex try to do?
It tries to find the information within the brackets at the end of each line, i.e. extract keywords such as `more`, `nicer`, `more`, `better`, `cheaper`. This is what segments 1, 3 and 4 do. Segment 2 ensures that the starting `(` is the right-most `(` in the entire input. This way, when a line contains more than one `(`, only the content of the last bracket pair (e.g. `more`) is extracted.

The ReDoS comes from the combination of the two `.*`s, and likely results from naive backtracking in Python's regular expression engine.

The fix
The new regex looks like so:
It's fairly similar to the previous regex. Segments 1 and 4 are reused, while segment 2 is removed, and segment 3 is modified.
Segment 3 is now:
This segment will now, rather than match anything, match anything except `(`. Because the `$` at the end anchors us at the end of the line, this is still guaranteed to only get the last case of e.g. `... (more)`.

I've quickly tested this new regex with all lines in the Comparative Sentences Dataset, and it finds just as many matches as the old regex. I believe it's identical in output, and only differs in performance.
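As a rough illustration of that equivalence claim, the sketch below compares the two patterns on a few lines. Both patterns are reconstructed from the segment descriptions in this pull request, so their exact spelling in the codebase may differ. The names `OLD_KEYWORD`, `NEW_KEYWORD`, `SAMPLE_LINES` and `compare` are hypothetical, and the sample lines are made up rather than taken from the dataset.

```python
import re

# Assumption: reconstructed from the four segments described in the breakdown.
OLD_KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")
# Assumption: segments 1 and 4 reused, segment 2 dropped, and segment 3
# changed to "anything except an opening bracket".
NEW_KEYWORD = re.compile(r"\(([^(]*)\)$")

# Hypothetical sample lines, not taken from the Comparative Sentences Dataset.
SAMPLE_LINES = [
    "1_camera 2_phone (better)",
    "this one (that) is priced (cheaper)",
    "no brackets at the end (here) though",
    "plain line",
]


def compare(lines):
    """Return (line, old_matches, new_matches) for every line."""
    return [
        (line, OLD_KEYWORD.findall(line), NEW_KEYWORD.findall(line))
        for line in lines
    ]


for line, old, new in compare(SAMPLE_LINES):
    print(f"{line!r}: old={old} new={new}")
```

On these samples both patterns return the same lists: only a bracket pair that closes at the very end of the line is extracted, and lines without one yield no match.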
Tests
To help show that this has actually resolved the issue, I've created a doctest in corpus.doctest. It will:
1. Call `KEYWORD.findall(payload)` 9 times with a malicious payload of 4000 characters. Then, take the mean execution time of these 9 calls. This is dubbed the short mean.
2. Call `KEYWORD.findall(payload)` 9 times with a malicious payload of 40000 (!) characters. Then, take the mean execution time of these 9 calls. This is dubbed the long mean.

Since the long payload is 10 times the size of the short one, a regex that runs in linear time should produce a long mean roughly 10 times as big as the short mean. To play it safe, we ensure that the long mean is at most 30 times as big. A value of 30 seems to work fine.
When the ReDoS was still intact, the long mean would be roughly 80 times as big, which clearly indicates that the regex did not run in linear time.
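The timing check described above could be sketched roughly as follows. This is not the doctest from the pull request: the pattern is an assumed reconstruction of the fixed `KEYWORD`, the payload of repeated `(` characters is a guess at an input that would force the old pattern to rescan the rest of the line at every position, and `mean_findall_time`, `make_payload` and `growth_ratio` are hypothetical helper names.

```python
import re
import time
from statistics import mean

# Assumption: reconstruction of the fixed pattern; the real one is KEYWORD in
# nltk/corpus/reader/comparative_sents.py.
KEYWORD = re.compile(r"\(([^(]*)\)$")


def make_payload(size):
    """Hypothetical malicious payload: many '(' and no closing ')'."""
    return "(" * size


def mean_findall_time(pattern, payload, repeats=9):
    """Mean wall-clock time of pattern.findall(payload) over `repeats` calls."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        pattern.findall(payload)
        times.append(time.perf_counter() - start)
    return mean(times)


def growth_ratio(pattern, short_size=4000, long_size=40000):
    """Ratio of the long mean to the short mean for the given pattern."""
    short_mean = mean_findall_time(pattern, make_payload(short_size))
    long_mean = mean_findall_time(pattern, make_payload(long_size))
    return long_mean / short_mean


if __name__ == "__main__":
    ratio = growth_ratio(KEYWORD)
    print(f"long mean / short mean = {ratio:.1f}")
    # 30 is the safety margin named in the description above.
    print("within the linear-time margin" if ratio <= 30 else "grows faster than linear")
```

Wall-clock ratios are noisy, which is presumably why the threshold sits well above the ideal factor of 10 while staying far below the factor of roughly 80 seen before the fix.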
Thank you for reporting this vulnerability through our team email.