Improving uri_regex #168

guidovranken · 2015-02-09T22:14:57Z

The current uri_regex defined in validation.py is not precise enough to distinguish legitimate, specification-compliant URIs from strings that cannot possibly be used as URIs.

Currently, the following regex is used:

uri_regex = RegexTuple("(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", "URI Format")

Parts of this regex between brackets, such as

[^:/?#]

match any character except the colon, the slash, the question mark and the number sign (#). Consequently, strings containing invalid characters, including non-printable "garbage" characters, are considered valid by the above regex:

>>> import re
>>> r = "(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
>>> re.match(r, "https://taxii.mitre.org").group(0)
'https://taxii.mitre.org'
>>> re.match(r, chr(0x0A) * 4 + "://taxii.mitre.org").group(0)
'\n\n\n\n://taxii.mitre.org'

(The first re.match demonstrates the correct validation of https://taxii.mitre.org; the second demonstrates the incorrect validation of a string that contains newline characters.)

Given sufficiently diverse input data, I reckon that this might cause problems in the long run if strings that are not real URIs are parsed and "green-lighted" by libtaxii.
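To illustrate what a stricter check could look like, here is a minimal sketch. It is not what libtaxii does: the names SCHEME, ALLOWED, STRICTER_URI and looks_like_uri are made up for this example. It assumes that only absolute URIs (with a scheme) should pass, and it checks only the character set that RFC 3986 permits, not the full grammar (for instance, it does not verify that every "%" is followed by two hex digits).

```python
import re

# RFC 3986, section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = r"[A-Za-z][A-Za-z0-9+.\-]*"

# Characters RFC 3986 allows in a URI: unreserved, gen-delims, sub-delims,
# plus "%" as the percent-encoding introducer.
ALLOWED = r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]"

# \A and \Z anchor the match to the whole string, so a trailing newline
# or any other stray character causes a rejection.
STRICTER_URI = re.compile(r"\A" + SCHEME + ":" + ALLOWED + r"*\Z")


def looks_like_uri(value):
    """Hypothetical helper: True if value has a scheme and only URI characters."""
    return STRICTER_URI.match(value) is not None


if __name__ == "__main__":
    print(looks_like_uri("https://taxii.mitre.org"))
    print(looks_like_uri(chr(0x0A) * 4 + "://taxii.mitre.org"))
```

With this sketch the newline-prefixed string from the example above is rejected, because a scheme must start with a letter and newline is not in the allowed character set.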


MarkDavidson · 2015-02-10T13:02:15Z

@guidovranken, thanks for reporting this. I think improving the regex is a good thing to do.

This regex is from the URI specification [1] - Appendix B. While libtaxii uses the regex in a manner slightly different than the way the RFC uses it (libtaxii uses it for validation, the RFC uses it for capture groups / parsing), it was really used because it was a low-effort quick win.
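For reference, the parsing use that the RFC intends could be sketched as follows. Appendix B maps the capture groups to components: group 2 is the scheme, 4 the authority, 5 the path, 7 the query and 9 the fragment. The function name split_uri is invented for this example and is not part of libtaxii.

```python
import re

# The regex from RFC 3986, Appendix B, unchanged.
RFC3986_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Hypothetical helper: split a URI reference into its five components."""
    # Every part of the pattern is optional, so match() succeeds for any
    # input string. That is fine for parsing but useless for validation.
    m = RFC3986_APPENDIX_B.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


if __name__ == "__main__":
    # Example URI taken from RFC 3986, Appendix B.
    print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

Because the pattern never fails to match, a successful match says nothing about whether the input is a well-formed URI; it only tells you how the string would be split.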

Thank you.
-Mark

[1] https://tools.ietf.org/html/rfc3986

guidovranken · 2015-02-10T21:18:51Z

Thanks. Perfectly matching a URI is more complex than it initially seems: see here for an attempt at constructing a sufficiently complete URI-matching regex. That particular regex might be usable for libtaxii, but beware that it might use a different syntax than Python's re module does; I haven't really tried it yet.

bworrell · 2015-02-11T14:50:21Z

@MarkDavidson, @guidovranken,

We actually use that Daring Fireball regex in the email-to-cybox tool here.

It seems to work fine for what we intended: just grabbing URLs out of email message bodies.

guidovranken · 2015-02-11T21:08:35Z

Great, then this seems to be resolved.
