Improving uri_regex #168

Open
guidovranken opened this issue Feb 9, 2015 · 4 comments


@guidovranken
Contributor

The current uri_regex defined in validation.py is not precise enough to distinguish legitimate, specification-compliant URIs from strings that cannot possibly be used as URIs.

Currently, the following regex is used:

uri_regex = RegexTuple("(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", "URI Format")

Parts of this regex between brackets, such as

[^:/?#]

match any character except the colon, the slash, the question mark and the number sign (#). Consequently, invalid characters, including non-printable "garbage" characters, are accepted as valid by the above regex:

>>> import re
>>> r = r"(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
>>> re.match(r, "https://taxii.mitre.org").group(0)
'https://taxii.mitre.org'
>>> re.match(r, chr(0x0A) * 4 + "://taxii.mitre.org").group(0)
'\n\n\n\n://taxii.mitre.org'

(The first re.match demonstrates the correct validation of https://taxii.mitre.org; the second demonstrates the incorrect validation of a string that starts with newline characters.)

Given sufficiently diverse input data, I reckon this could cause problems in the long run if strings that are not real URIs are parsed and "green-lighted" by libtaxii.
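
To illustrate the concern: every group in the current pattern is optional, so re.match succeeds on any input string and can never reject anything. Below is a minimal sketch of a stricter check. It assumes Python 3.4+ (for re.fullmatch) and assumes that only absolute URIs, i.e. ones with a scheme, should pass. The name is_plausible_uri and the two constants are hypothetical and not part of libtaxii.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, section 3.1)
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# Characters RFC 3986 permits in a URI: unreserved, gen-delims, sub-delims,
# plus "%" for percent-encoding. Control characters such as "\n" are excluded.
_ALLOWED = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def is_plausible_uri(candidate):
    """Hypothetical helper: reject strings that cannot be an absolute URI."""
    # The whole string must consist of permitted characters only.
    if not _ALLOWED.fullmatch(candidate):
        return False
    # Everything before the first ":" must be a well-formed scheme.
    scheme, sep, _rest = candidate.partition(":")
    return bool(sep) and _SCHEME.fullmatch(scheme) is not None


print(is_plausible_uri("https://taxii.mitre.org"))
print(is_plausible_uri(chr(0x0A) * 4 + "://taxii.mitre.org"))
```

This only filters on the character set and the scheme. It does not check that each "%" is followed by two hex digits, nor does it apply the per-component rules of RFC 3986, so it is a coarse filter rather than a full validator.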

@MarkDavidson
Contributor

@guidovranken, thanks for reporting this. I think improving the regex is a good thing to do.

This regex is from the URI specification [1] - Appendix B. While libtaxii uses the regex in a manner slightly different than the way the RFC uses it (libtaxii uses it for validation, the RFC uses it for capture groups / parsing), it was really used because it was a low-effort quick win.

Thank you.
-Mark

[1] https://tools.ietf.org/html/rfc3986
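
For context on the parsing-versus-validation distinction, here is a small sketch of how the Appendix B pattern is meant to be used: as a splitter whose capture groups 2, 4, 5, 7 and 9 hold the scheme, authority, path, query and fragment. The helper name split_uri is hypothetical; the sample URI is the one used in the RFC.

```python
import re

# Appendix B pattern from RFC 3986, anchored at the start as in the RFC.
RFC3986_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Hypothetical helper: split a URI reference into its five components."""
    # The pattern can match the empty string, so match() never returns None.
    m = RFC3986_APPENDIX_B.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),      # None when the URI has no "?" part
        "fragment": m.group(9),   # None when the URI has no "#" part
    }


parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts)
```

Because the pattern always matches, a successful match says nothing about whether the input is a valid URI; any validation has to be applied to the extracted components afterwards.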

@guidovranken
Contributor Author

Thanks. Matching a URI perfectly is more complex than it initially seems: see here for an attempt at constructing a sufficiently complete URI-matching regex. That particular regex might be usable for libtaxii, although beware that it might use a syntax different from the one Python's re module accepts. I haven't really tried it yet.

@bworrell
Contributor

@MarkDavidson, @guidovranken,

We actually use that Daring Fireball regex in the email-to-cybox tool here.

It seems to work fine for what we intended: just grabbing URLs out of email message bodies.

@guidovranken
Contributor Author

Great, then this seems to be resolved.
