Improving `uri_regex` #168
Comments
@guidovranken, thanks for reporting this. I think improving the regex is a good thing to do. This regex is from the URI specification [1] - Appendix B. While libtaxii uses the regex in a manner slightly different from the way the RFC uses it (libtaxii uses it for validation, the RFC uses it for capture groups / parsing), it was really adopted because it was a low-effort quick win. Thank you.
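To illustrate that distinction, here is a rough sketch (assuming the Appendix B pattern): the RFC uses the capture groups to split a URI into its components, whereas validation only asks whether the pattern matches at all, and this particular pattern matches any string because every component is optional.

```python
import re

# RFC 3986, Appendix B pattern (assumed here purely for illustration).
appendix_b = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# Parsing, as the RFC intends: pull the components out of the capture groups.
m = appendix_b.match("http://taxii.mitre.org/services?x=1#frag")
print(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
# -> http taxii.mitre.org /services x=1 frag

# Validation, as libtaxii uses it, presumably just checks that a match
# exists, which is true for any input at all.
print(appendix_b.match("definitely not a uri") is not None)  # True
```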
Thanks. Perfectly matching a URI is more complex than it initially seems: see here for an attempt at constructing a sufficiently complete URI-matching regex. That particular regex might be usable for libtaxii, although beware that it might use a different syntax than Python's.
We actually use that Daring Fireball regex in the … It seems to work fine for what we intended: just grabbing URLs out of email message bodies.
Great, then this seems to be resolved.
The current `uri_regex`, as defined in `validation.py`, is not sufficiently precise to differentiate between legitimate, specification-compliant URIs and strings that cannot possibly be used as URIs. Currently, the following regex is used:
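Assuming the pattern in `validation.py` is the one from RFC 3986, Appendix B (as noted in the comments above), it looks roughly like this:

```python
import re

# RFC 3986, Appendix B pattern, presumably what uri_regex is based on.
# Every component is optional, so the pattern matches *any* string.
uri_regex = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
```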
Parts of this regex between brackets, such as `[^:/?#]+`, imply that every character may be matched except the colon, the slash, the question mark and the number sign (#). Consequently, invalid characters, including non-printable "garbage" characters, will be considered valid by the above regex:
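For example (a minimal sketch, reusing the assumed pattern above):

```python
import re

uri_regex = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# Correctly matches a well-formed URI.
print(re.match(uri_regex, "http://taxii.mitre.org"))

# Also matches a string containing newline characters, which is not a
# usable URI: the pattern rejects nothing.
print(re.match(uri_regex, "ht\ntp://taxii\n.mitre.org"))
```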
(The first `re.match` demonstrates the correct validation of `http://taxii.mitre.org`; the second `re.match` demonstrates the incorrect validation of a string that contains newline characters.) Given a sufficiently diverse set of input data, I reckon this could cause problems in the long run if strings that are not real URIs are parsed and "green-lighted" by libtaxii.
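One possible direction, sketched only as a hypothetical tightening (not libtaxii code): require a scheme and an authority, and reject whitespace and control characters outright.

```python
import re

# Hypothetical stricter pattern: demand "scheme://host..." and refuse any
# whitespace, so garbage like the newline example above no longer passes.
strict_uri_regex = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://[^\s/?#]+[^\s]*$")

print(bool(strict_uri_regex.match("http://taxii.mitre.org")))       # True
print(bool(strict_uri_regex.match("ht\ntp://taxii\n.mitre.org")))   # False
```

This is still far from full RFC 3986 validation, but it shows the kind of constraint the current pattern is missing.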