StartTag: invalid element name #170

Closed
c-x opened this issue Feb 10, 2015 · 13 comments

@c-x

c-x commented Feb 10, 2015

Hello,

I have the following error with libtaxii-1.1.104 (when receiving content from Soltra).

StartTag: invalid element name, line 555, column 2

This error is caught when libtaxii processes an addinfourl object like the following.

<addinfourl at 140659822923856 whose fp = <socket._fileobject object at 0x7fedead42ed0>>

My piece of code that fails is the following:

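# Imports assumed from the aliases used below (not shown in the original post)
import libtaxii as t
import libtaxii.messages_11 as tm

try: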
        # the connection to the TAXII server works correctly
        taxii_client = get_taxii_connection(taxii_login, taxii_password, taxii_cert_pem, taxii_cert_key, taxii_use_https)

        poll_req = tm.PollRequest(
                message_id = tm.generate_message_id(), 
                collection_name = taxii_feed_id,
                exclusive_begin_timestamp_label = timestamp_label,
                poll_parameters = tm.PollRequest.PollParameters()
        )

        resp = taxii_client.callTaxiiService2(taxii_host, taxii_path, t.VID_TAXII_XML_11, poll_req.to_xml(pretty_print=True), taxii_port)

        # When entering this method, the above error about StartTag is thrown.
        taxii_message = t.get_message_from_http_response(resp, poll_req.message_id)

        # so this never happens
        print "Bonjour :)"
except Exception, e:
        raise e

Have you seen this error before? Is this a known issue? Do you have any idea what's going on?

Thanks.

@gtback
Contributor

gtback commented Feb 10, 2015

@c-x, do you have a Python traceback showing where in callTaxiiService2() the error is coming from? I didn't notice anything obvious from a quick skim through the code, so I can't tell if it's an issue with the Soltra content or with processing the network traffic.

Also, can you confirm that the error still occurs in 1.1.105, and/or on the master branch?

@c-x
Author

c-x commented Feb 10, 2015

@gtback I don't have access to Soltra myself, so it's really hard to isolate the bug (and it takes a lot of time, because I have to send patched code to customers and wait for their feedback).

I don't have a traceback.
I'll ask people to try with 1.1.105, but according to the release notes and the commits on GitHub, I don't think this will make any difference.

@gtback
Contributor

gtback commented Feb 10, 2015

It would be easiest to figure this out if we had a traceback (so we know what line is actually generating the error), and ideally the XML response that is generating the error. It looks a lot like an error that would come from malformed XML, especially if it's occurring in callTaxiiService2. The actual TAXII message isn't parsed until get_message_from_http_response.
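
When the failing code runs on someone else's machine, one way to get that traceback back is to print it in the except block before re-raising. The following is only a minimal sketch using the standard library's traceback module; call_and_report and failing_step are hypothetical names for illustration and are not part of libtaxii.

```python
import traceback


def call_and_report(func, *args, **kwargs):
    """Run func; if it raises, print the full traceback and re-raise."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # format_exc() returns the same text the interpreter would print,
        # including the file, line number, and function of every frame.
        print(traceback.format_exc())
        raise  # a bare raise keeps the original traceback intact


def failing_step():
    # Hypothetical stand-in for the call that fails in the field.
    raise ValueError("StartTag: invalid element name, line 555, column 2")
```

Wrapping the suspect call, for example call_and_report(failing_step), prints every frame leading to the error, which is the information needed to tell which line actually raised.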

@gtback
Contributor

gtback commented Feb 10, 2015

If callTaxiiService2 succeeds, try running print resp.read() directly after that to get the response content.

@c-x
Author

c-x commented Feb 10, 2015

I agree that malformed XML content is the probable cause, but I have no evidence so far. I was hoping you had access to Soltra to reproduce the problem.
FYI, I was unable to reproduce the problem with hailataxii feeds.

@MarkDavidson
Contributor

@c-x,

Looking at what you've described so far, I'm having a bit of trouble figuring out how callTaxiiService2 would be where the code is failing.

First, you mention StartTag: invalid element name, line 555, column 2, which has a few hints:

  • The invalid element name is an XML parse error, and the code you show does not attempt to parse XML until the line that reads taxii_message = t.get_message_from_http_response(resp, poll_req.message_id)
  • The error location is line 555, column 2, which seems to indicate that the error is in a Poll Response. TAXII Poll Requests like the one the above code creates are only a few lines (maybe 10-15 if you pretty print it). If the error is in fact in a Poll Response, that means you are getting a response from the server.

Second, you mention that the error occurs when libtaxii is processing an addinfourl object. addinfourl objects are returned by callTaxiiService2 (specifically libtaxii's underlying use of urllib2.urlopen), which leads me to think that the callTaxiiService2 returned successfully. If that function failed, I do not have an explanation as to how you'd have a handle on the return object.

The t.get_message_from_http_response(resp, poll_req.message_id) line parses XML from the HTTP Response (encapsulated in an addinfourl object) [1] [2].

I can replicate the error you are getting with the following code (note the extraneous begin bracket <):

from lxml import etree
etree.XML('<hello><</hello>')
# Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "lxml.etree.pyx", line 3072, in lxml.etree.XML (src\lxml\lxml.etree.c:70460)
#  File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:106689)
#  File "parser.pxi", line 1716, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:105478)
#  File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:100105)
#  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94543)
#  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:96003)
#  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:95050)
# lxml.etree.XMLSyntaxError: StartTag: invalid element name, line 1, column 9

Based on that, my bet is that the STIX content in the Poll Response has an extra begin bracket (therefore being invalid XML) and that's what's blowing up.
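
Once the raw response has been captured, the line and column in the error message can be used to show the offending line. This is a rough sketch only: it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies, so the error text and column numbering will differ from the lxml message above. describe_parse_error is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def describe_parse_error(xml_text):
    """Return None if xml_text parses, else a short report pointing at the error."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        # position is (line, column); the line is 1-based, the column 0-based.
        line_no, column = err.position
        lines = xml_text.splitlines()
        bad_line = lines[line_no - 1] if 0 < line_no <= len(lines) else ""
        # Error message, the offending line, and a caret under the column.
        return "%s\n%s\n%s^" % (err, bad_line, " " * column)


print(describe_parse_error("<hello><</hello>"))
```

Run against a saved Poll Response, the same function would print the text around line 555 instead of requiring someone to count lines by hand.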

As @gtback mentioned to me offline, libtaxii can do a much better job of handling processing errors. So regardless of this issue's resolution, I'll open an issue for that.

@c-x - I don't mean to disagree too much, but I'm having trouble arriving at your conclusion with the evidence you've provided. I've attempted to explain my thought process so that we can reach a conclusion about what's happening and the best way to fix it. I realize I'm not coming from the debug logs (like you are), but rather an understanding of the code, so I may be missing a key piece of information.

Thank you.
-Mark

[1] https://github.com/TAXIIProject/libtaxii/blob/master/libtaxii/__init__.py#L96
[2] https://github.com/TAXIIProject/libtaxii/blob/master/libtaxii/messages_11.py#L87

@c-x
Author

c-x commented Feb 10, 2015

@MarkDavidson You are absolutely right on where the error is. My bad, my caffeine level was too low when I opened the ticket.

What's missing from the code I posted is the debug prints. In the debug output, and confirming your analysis, the last print displayed comes before the call to taxii_message = t.get_message_from_http_response(resp, poll_req.message_id).

I'll edit the first post.

@MarkDavidson
Contributor

@c-x,

No worries at all! I know how tough it can be to try and debug a system you can't actually interact with (not enough coffee in the world for that problem).

My current theory is bad Content in the Poll Response. There are two ways you could try and capture the XML that's being received: Modify your code or modify libtaxii's code.

If you wanted to modify your code, I'd say put something like print resp.read() before get_message_from_http_response. That will make the subsequent parse always fail (resp.read() consumes the response stream, so a second read returns nothing), but you'll at least be able to trap the output.
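
If losing the response is a problem, one possible workaround is to buffer the body once and replay it for later readers. The sketch below is an illustration under assumptions, not a libtaxii feature: ReplayableResponse is a hypothetical wrapper, and whether get_message_from_http_response accepts such a wrapper in place of the original response object has not been verified here.

```python
import io


class ReplayableResponse(object):
    """Hypothetical helper (not part of libtaxii): buffers a response body."""

    def __init__(self, resp):
        self._resp = resp
        self.body = resp.read()  # the one and only read of the real stream
        self._buffer = io.BytesIO(self.body)

    def read(self, *args):
        # Serve reads from the in-memory copy instead of the drained stream.
        return self._buffer.read(*args)

    def __getattr__(self, name):
        # Anything else (info(), headers, status code, ...) goes to the original.
        return getattr(self._resp, name)


# Stand-in for the object returned by callTaxiiService2.
fake_resp = io.BytesIO(b"<hello><</hello>")
wrapped = ReplayableResponse(fake_resp)
print(wrapped.body)            # dump the raw payload for inspection
parsed_input = wrapped.read()  # a later consumer still sees the full body
```

The design choice is simply to trade memory for repeatability: the whole body is held in memory, which is fine for debugging but may not be for very large poll responses.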

If you wanted to modify libtaxii's code, open messages_11.py for editing (%python_home%\Lib\site-packages\<libtaxii_directory>\messages_11.py) and right after the lines that read

    if isinstance(xml_string, basestring):
        f = StringIO.StringIO(xml_string)
    else:
        f = xml_string

Add a line that reads print f or raise Exception(f) (or something like that).

As a side note, I recall that somebody was asking for logging in libtaxii, and I wasn't 100% clear on the use case for it. My sense is that if libtaxii had a configurable debug log this would have been a much easier fix.

Thank you.
-Mark

PS - I opened this related issue: #171

EDIT: I had incorrectly written __init__.py instead of messages_11.py. Thanks @gtback for noticing.

@c-x
Author

c-x commented Feb 11, 2015

OK Thanks. I'll update this post as soon as I have a real world sample to share.

@c-x
Author

c-x commented Feb 12, 2015

So, I finally got an IOC that causes libtaxii/lxml to throw the error. As suspected, it's due to a malformed IOC like the following. As you can see, the angle brackets are HTML-escaped (&lt; and &gt;), which is odd behavior on the part of the TAXII server. In short, this ticket can be closed, as it's not a python-libtaxii issue :)

<taxii_11:Content_Block xmlns:taxii="http://taxii.mitre.org/messages/taxii_xml_binding-1" xmlns:taxii_11="http://taxii.mitre.org/messages/taxii_xml_binding-1.1" xmlns:tdq="http://taxii.mitre.org/query/taxii_default_query-1"><taxii_11:Content_Binding binding_id="urn:stix.mitre.org:xml:1.1.1"/>
<taxii_11:Content>&lt;stix:STIX_Package 
    xmlns:cyboxCommon="http://cybox.mitre.org/common-2"
    xmlns:cybox="http://cybox.mitre.org/cybox-2"
    xmlns:cyboxVocabs="http://cybox.mitre.org/default_vocabularies-2"
    xmlns:marking="http://data-marking.mitre.org/Marking-1"
    xmlns:tlpMarking="http://data-marking.mitre.org/extensions/MarkingStructure#TLP-1"
    xmlns:edge="http://soltra.com/"
    xmlns:indicator="http://stix.mitre.org/Indicator-2"
    xmlns:stixCommon="http://stix.mitre.org/common-1"
    xmlns:stixVocabs="http://stix.mitre.org/default_vocabularies-1"
    xmlns:stix="http://stix.mitre.org/stix-1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="edge:Package-12345678-1234-1234-1234-123456789abc" version="1.1.1" timestamp="2015-02-11T21:16:31.157529+00:00"&gt;
    &lt;stix:STIX_Header&gt;
        &lt;stix:Handling&gt;
            &lt;marking:Marking&gt;
                &lt;marking:Controlled_Structure&gt;../../../../descendant-or-self::node()&lt;/marking:Controlled_Structure&gt;
                &lt;marking:Marking_Structure xsi:type='tlpMarking:TLPMarkingStructureType' color="GREEN"/&gt;
            &lt;/marking:Marking&gt;
        &lt;/stix:Handling&gt;
    &lt;/stix:STIX_Header&gt;
    &lt;stix:Indicators&gt;
        &lt;stix:Indicator id="POUET:indicator-12345678-1234-1234-1234-123456789abc" xsi:type='indicator:IndicatorType' version="2.0"&gt;
            &lt;indicator:Type&gt;Malicious E-mail&lt;/indicator:Type&gt;
            &lt;indicator:Description&gt;.....description goes here.......&lt;/indicator:Description&gt;
            &lt;indicator:Observable idref="POUET:Observable-12345678-1234-1234-1234-123456789abc"&gt;
            .... other stuff goes here....
            &lt;/indicator:Observable&gt;
        &lt;/stix:Indicator&gt;
    &lt;/stix:Indicators&gt;
&lt;/stix:STIX_Package&gt;
</taxii_11:Content>
<taxii_11:Timestamp_Label>2015-02-11T21:16:31.158100+00:00</taxii_11:Timestamp_Label>
</taxii_11:Content_Block>
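
A quick way to check a captured block for this symptom is to see whether Content holds child elements or only text that looks like markup. The following is a heuristic sketch, not something libtaxii does: it uses the standard library's xml.etree.ElementTree, content_looks_escaped is a hypothetical name, and the two sample blocks are simplified stand-ins for the real content above. Content that is legitimately plain text starting with '<' would be a false positive.

```python
import xml.etree.ElementTree as ET

TAXII_11_NS = "http://taxii.mitre.org/messages/taxii_xml_binding-1.1"


def content_looks_escaped(content_block_xml):
    """Heuristic: True if Content holds markup as text instead of child elements."""
    block = ET.fromstring(content_block_xml)
    content = block.find("{%s}Content" % TAXII_11_NS)
    if content is None:
        return False
    text = (content.text or "").strip()
    # Properly embedded XML shows up as child elements; escaped XML shows up
    # as a text node that starts with '<' once the parser has unescaped it.
    return len(content) == 0 and text.startswith("<")


# Simplified samples: one escaped like the block above, one properly embedded.
escaped_block = (
    '<t:Content_Block xmlns:t="%s">'
    "<t:Content>&lt;example/&gt;</t:Content>"
    "</t:Content_Block>" % TAXII_11_NS
)
embedded_block = (
    '<t:Content_Block xmlns:t="%s">'
    "<t:Content><example/></t:Content>"
    "</t:Content_Block>" % TAXII_11_NS
)
```

Calling content_looks_escaped(escaped_block) and content_looks_escaped(embedded_block) distinguishes the two cases, which can help confirm whether a server is escaping the payload before anyone digs into the client.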

@gtback
Contributor

gtback commented Feb 12, 2015

Yes, it looks like the server is incorrectly escaping the STIX content in the Content_Block. I don't know which web framework Soltra uses, but escaping output by default is pretty common in web frameworks (as a defense against XSS and other content security issues).

I'm going to go ahead and close this. It's possible that we could do something in libtaxii to try to detect and correct these types of issues, but I think it would be a decent amount of effort, and the result likely pretty brittle, just to accept content that doesn't actually conform to the TAXII specs.

Thanks for tracking this down, @c-x !

@gtback gtback closed this as completed Feb 12, 2015
@MarkDavidson
Contributor

As an FYI, the most common cause I've seen for this is attempting to assign the Content's text value to some XML string instead of appending an XML tree. It's kind of an easy thing to get tripped up on, especially when reading XML out of a database.

For instance, using python and lxml:

The "wrong" way:

from lxml import etree
elt = etree.Element('name')
elt.text = '<xml/>'
etree.tostring(elt)
# '<name>&lt;xml/&gt;</name>'

The "right" way:

from lxml import etree
elt = etree.Element('name')
elt.append(etree.XML('<xml/>'))
etree.tostring(elt)
# '<name><xml/></name>'

What's interesting though, at least for libtaxii, is that ContentBlock.from_xml() doesn't fail on the provided input. Perhaps there was an extra bracket ('<' or '>') in the description or observable fields that you redacted? If there is, that's a likely culprit.

-Mark

@jasenj1
Contributor

jasenj1 commented Feb 12, 2015

Yep. The Java-TAXII library has to do this XML string to XML tree dance to properly embed STIX. Initially I was trying to embed the XML string and it got escaped just as shown.

- Jasen.

