warcio does not preserve HTTP header whitespace #129

Open
JustAnotherArchivist opened this issue May 27, 2021 · 3 comments

@JustAnotherArchivist
Contributor

JustAnotherArchivist commented May 27, 2021

import io

import warcio.warcwriter


# Write an uncompressed WARC into an in-memory buffer so the raw bytes can be inspected.
output = io.BytesIO()
writer = warcio.warcwriter.WARCWriter(output, gzip=False)

# Raw HTTP response; X-custom has two spaces after the colon and a trailing tab.
payload = io.BytesIO(b'HTTP/1.1 200 OK\r\nDate: Thu, 27 May 2021 22:03:54 GMT\r\nContent-Length: 0\r\nX-custom:  header with two spaces before the value and a tab after\t\r\n\r\n')
record = writer.create_warc_record('http://example.org/', 'response', payload=payload)
writer.write_record(record)
print(output.getvalue())

Expected output for the custom header (where \t is a literal tab):

X-custom:  header with two spaces before the value and a tab after\t

Actual output (only one space between the colon and the value, and the tab after the header is lost):

X-custom: header with two spaces before the value and a tab after

@ikreymer
Member

This is something of an edge case. Whitespace at the start of a line was at one point used to indicate multi-line headers (which are now deprecated, though warcio still supports them). I'm not sure that the whitespace is significant anymore from a parsing perspective.
Similar to #128, perhaps there could be a 'raw' mode flag that preserves the whitespace here, if desired, when capturing HTTP traffic.
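
To make the 'raw' mode idea concrete, here is a minimal sketch of one way it could work: keep the exact bytes of each header line next to the parsed name and value, and write those bytes back when the flag is set. This is not warcio's API; `RawHeader`, `parse_headers_raw`, `serialize_headers`, and the `raw` parameter are hypothetical names used only for illustration, and continuation lines are deliberately not handled.

```python
from typing import List, NamedTuple


class RawHeader(NamedTuple):
    """One header line: a parsed view plus the exact bytes received (hypothetical type)."""
    name: str
    value: str   # whitespace-stripped value, convenient for lookups
    raw: bytes   # original line, without the trailing CRLF


def parse_headers_raw(block: bytes) -> List[RawHeader]:
    """Split a header block (status line already removed) into RawHeader items.

    Hypothetical helper, not part of warcio. Continuation lines are not handled.
    """
    headers = []
    for line in block.split(b'\r\n'):
        if not line:
            break  # an empty line ends the header block
        name, _, value = line.partition(b':')
        headers.append(RawHeader(
            name.decode('latin-1'),
            value.strip(b' \t').decode('latin-1'),
            line,
        ))
    return headers


def serialize_headers(headers: List[RawHeader], raw: bool = True) -> bytes:
    """Write headers byte-for-byte if raw is set, in normalized form otherwise."""
    lines = [
        h.raw if raw else f'{h.name}: {h.value}'.encode('latin-1')
        for h in headers
    ]
    return b'\r\n'.join(lines) + b'\r\n\r\n'


block = (b'Content-Length: 0\r\n'
         b'X-custom:  header with two spaces before the value and a tab after\t\r\n'
         b'\r\n')
headers = parse_headers_raw(block)
round_trip_raw = serialize_headers(headers, raw=True)
round_trip_normalized = serialize_headers(headers, raw=False)
```

With `raw=True` the block round-trips unchanged, while `raw=False` reproduces the normalization reported in this issue (one space after the colon, trailing tab dropped). Parsed lookups still work in both modes because the stripped `value` is stored separately from the original line.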

@ikreymer
Member

FWIW, I've never seen an HTTP server that returns a header like this, so (I hope) it's not very common :)

@JustAnotherArchivist
Contributor Author

The whitespace on the line with the field-name has never been significant semantically as far as I know. Neither the whitespace after the colon nor the whitespace at the end of the line is part of the actual field value content. And even with continuation lines: the optional whitespace at the end of a line, the CRLF, and the leading space/tab on the continuation line are together equivalent to a single space.
But yeah, same as #128, this is about correctly preserving the data sent by the server, not the semantic meaning. I've suggested a possible solution there because they are indeed very similar and have essentially the same root cause.
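
The semantic equivalence described above can be sketched as a small normalization function: whitespace around the value is dropped, and each line fold (optional whitespace, CRLF, leading space/tab) collapses to one space. The name `semantic_value` and the regular expression are illustrative assumptions, not something warcio provides.

```python
import re

# A line fold as described above: optional whitespace, CRLF,
# then at least one space or tab on the continuation line.
_FOLD = re.compile(rb'[ \t]*\r\n[ \t]+')


def semantic_value(raw_line: bytes) -> bytes:
    """Return the field value of a raw header line (possibly folded).

    Hypothetical helper: folds collapse to a single space, and whitespace
    before and after the value is removed.
    """
    _, _, value = raw_line.partition(b':')
    value = _FOLD.sub(b' ', value)
    return value.strip(b' \t')


sent = b'X-custom:  header with two spaces before the value and a tab after\t'
stored = b'X-custom: header with two spaces before the value and a tab after'
folded = b'X-folded: first \r\n\t second'

same_meaning = semantic_value(sent) == semantic_value(stored)
same_bytes = sent == stored
```

Here `same_meaning` is true while `same_bytes` is false, which is the distinction this issue is about: the two lines carry the same field value, but only one of them is what the server actually sent.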

Yeah, it is fortunately not very common, but I have seen it before, sadly enough. There are a lot of weird HTTP servers out there that operate at the edges of or beyond the specifications...
