Add HTTP protocol version to HTTP request message #34

Open
sebastian-nagel opened this issue Dec 12, 2019 · 1 comment

Collaborator

sebastian-nagel commented Dec 12, 2019

The request records in the CC-NEWS WARC files lack the HTTP protocol version:

GET /path 

instead of

GET /path HTTP/1.1

This causes some WARC parsers to fail when processing the files; see https://groups.google.com/d/msg/common-crawl/hsb90GHq6to/Lv-9-nHAAQAJ.
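
For anyone who needs to read the affected files in the meantime, a tolerant request-line parser might look like the sketch below. It is only an illustration, not code from any particular WARC library: the names `RequestLine` and `parse_request_line` are hypothetical, and it assumes the request line is otherwise well formed (method and target separated by whitespace).

```python
from typing import NamedTuple, Optional


class RequestLine(NamedTuple):
    """Hypothetical container for the parts of an HTTP request line."""
    method: str
    target: str
    version: Optional[str]  # None when the record omits the protocol version


def parse_request_line(line: str) -> RequestLine:
    """Parse an HTTP request line, tolerating a missing protocol version."""
    parts = line.strip().split()
    # Well-formed case: "GET /path HTTP/1.1"
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        return RequestLine(parts[0], parts[1], parts[2])
    # Affected case: "GET /path " with no version token
    if len(parts) == 2:
        return RequestLine(parts[0], parts[1], None)
    raise ValueError(f"malformed request line: {line!r}")
```

A caller can then decide what to do when `version` is `None`, for example skip the record or fall back to a default, rather than failing on the whole file.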

@sebastian-nagel
Collaborator Author

The fix in StormCrawler (apache/incubator-stormcrawler#775) has been deployed to production. WARC files now contain the HTTP version in the request message.
