Implement content sniffing for HTML parsing #808

Merged (2 commits) on Mar 27, 2024

WGH- (Collaborator) commented Mar 25, 2024

Web pages can be served without Content-Type set, in which case browsers employ content sniffing. Do the same here, in Colly.
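
As a rough illustration of what content sniffing looks like in Go, the sketch below uses the standard library's `http.DetectContentType`, which inspects at most the first 512 bytes of a body. The helper name `sniffIsHTML` is hypothetical, and this is not necessarily how the PR implements it.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// sniffIsHTML reports whether body looks like HTML according to
// http.DetectContentType. Hypothetical helper for illustration only;
// it is not taken from Colly's source.
func sniffIsHTML(body []byte) bool {
	return strings.HasPrefix(http.DetectContentType(body), "text/html")
}

func main() {
	// A response served without a Content-Type header: decide from the bytes.
	fmt.Println(sniffIsHTML([]byte("<!DOCTYPE html><html><body>hi</body></html>")))
	fmt.Println(sniffIsHTML([]byte(`{"key": "value"}`)))
}
```

A scraper would presumably fall back to a check like this only when the header is absent, so that an explicit Content-Type still takes precedence.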

While we're at it, change the Content-Type check to something stricter than mere "html" substring match.
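
To show why a substring match is too loose, here is a minimal sketch comparing it with an exact comparison of the media type. The helper `isHTMLMediaType`, the set of accepted types, and the `application/x-nothtml` value are assumptions made for the example, not Colly's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// isHTMLMediaType keeps only the part before the first ";" and compares it
// exactly. Which types count as HTML here is an assumption for this sketch.
func isHTMLMediaType(contentType string) bool {
	mediaType, _, _ := strings.Cut(contentType, ";")
	mediaType = strings.ToLower(strings.TrimSpace(mediaType))
	return mediaType == "text/html" || mediaType == "application/xhtml+xml"
}

func main() {
	for _, ct := range []string{
		"text/html; charset=UTF-8",
		"application/x-nothtml", // hypothetical type that merely contains "html"
	} {
		fmt.Printf("%q substring=%v strict=%v\n",
			ct, strings.Contains(ct, "html"), isHTMLMediaType(ct))
	}
}
```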

WGH- force-pushed the content-sniffing branch from 69cc94a to 40d3e41 on March 25, 2024 21:30
WGH- (Collaborator, Author) commented Mar 25, 2024

Welp, strings.Cut appeared only in Go 1.18. Instead of rewriting it the old way I decided to drop old Go versions (#810).
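
For context, "the old way" would look roughly like the sketch below: `strings.Cut` (Go 1.18+) next to an equivalent built on `strings.Index`. The helper name `cutCompat` is hypothetical and does not appear in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// cutCompat mimics strings.Cut with strings.Index, which is what code
// targeting Go versions before 1.18 would need. Hypothetical helper.
func cutCompat(s, sep string) (before, after string, found bool) {
	if i := strings.Index(s, sep); i >= 0 {
		return s[:i], s[i+len(sep):], true
	}
	return s, "", false
}

func main() {
	fmt.Println(strings.Cut("text/html; charset=utf-8", ";"))
	fmt.Println(cutCompat("text/html; charset=utf-8", ";"))
}
```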

WGH- added 2 commits March 27, 2024 17:57
Instead of looking for "html" substring, actually parse the MIME type
string. Don't use mime.ParseMediaType though as it doesn't handle
invalid duplicate parameters (e.g. "text/html; charset=UTF-8; charset=utf-8")
that occur in the wild.

Web pages can be served without Content-Type set, in which case
browsers employ content sniffing. Do the same here, in Colly.
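
The first commit's remark about `mime.ParseMediaType` can be checked with a short program. This is only a sketch; the helper name `parsesWithStdlib` is hypothetical, and the header values are the ones quoted in the commit message.

```go
package main

import (
	"fmt"
	"mime"
)

// parsesWithStdlib reports whether mime.ParseMediaType accepts the header
// value without returning an error. Hypothetical helper for illustration.
func parsesWithStdlib(contentType string) bool {
	_, _, err := mime.ParseMediaType(contentType)
	return err == nil
}

func main() {
	for _, ct := range []string{
		"text/html; charset=UTF-8",
		"text/html; charset=UTF-8; charset=utf-8", // duplicate parameter
	} {
		_, _, err := mime.ParseMediaType(ct)
		fmt.Printf("%q: ok=%v err=%v\n", ct, parsesWithStdlib(ct), err)
	}
}
```

Because the standard parser rejects the duplicated `charset` parameter, a scraper that relied on it alone would skip pages that browsers render without complaint.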
WGH- force-pushed the content-sniffing branch from 40d3e41 to bad50ff on March 27, 2024 14:57
WGH- marked this pull request as ready for review March 27, 2024 15:02
WGH- requested a review from asciimoo March 27, 2024 15:07
asciimoo merged commit 5224b97 into gocolly:master Mar 27, 2024
9 checks passed