The Crawling Search Party at Algolia
The challenges of crawling the web — and how to overcome them - Samuel Bodin, Algolia
- Use an adblocker when rendering crawled sites, to save resources and time.
- Be careful with query parameters: to avoid rendering multiple combinations of the same page, rewrite URLs so the parameters are in a canonical order.
- Put a hard limit on path depth/concatenation so you don't blow up your crawler.
- Trying to run your code in a secured JavaScript engine results in a huge performance trade-off (~10x slowdown).
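The query-parameter point above can be sketched with the standard library: sort the parameters so that URLs differing only in parameter order collapse to one canonical URL (the function name and example URLs here are illustrative, not from the talk).

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize_url(url: str) -> str:
    """Rewrite a URL so its query parameters are in sorted order,
    so permutations of the same parameters map to one canonical form."""
    parts = urlparse(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunparse(parts._replace(query=urlencode(params)))

# Same parameters, different order -> same canonical URL:
a = canonicalize_url("https://example.com/search?page=2&q=shoes")
b = canonicalize_url("https://example.com/search?q=shoes&page=2")
assert a == b
```

Deduplicating on the canonical form keeps the crawler from rendering every parameter permutation of the same page.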
Automatic extraction of structured data from the Web - Karl Leicht, Fabriks
- Scrapy is a nice Python scraper, but it's not well suited to broad crawls across many domains.
- GoOse is a nice tool for extracting the main content from an HTML page.
- When you extract data, don't forget to also save the version of the extractor used, so you can re-scrape pages if you later find a bug in the extractor. Bonus: Microdata is dead.
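The versioning tip above could look like the following minimal sketch: stamp every extracted record with the extractor version, so records produced by a buggy version can be found and re-scraped later (the version string, record shape, and toy extractor are hypothetical, not from the talk).

```python
import time

# Hypothetical version constant, bumped whenever the extraction logic changes.
EXTRACTOR_VERSION = "1.4.2"

def extract(html: str) -> dict:
    """Toy extractor: real code would parse the HTML (e.g. with GoOse)."""
    first_line = html.strip().splitlines()[0] if html.strip() else ""
    # Store the extractor version and timestamp alongside the data,
    # so a later bugfix can target exactly the affected records.
    return {
        "data": {"title": first_line},
        "extractor_version": EXTRACTOR_VERSION,
        "extracted_at": int(time.time()),
    }

doc = extract("My Article Title\nbody text...")
# A re-scrape job can then select all records where
# extractor_version is below the fixed version.
```

A re-scrape then becomes a simple query over stored records filtered by `extractor_version`, instead of crawling everything again.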
Writing a distributed crawler architecture - Nenad Tičarići, TNT Studio
- Presentation of their open source crawler.
So far, crawling is still a difficult exercise to pull off when trying to make it work across multiple domains. The semantic web is still a faraway dream :p .