The Crawling Search Party at Algolia
The challenges of crawling the web — and how to overcome them - Samuel Bodin, Algolia
- Use an adblocker when rendering crawled sites, to save resources and time.
- Be careful with query parameters: to avoid rendering multiple combinations of the same page, rewrite URLs so the parameters are in a canonical order.
- Put a hard limit on path depth/concatenation so you don't blow up your crawler.
- Trying to run your code in a secured JavaScript engine results in a huge performance trade-off (~10x slowdown).
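The query-parameter point above can be sketched with the standard library: sort the parameters so that URLs differing only in parameter order collapse to one canonical URL (the function name and example URLs here are illustrative, not from the talk).

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize_url(url: str) -> str:
    """Rewrite a URL so its query parameters are in sorted order,
    so permutations of the same parameters map to one canonical form."""
    parts = urlparse(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunparse(parts._replace(query=urlencode(params)))

# Same parameters, different order -> same canonical URL:
a = canonicalize_url("https://example.com/search?page=2&q=shoes")
b = canonicalize_url("https://example.com/search?q=shoes&page=2")
assert a == b
```

Deduplicating on the canonical form keeps the crawler from rendering every parameter permutation of the same page.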
Automatic extraction of structured data from the Web - Karl Leicht, Fabriks
- Scrapy is a nice Python scraper, but it's not well suited to broad crawls across many domains.
- GoOse is a nice tool for extracting the main content from an HTML page.
- When you extract data, don't forget to also save the version of the extractor used, so you can re-scrape pages if you later find a bug in the extractor. Bonus: Microdata is dead.
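The versioning tip above could look like the following minimal sketch: stamp every extracted record with the extractor version, so records produced by a buggy version can be found and re-scraped later (the version string, record shape, and toy extractor are hypothetical, not from the talk).

```python
import time

# Hypothetical version constant, bumped whenever the extraction logic changes.
EXTRACTOR_VERSION = "1.4.2"

def extract(html: str) -> dict:
    """Toy extractor: real code would parse the HTML (e.g. with GoOse)."""
    first_line = html.strip().splitlines()[0] if html.strip() else ""
    # Store the extractor version and timestamp alongside the data,
    # so a later bugfix can target exactly the affected records.
    return {
        "data": {"title": first_line},
        "extractor_version": EXTRACTOR_VERSION,
        "extracted_at": int(time.time()),
    }

doc = extract("My Article Title\nbody text...")
# A re-scrape job can then select all records where
# extractor_version is below the fixed version.
```

A re-scrape then becomes a simple query over stored records filtered by `extractor_version`, instead of crawling everything again.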
Writing a distributed crawler architecture - Nenad Tičarići, TNT Studio
- Presentation of their open source crawler.
So far, crawling is still a difficult exercise to pull off when trying to make it work across multiple domains. The semantic web is still a faraway dream :p .