The Crawling Search Party at Algolia

tl;dr

The challenges of crawling the web — and how to overcome them - Samuel Bodin, Algolia

  • Use an adblocker when rendering crawled sites, to save resources and time.
  • Be careful with query params: to avoid rendering multiple combinations of the same page, rewrite URLs so their params are in a canonical order.
  • Put a hard limit on path concatenation so it doesn't blow up your crawler.
  • Running your code in a secured JavaScript engine comes with a huge performance trade-off (about 10x).
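The query-param point above can be sketched with a small helper (the function name is illustrative, not from the talk): sorting a URL's query parameters gives a canonical form, so URLs that differ only in param order are crawled once.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Rewrite a URL so its query params appear in a stable, sorted order."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # Rebuild the URL with ordered params; drop the fragment, which a
    # crawler usually ignores anyway.
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), ""))

# Both variants collapse to the same canonical URL:
a = normalize_url("https://example.com/search?page=2&q=shoes")
b = normalize_url("https://example.com/search?q=shoes&page=2")
assert a == b
```

Deduplicating on the normalized form keeps the frontier from exploding combinatorially when sites emit the same page under many param orderings.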

Automatic extraction of structured data from the Web - Karl Leicht, Fabriks

  • Scrapy is a nice Python scraper, but it's not well suited to broad crawls over many domains.
  • GoOse is a nice tool for extracting the main content from an HTML page.
  • When you extract data, don't forget to also save the version of the extractor used, so you can re-scrape pages if you later find a bug in the extractor. Also: Microdata is dead.
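The extractor-versioning advice can be sketched as follows; the names (`EXTRACTOR_VERSION`, `extract_title`, `make_record`) are hypothetical, and a naive regex stands in for a real extractor like GoOse.

```python
import re
from typing import Optional

# Hypothetical version tag, bumped whenever the extraction logic changes.
EXTRACTOR_VERSION = "1.2.0"

def extract_title(html: str) -> Optional[str]:
    """Toy stand-in for a real content extractor."""
    match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return match.group(1).strip() if match else None

def make_record(url: str, html: str) -> dict:
    """Store extracted fields together with the extractor version, so a
    later bug fix can target exactly the records produced by old versions."""
    return {
        "url": url,
        "title": extract_title(html),
        "extractor_version": EXTRACTOR_VERSION,
    }

record = make_record("https://example.com", "<html><title>Hello</title></html>")
```

With the version stored per record, "re-scrape everything produced by extractor <= 1.1" becomes a simple query instead of a full re-crawl.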

Writing a distributed crawler architecture - Nenad Tičarići, TNT Studio

  • Presentation of their open source crawler.

Afterword

So far, crawling is still a difficult exercise to pull off across multiple domains. The Semantic Web is still a faraway dream :p .