A Portable Crawling Framework (PCF) for running Apache Nutch 1.x on TACC Wrangler, an NSF-funded supercomputer.
PCF started as part of another project, "Crawl Evaluation", in which we evaluated Apache Nutch v1.12 on Wrangler in both Hadoop and local mode, pushing the crawler to its limits to get the best throughput. It also covers several more challenging tasks: broad crawling, focused crawling, intelligent crawling, domain discovery, and more.
PCF provides an automated, portable crawling workspace for Wrangler. It is now also integrated with Apache Kafka. More details can be found in the respective README files.