karanjeets/PCF-Nutch-on-Wrangler

PCF - Nutch on Wrangler

A Portable Crawling Framework (PCF) for running Apache Nutch 1.x on TACC Wrangler, a supercomputer funded by the NSF.

This work began as part of another project, "Crawl Evaluation", in which we evaluated Apache Nutch v1.12 on Wrangler in both Hadoop and local mode, pushing the crawler to its limits to achieve the best throughput. It also covers some of the more challenging crawling tasks: broad crawling, focused crawling, intelligent crawling, domain discovery, and more.

PCF provides a crawling workspace for Wrangler that is both automated and portable. It is now also integrated with Apache Kafka. More details can be found in the respective README files.

About

A repository for Nutch crawl evaluation
