
tbpalsulich/nutch-auth-example

Nutch HTTP Client Authentication

This work-in-progress deployment configures Apache Nutch to log into www.mrs.org automatically and crawl it.

Run build.sh to check out the Nutch trunk, build it, and copy the necessary configuration files. Once it finishes, cd into dist to use the newly configured Nutch distribution.

The updated configuration lives in conf/nutch-site.xml and conf/httpclient-auth.xml.
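
In Nutch, httpclient-auth.xml is read by the protocol-httpclient plugin, so authentication only takes effect when that plugin is enabled. The sketch below is a quick sanity check, assuming this repository's nutch-site.xml enables the plugin through its plugin.includes property. The function name check_httpclient_enabled is hypothetical and not part of Nutch or this repository.

```shell
# Hypothetical helper: report whether a nutch-site.xml mentions the
# protocol-httpclient plugin (assumed to be enabled via plugin.includes).
check_httpclient_enabled() {
  site_xml="${1:-conf/nutch-site.xml}"
  if grep -q 'protocol-httpclient' "$site_xml" 2>/dev/null; then
    echo "protocol-httpclient is referenced in $site_xml"
  else
    echo "protocol-httpclient not found in $site_xml" >&2
    return 1
  fi
}
```

From the dist directory, calling check_httpclient_enabled with no argument checks conf/nutch-site.xml.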

urls/seed.txt holds the seed URLs for the crawl. Start a crawl with bin/crawl urls/ CrawlData/ N, where N is the number of fetch rounds.
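
As a minimal sketch of the crawl step, the script below assumes it is run from the dist directory. It creates a seed file only if none exists, using the URL from the parsechecker example as a stand-in (the repository's real seed list may differ), and picks 2 as an arbitrary value for N.

```shell
# Assumption: the current directory is the dist/ directory produced by build.sh.
SEED_DIR="urls"
CRAWL_DIR="CrawlData"
ROUNDS=2   # arbitrary example value for N, the number of fetch rounds

mkdir -p "$SEED_DIR"

# A Nutch seed file lists one URL per line. Keep an existing file untouched;
# the URL written here is only a placeholder.
[ -f "$SEED_DIR/seed.txt" ] || printf '%s\n' "http://mrs.org/home/" > "$SEED_DIR/seed.txt"

if [ -x bin/crawl ]; then
  bin/crawl "$SEED_DIR/" "$CRAWL_DIR/" "$ROUNDS"
else
  echo "bin/crawl not found; run this from the dist directory" >&2
fi
```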

You can try crawling a single page by running bin/nutch parsechecker http://mrs.org/home/.

After running a crawl or using parsechecker, the logs will be in logs/hadoop.log.
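
When a login does not seem to work, scanning the log for errors is usually the first step. The helper below is a small sketch for that. The name scan_log and the search pattern are illustrative choices, not something defined by Nutch or this repository.

```shell
# Hypothetical helper: print numbered log lines that mention an error or
# exception. Defaults to the log path named in this README.
scan_log() {
  log_file="${1:-logs/hadoop.log}"
  if [ ! -f "$log_file" ]; then
    echo "no log at $log_file; run a crawl or parsechecker first" >&2
    return 1
  fi
  # grep exits non-zero when nothing matches, so report that case explicitly.
  grep -n -i -E 'error|exception' "$log_file" || echo "no matches in $log_file"
}
```

Run scan_log from the dist directory after a crawl, or pass a different path as the first argument.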

Make sure to update the credentials in dist/conf/httpclient-auth.xml!
