Antriksh Saxena edited this page Sep 27, 2015 · 4 revisions

Installation

  1. Before building nutchpy from source, make sure you have the required prerequisites in place.
  2. Get the source by cloning the repository:
    git clone https://github.com/ContinuumIO/nutchpy.git
  3. Run the setup.py script (superuser permission is required for the install step):
     cd nutchpy
     sudo python setup.py install

Usage

By default, nutchpy ships with two simple, easy-to-understand examples. Its basic usage is as follows:

 import nutchpy

 node_path = "<FULL-PATH-TO-CRAWLED-DATA>/data"
 seq_reader = nutchpy.sequence_reader

 n = 10              # number of rows to read from the head of the file
 start, stop = 5, 15 # row range for slicing

 print(seq_reader.head(n, node_path))            # Prints the first n rows of the file
 print(seq_reader.slice(start, stop, node_path)) # Prints the rows between start and stop
 data = seq_reader.read(node_path)
 print(data)                                     # Prints the whole file content
  • node_path - the path to the crawled data file. On a default Nutch installation it typically looks like nutch/runtime/local/crawl/crawldb/current/part-00000/data

To process the entire dataset and iterate over the URLs, read the content. The content comes back as a list; the sample below runs through all the URLs.
     import nutchpy

     path = 'path-to-nutch/nutch/runtime/local/crawl/crawldb/current/part-00000/data'

     data = nutchpy.sequence_reader.read(path)
     for list_item in data:
          print(list_item[0]) # Prints the url
          print(list_item[1]) # Prints details about the url
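Building on the loop above, the list returned by read() can be processed further, for example to group the crawled URLs by host. The sketch below uses a hypothetical hard-coded list standing in for read()'s return value, since the exact records depend on your crawl:

```python
from urllib.parse import urlparse
from collections import Counter

# Hypothetical stand-in for the list returned by nutchpy.sequence_reader.read():
# each item is a [url, details] pair, as in the loop above.
data = [
    ["https://www.abc.com/xyz", "Version: 7\nStatus: 1 (db_unfetched)"],
    ["https://www.abc.com/pqr", "Version: 7\nStatus: 2 (db_fetched)"],
    ["https://www.example.org/", "Version: 7\nStatus: 1 (db_unfetched)"],
]

# Tally how many crawled URLs each host contributes
hosts = Counter(urlparse(item[0]).netloc for item in data)
for host, count in hosts.most_common():
    print(host, count)  # www.abc.com 2, then www.example.org 1
```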

A sample output for one row of the above code would be as follows:

https://www.abc.com/xyz
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Sep 26 23:52:36 PDT 2015
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0
Signature: null
Metadata: 
 	_repr_=https://www.abc.com/xyz
	_pst_=moved(12), lastModified=0: https://www.cabelas.com/user/billing_address.jsp
	Content-Type=text/html
	_rs_=115

Using the above sample program, one can retrieve all the details of the crawled database: the status of each URL (fetched or not), the reason a fetch failed, and the MIME types of the fetched files.
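If the per-URL details print in the textual form shown above, a small parser can recover the status field. The sketch below assumes that "Status: &lt;code&gt; (&lt;name&gt;)" line format, which is an assumption based on the sample output rather than a documented nutchpy contract:

```python
import re

# Sample details string in the format shown in the output above
details = """Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Sep 26 23:52:36 PDT 2015
Retries since fetch: 0
Score: 0.0"""

def parse_status(text):
    """Extract (code, name) from a 'Status: 1 (db_unfetched)' line, or None."""
    m = re.search(r"^Status:\s*(\d+)\s*\((\w+)\)", text, re.MULTILINE)
    return (int(m.group(1)), m.group(2)) if m else None

code, name = parse_status(details)
print(code, name)  # 1 db_unfetched
```

Applied inside the loop from the earlier example, this would let you split URLs into fetched and unfetched buckets.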
