Home
Before building nutchpy from source, make sure you have the following set up:

- Apache Maven (installation instructions)
- py4j (installation instructions)
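If these prerequisites are not already installed, something like the following usually works; the package name and pip invocation are assumptions for a Debian-based system, so prefer the linked installation instructions for your platform:

```
# Assumed commands for a Debian-based system; adjust for your platform
sudo apt-get install maven   # provides the mvn build tool
pip install py4j             # Python-side py4j bindings
```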
Get the source by cloning the repository:

```
git clone https://github.com/ContinuumIO/nutchpy.git
```
Then run the setup.py script (superuser permission is required for the install step):

```
cd nutchpy
sudo python setup.py install
```
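A quick way to confirm the install succeeded is to import the package from a fresh shell; this is a minimal check, not part of the official instructions:

```
python -c "import nutchpy; print(nutchpy.__file__)"
```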
The nutchpy package by default comes with two simple, easy-to-understand examples. Its basic usage is as follows:

```
import nutchpy

node_path = "<FULL-PATH-TO-CRAWLED-DATA>/data"
seq_reader = nutchpy.sequence_reader

n = 10              # example value: number of rows to print
start, stop = 0, 5  # example values: slice boundaries

print(seq_reader.head(n, node_path))             # prints the first n rows of the file
print(seq_reader.slice(start, stop, node_path))  # prints the rows between start and stop
data = seq_reader.read(node_path)
print(data)                                      # prints the whole file content
```
- `node_path` is generally the path to the crawled data file. On a default Nutch installation it typically looks something like `nutch/runtime/local/crawl/crawldb/current/part-00000/data`.
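Since a wrong path is the most common failure here, a quick existence check before reading can save a confusing stack trace. A minimal sketch, assuming the default layout described above:

```
import os

node_path = "nutch/runtime/local/crawl/crawldb/current/part-00000/data"
if not os.path.exists(node_path):
    raise FileNotFoundError("crawl data not found at: " + node_path)
```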
To process the entire dataset and iterate over the URLs, read the content, which is returned as a list. The sample below runs through all the URLs:

```
import nutchpy

path = 'path-to-nutch/nutch/runtime/local/crawl/crawldb/current/part-00000/data'
data = nutchpy.sequence_reader.read(path)
for list_item in data:
    print(list_item[0])  # prints the URL
    print(list_item[1])  # prints details about the URL
```
A sample output for one row of the above code would be as follows:

```
https://www.abc.com/xyz
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Sep 26 23:52:36 PDT 2015
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0
Signature: null
Metadata:
	_repr_=https://www.abc.com/xyz
	_pst_=moved(12), lastModified=0: https://www.cabelas.com/user/billing_address.jsp
	Content-Type=text/html
	_rs_=115
```
Using the above sample program, one can get all the details of the crawl database: the status of each URL (whether it has been fetched or not), the reason a fetch failed, and the MIME types of the different fetched files.
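For example, a rough tally of crawl statuses can be built by scanning each record's details for its Status: line. This is a sketch, assuming each list item unpacks to a (url, details) pair and that the details render as a string formatted like the sample output above; the status labels come from Nutch's CrawlDb:

```
import nutchpy
from collections import Counter

path = 'path-to-nutch/nutch/runtime/local/crawl/crawldb/current/part-00000/data'
data = nutchpy.sequence_reader.read(path)

status_counts = Counter()
for url, details in data:
    # each record is assumed to contain a line like "Status: 1 (db_unfetched)"
    for line in str(details).splitlines():
        line = line.strip()
        if line.startswith("Status:") and "(" in line:
            status_counts[line.split("(")[-1].rstrip(")")] += 1

print(status_counts)  # e.g. Counter({'db_unfetched': 42, 'db_fetched': 10})
```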