Nutch Tika Solr

This project contains coursework for 'Information Retrieval and Web Search Engines' (CSCI 572) at the University of Southern California. Its main theme is building an inverted index with Apache Lucene/Solr. The data is crawled from the web with Apache Nutch and read from the crawl segments through the Apache Hadoop HDFS API. Documents are further enriched by parsing them with Apache Tika.

NOTE: See the Step By Step Guide for homework evaluation to learn how to use this code.

Requirements

JDK 1.8 (we faced issues with OpenJDK 8, so please use the Oracle (formerly Sun) JDK 1.8)
A recent version of Maven (we used 3.3)
An internet connection to download Maven dependencies

Additional Setup

During the course of this project, we enhanced Apache Tika by adding a NamedEntityParser and supplied a Named Entity Recogniser implementation based on Stanford CoreNLP. Edit: It is now part of Tika 1.12, so pull the latest version of Tika (>= 1.12).

Build Tika CoreNLP addon
- git clone git@github.com:thammegowda/tika-ner-corenlp.git
- mvn install (run inside the cloned tika-ner-corenlp directory)

Lucene Geo Gazetteer
The FileDump Poster requires https://github.com/chrismattmann/lucene-geo-gazetteer.
- git clone git@github.com:chrismattmann/lucene-geo-gazetteer.git
- Follow the instructions in that repository's README and launch the geo gazetteer server on the default port

How to build

After completing the Additional Setup process, the build is as simple as running the following command: mvn clean package

This command should produce a jar at target/nutch-tika-solr*.jar. Use this jar to run the commands in the next section.

How to run

Run java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar with no sub command. It prints the available sub commands:

Usage : Main <CMD>
The following command(CMD)s are available
     index :  Index nutch segments to solr
phase2parse :  Pharses the text content for NER and updates index
     graph :  Builds a graph of documents, and writes the edges set to file 
  pagerank :  Computes page rank for nodes in graph
  postdump :  Parse the file dump  and post 'em to solr 
  outlinks :  Parse the nutch segments for outlinks and posts 'em to solr
lastmodified :  Retrieves last-modified HTTP header value from nutch segments and updates it to Solr
updaterank :  Updates Page rank
parsecheck :  Checks the parser output (input a config and a file )

postdump command

This command parses a file dump and posts the parsed content to Solr.

Usage :

  java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar postdump
Option "-solr" is required
 -batch N   : Batch size for buffering solr postings (default: 500)
 -in FILE   : Path to Files that are to be parsed and indexed
 -list FILE : Path Containing List of files to be processed
 -solr URL  : Solr URL
 -threads N : Number of Threads (default: 5)
 -timeout N : task timeout. The parser should finish within this time millis
              (default: 15000)
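
The -threads and -timeout options describe a pool of workers in which each parse task must finish within a time limit. The following is a minimal sketch of that pattern using only the JDK, not the project's actual code: the class name ParseTimeoutSketch and the method runWithTimeout are hypothetical, and the lambdas stand in for a real parser call.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; illustrates a per-task timeout, not the project's implementation.
class ParseTimeoutSketch {

    /** Runs one task on the pool; returns empty if it fails or exceeds timeoutMillis. */
    static <T> Optional<T> runWithTimeout(ExecutorService pool, Callable<T> task, long timeoutMillis) {
        Future<T> future = pool.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so its thread is freed
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(5); // mirrors the -threads default
        try {
            // Finishes well inside the 15000 ms limit (the -timeout default).
            Optional<String> fast = runWithTimeout(pool, () -> "parsed", 15000);
            // Sleeps past a deliberately short limit, so the result is dropped.
            Optional<String> slow = runWithTimeout(pool, () -> {
                Thread.sleep(2000);
                return "too late";
            }, 100);
            System.out.println("fast present: " + fast.isPresent());
            System.out.println("slow present: " + slow.isPresent());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Returning an empty Optional on timeout lets the caller skip a slow document and continue with the rest of the dump. Whether postdump skips, retries, or logs such documents is not stated here.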

index Command

This command loads Nutch segment content into Solr and parses metadata using Tika.

Usage :

java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar index
 -batch (--batch-size) N  : Number of documents to buffer and post to solr
                            (default: 1000)
 -segs (--seg-paths) FILE : Path to a text file containing segment paths. One
                            path per line
 -url (--solr-url) URL    : Solr url

Example :

java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar index -batch 300 \
  -segs data/paths-all.txt \
  -url http://localhost:8983/solr
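
The -batch option controls how many documents are buffered before they are posted to Solr. The buffering idea can be sketched as follows. This is an illustration under assumptions, not the project's code: BatchBufferSketch is a hypothetical class, and the Consumer stands in for the call that would post a batch to Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical buffer; the sink is a placeholder for "post this batch to Solr".
class BatchBufferSketch<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    BatchBufferSketch(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one document and flushes when the buffer reaches batchSize. */
    void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    /** Sends whatever is buffered; call once at the end for the final partial batch. */
    void flush() {
        if (buffer.isEmpty()) return;
        sink.accept(new ArrayList<>(buffer)); // copy so the sink owns its own list
        buffer.clear();
    }

    public static void main(String[] args) {
        BatchBufferSketch<String> docs =
            new BatchBufferSketch<>(300, batch -> System.out.println("posting " + batch.size() + " docs"));
        for (int i = 0; i < 700; i++) docs.add("doc-" + i);
        docs.flush(); // without this, the last 100 documents would never be sent
    }
}
```

With 700 documents and a batch size of 300, the sink is called three times, with 300, 300, and 100 documents. A larger batch means fewer round trips at the cost of more memory held in the buffer.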

graph Command

This command builds a graph of the documents in the Solr index. The edges are written to a text file on disk.

Usage :

  java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar graph
   -edge [locations | persons | dates |   : Edge type. This should be a field in
    organizations]                           solr docs.
   -out FILE                              : Output File for writing the edges of
                                             graph.
   -solr URL                              : Solr URL to query docs

Example :

  java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar graph \
    -solr http://localhost:8983/solr/  \
    -out locations.txt -edge locations
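
This README does not spell out how edges are derived from the chosen field. One plausible reading, shown here purely as an assumption, is that two documents are linked when they share a value in that field (for example, the same location). In the sketch below, the class SharedValueGraphSketch, the in-memory input map, and the tab-separated output lines are all hypothetical; the real command reads from Solr and may define edges differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: links documents that share a field value. This is an assumption.
class SharedValueGraphSketch {

    /**
     * docValues maps a document id to the values of one field (say, locations).
     * Returns one "docA<TAB>docB" line per pair of documents sharing a value.
     */
    static List<String> edges(Map<String, List<String>> docValues) {
        // Invert the mapping: field value -> documents that mention it.
        Map<String, List<String>> docsByValue = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : docValues.entrySet()) {
            for (String value : entry.getValue()) {
                List<String> docs = docsByValue.computeIfAbsent(value, k -> new ArrayList<>());
                if (!docs.contains(entry.getKey())) docs.add(entry.getKey());
            }
        }
        // Emit every pair within each group; the set drops pairs seen via another value.
        Set<String> lines = new LinkedHashSet<>();
        for (List<String> docs : docsByValue.values()) {
            for (int i = 0; i < docs.size(); i++) {
                for (int j = i + 1; j < docs.size(); j++) {
                    lines.add(docs.get(i) + "\t" + docs.get(j));
                }
            }
        }
        return new ArrayList<>(lines);
    }

    public static void main(String[] args) {
        Map<String, List<String>> docValues = new LinkedHashMap<>();
        docValues.put("doc1", Arrays.asList("Los Angeles", "New York"));
        docValues.put("doc2", Arrays.asList("Los Angeles"));
        docValues.put("doc3", Arrays.asList("Paris"));
        for (String line : edges(docValues)) System.out.println(line);
    }
}
```

Grouping documents by value first avoids comparing every pair of documents directly, although a value shared by many documents still produces a quadratic number of edges.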

pagerank Command

This command takes a graph described by its edges, computes PageRank, and writes the ranks to a file.

Usage :

  java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar pagerank
   -edges FILE : Path to file having graph's edges
    -n N        : Number of Iteration
    -out FILE   : Path to Output file for storing page ranks
    -t N        : Number of Threads (default: 2)

Example :

  java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar pagerank \
    -edges locations.txt -n 5 -out pr-loc.txt
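
For readers unfamiliar with the computation, the following sketch shows a standard iterative PageRank over a list of directed edges, run for a fixed number of iterations as the -n option suggests. It is not the project's implementation. The class PageRankSketch, the damping factor of 0.85, the even redistribution of rank from nodes without out-links, and the in-memory edge list (the edge file format is not documented here) are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of iterative PageRank; parameters and edge handling are assumptions.
class PageRankSketch {

    /** edges: each element is {source, target}. Returns node -> rank; ranks sum to 1. */
    static Map<String, Double> pageRank(List<String[]> edges, int iterations, double damping) {
        Set<String> nodes = new LinkedHashSet<>();
        Map<String, List<String>> outLinks = new HashMap<>();
        for (String[] e : edges) {
            nodes.add(e[0]);
            nodes.add(e[1]);
            outLinks.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }
        int n = nodes.size();
        Map<String, Double> rank = new HashMap<>();
        if (n == 0) return rank;
        for (String node : nodes) rank.put(node, 1.0 / n); // uniform starting ranks

        for (int i = 0; i < iterations; i++) {
            // Rank held by nodes without out-links is spread evenly over all nodes.
            double dangling = 0.0;
            for (String node : nodes) {
                if (!outLinks.containsKey(node)) dangling += rank.get(node);
            }
            double base = (1.0 - damping) / n + damping * dangling / n;
            Map<String, Double> next = new HashMap<>();
            for (String node : nodes) next.put(node, base);
            // Each node passes a damped, equal share of its rank along every out-link.
            for (Map.Entry<String, List<String>> entry : outLinks.entrySet()) {
                List<String> targets = entry.getValue();
                double share = damping * rank.get(entry.getKey()) / targets.size();
                for (String target : targets) next.merge(target, share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"a", "b"});
        edges.add(new String[] {"b", "a"});
        edges.add(new String[] {"c", "a"});
        // Print "node<TAB>score", one node per line.
        pageRank(edges, 5, 0.85).forEach((node, score) -> System.out.println(node + "\t" + score));
    }
}
```

A fixed iteration count keeps the run time predictable. Increasing it brings the scores closer to their converged values at the cost of more passes over the edges.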

phase2parse command

This sub command runs the phase 2 parser. In this phase, documents are imported from a Solr core (indexed by the 'nutch index' command) and the named entity parser is run to extract the names of people, locations, and organizations, as well as dates, weapon names, and weapon types. The results are written to another Solr core.

Usage :

$ java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar phase2parse \
 -batch (--batch-size) N : Number of documents to buffer and post to solr
                           (default: 1000)
 -dest (--dest-solr) URL : Destination Solr url
 -q (--query) VAL        : Import Query (default: *:*)
 -src (--src-solr) URL   : Source Solr url
 -start (--start) N      : Import start (default: 0)

Example :

  java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar phase2parse \
    -src http://localhost:8983/solr/weapons1 \
    -dest http://localhost:8983/solr/weapons3 \
    -batch 100 -q '*:*' -start 0

updaterank command

This command takes the PageRank file produced by the 'pagerank' command and posts it to Solr.

Usage :

$ java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar updaterank
  -batch (--batch-size) N    : Batch or buffer size (default: 1000)
  -field (--rank-field) VAL  : Solr schema field for storing the page rank
  -ranks (--ranks-file) FILE : File containing Page ranks. Each line should have
                               'URL	(double)SCORE'
  -solr (--solr-url) URL     : Solr server URL

Example :

$ java -jar target/nutch-tika-solr-1.0-SNAPSHOT.jar updaterank \
  -field location_pr -ranks pageranks-locations.txt \
  -solr http://localhost:8983/solr/collection2
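
The -ranks option expects each line of the ranks file to have the form 'URL<TAB>(double)SCORE'. The sketch below shows one way such a line could be parsed and validated. The class RankLineSketch and the choice to reject malformed lines with an exception are assumptions, not the project's behavior.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical parser for one line of the ranks file.
class RankLineSketch {

    /** Parses one "URL<TAB>SCORE" line into a (url, score) pair. */
    static Map.Entry<String, Double> parse(String line) {
        // Split at the last tab so the score is always the final field.
        int tab = line.lastIndexOf('\t');
        if (tab <= 0 || tab == line.length() - 1) {
            throw new IllegalArgumentException("expected 'URL<TAB>SCORE' but got: " + line);
        }
        String url = line.substring(0, tab);
        double score = Double.parseDouble(line.substring(tab + 1).trim());
        return new SimpleImmutableEntry<String, Double>(url, score);
    }

    public static void main(String[] args) {
        Map.Entry<String, Double> entry = parse("http://example.com/page\t0.0123");
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}
```

Failing fast on a malformed line makes a bad ranks file visible before any update reaches Solr. A more lenient tool could instead log the line and skip it.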

Developers / Team

Thamme Gowda N.
Rakshith
Rahul
Nii Mante
