The repository for the InfoChain Project
There are two data collections that I used:
- News crawled before the DRC project using the Yahoo! News API.
  - The crawler tries to fetch 100 news articles per hour.
  - Redundant articles (with the same url as a previously stored article) are ignored.
- News crawled during the DRC project from Google News.
  - The crawler tries to fetch 10 pages of results per day, and the results are limited to that day.
  - No url-based redundancy detection is used.
- Pre DRC: can be found in data/pre_drc_data/path.json. It includes the paths for news data and the queries (disasters).
- DRC: can be found in data/drc_data/path.json. It includes the paths for news data, tweets, and the queries (disasters).
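Since path.json is a plain json file, here is a minimal sketch for loading it (the key structure inside the file is not documented here, so inspect the file itself):

    import json

    # Load the DRC path configuration; the keys map data-set components
    # (news data, tweets, queries) to filesystem paths.
    with open("data/drc_data/path.json") as f:
        paths = json.load(f)
    print paths  # Python 2.7 print statement, matching the version requirement below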
The code resides in Infolab at: /infolab/node4/lukuang/code/InfoChain
In order to run the code, you need to have the utility modules I wrote before on your Python module search path. One way to do this is to download the modules and add their path to the PYTHONPATH environment variable:
$ export PYTHONPATH=/path-contains-module:$PYTHONPATH
The software version requirements are:
- python: 2.7
- Java: Java 8 (for Stanford NER)
- Indri: 5.6
There are two modules for processing the two data collections:
- data/pre_drc_data
It contains the module for processing the data related to the Pre DRC crawl. There are 4 files:
- __init__.py: marks the directory as a Python package
- news_data.py: containing classes for handling news articles in the Pre DRC data
- query.py: processing the queries in the query file of the Pre DRC data
- path.json: json format file containing the paths for Pre DRC data
- data/drc_data
It contains the module for processing the data related to the DRC crawl. There are 5 files:
- __init__.py: marks the directory as a Python package
- news_data.py: containing classes for handling news articles in the DRC data set
- twitter_data.py: intended for processing DRC tweet data, not yet implemented
- query.py: processing the queries in the query file of the DRC data
- path.json: json format file containing the paths for DRC data
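Assuming the repository root is on your Python path and data/ is importable as a package, these modules can be imported like any other package; which classes and functions they expose is not listed here, so check the source files:

    # Hypothetical import sketch; the module names come from the listings above.
    from data.pre_drc_data import news_data as pre_drc_news_data
    from data.drc_data import news_data as drc_news_data
    from data.drc_data import query as drc_query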
There are also two Python scripts, easy_baseline/get_pre_drc_news_text.py and easy_baseline/get_drc_news_text.py, for generating trec format text files from these two data sets in preparation for building an Indri index. Each script takes two parameters (an example invocation follows the list):
- dest_dr (required): the destination directory for the text
- -dl (optional): the number of days from the beginning of a disaster within which news articles are included in the index. If no limit is wanted, omit this parameter.
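For example, a hypothetical invocation (dest_dr is assumed to be positional) that writes the DRC text files to /tmp/drc_text and keeps only articles from the first 7 days of each disaster:

$ python easy_baseline/get_drc_news_text.py /tmp/drc_text -dl 7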
For generating an Indri index from the text files, please use the script easy_baseline/generate_index_para.py, which takes three required parameters (an example invocation follows the list):
- index_path: the destination of the index that you want to generate
- data_path: the data directory path of the text files (referring to the dest_dr mentioned above).
- para_file_path: the path where this script writes the Indri parameter file
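A hypothetical invocation, assuming the three parameters are positional and given in the order listed:

$ python easy_baseline/generate_index_para.py /tmp/drc_index /tmp/drc_text /tmp/build_index.param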
Once the parameter file is generated, use the following command to build the index:
$ IndriBuildIndex para_file_path
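The script writes the parameter file for you; for reference only, an IndriBuildIndex parameter file for trectext data typically looks something like this (the paths are placeholders):

    <parameters>
      <index>/tmp/drc_index</index>
      <corpus>
        <path>/tmp/drc_text</path>
        <class>trectext</class>
      </corpus>
    </parameters>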
I implemented a very simple baseline (easy_baseline/easy_baseline.py) that finds common entities of two disaster instances, relying only on the occurrence counts of the entities.
The workflow is as follows:
- A user inputs two queries, and possibly other optional parameters, to the script.
- The script will search the two queries against the Indri index generated above.
- If there are common documents in the top 10 documents, return the ids of the common documents.
- If there are no common documents, first generate a vector representation of each of the top 100 documents. The vectors contain the occurrence counts of words or entities in the document. There are three ways to generate the vectors:
  - use individual words.
  - use the named entities identified by Stanford NER.
  - use the Wikipedia entities found in the document via the DBpedia Spotlight web API.
- Merge the document vectors for each query. Compare the two merged vectors to find common entities that are within the 100 most frequent entities of each query. Record each common entity with the smaller of its two occurrence counts (a code sketch follows this list).
- The common entities will be printed to the standard output as (entity, smaller count) tuples.
- The entities, as well as the ids of the documents mentioning them, will be output in json format to the file
  /infolab/node4/lukuang/code/InfoChain/easy_baseline/data/results/easy_basline/METHOD_NAME/QUERY_STRING
  where METHOD_NAME is the method used for generating vectors (words, Stanford NER, etc.), and QUERY_STRING is the concatenation of the two query strings, ordered alphabetically.
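As a rough sketch of the merge-and-compare step above (this is not the actual code in easy_baseline.py; the function name and inputs are hypothetical, with the per-document vectors assumed to be collections.Counter objects):

    from collections import Counter

    def common_entities(doc_vectors_q1, doc_vectors_q2, num_top=100):
        # Merge the per-document count vectors into one total vector per query
        total_q1 = sum(doc_vectors_q1, Counter())
        total_q2 = sum(doc_vectors_q2, Counter())
        # Keep only the num_top most frequent entities of each query
        top_q1 = dict(total_q1.most_common(num_top))
        top_q2 = dict(total_q2.most_common(num_top))
        # Record each common entity with the smaller of its two counts
        return [(entity, min(top_q1[entity], top_q2[entity]))
                for entity in top_q1 if entity in top_q2]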
Two required parameters:
- query_1
- query_2
Optional parameters:
- -td: top document count, the number of top documents in which the script tries to find common documents. The default is 10.
- -cd: common info doc count, the number of documents in which the script tries to find common entities. The default is 100.
- -nt: number of top common info, the number of most frequent entities to be considered when finding common entities. For instance, if it is specified as 100, the 100 most frequent entities of each query will be compared and the common entities will be used as output. The default is 100.
- -vm: vector method, the kind of vectors to be generated. The options are:
  - 0: words
  - 1: Stanford NER
  - 2: DBpedia Spotlight
  The default value is 0 (words).
- -rd: result dir, the directory where the entities, as well as the documents containing them, will be stored.
- -vd: vector dir, the directory where the intermediate document vectors will be stored.
- -ind: index dir, the Indri index directory that the queries will be searched against.
- -tqf: temporal query file, the temporal Indri query parameter file.
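For example, a hypothetical invocation (the query strings are made up, and query_1 and query_2 are assumed to be positional):

$ python easy_baseline/easy_baseline.py "hurricane sandy" "typhoon haiyan" -vm 1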
Feel free to execute multiple instances with different vector methods. However, executing multiple instances with the same vector method might be a bad idea, since they might try to read/write the document vector information at the same time.
There is a script, easy_baseline/view_entity.py, through which you can show the surrounding text of a common entity for both queries, to get an idea of how the entity is used in the two disasters. It takes three parameters:
- disaster_vector_result_file (required): the result file generated by easy_baseline.py
- entity (required): the name of the entity that you want to show
- -ind (optional): the Indri index path
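For example, a hypothetical invocation (the result file path and entity name are made up):

$ python easy_baseline/view_entity.py data/results/easy_basline/words/QUERY_STRING "red cross" -ind /tmp/drc_index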