Done as part of CS572 Information Retrieval instructed by Dr. Eugene Agichtein.
Developed and tested on Ubuntu 22.04.2 LTS.
The goal is to implement probabilistic ranking with basic smoothing and with blind relevance feedback. A standard benchmark corpus (a subset of the TREC ad hoc track) is provided for ranking evaluation. The task is to implement the ranking methods described below by overriding the built-in ranking models in Lucene.
The dataset is a subset of the TREC (Text REtrieval Conference) ad hoc track: Text Research Collection Volume 4 (May 1996) and Text Research Collection Volume 5 (April 1997). The document sets used are `fbis`, `ft`, and `latimes`.
- Lucene BM25 (baseline)
- LM-based similarity (`LMDirichletSimilarity` or `LMJelinekMercerSimilarity`)
- RM1 (Lavrenko & Croft, SIGIR 2001)
- RM3 (Lavrenko & Croft, SIGIR 2001)
Sample implementations of RM1 and RM3 can be found in The Lemur Project / Galago.
- To implement Lucene BM25, you can use the `BM25Similarity` class: https://lucene.apache.org/core/9_5_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
- To implement LM-based similarity, you can use either of the Lucene APIs below (see the sketch after these notes):
  https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html
  https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html
- RM1 and RM3 are expansion methods. Therefore, you will need to retrieve the top-K documents first, and then process the top retrieved documents to extract the candidate words and estimate the probabilities needed for RM1 and RM3.
- To do the initial retrieval, you can use either the `LMDirichletSimilarity` or the `LMJelinekMercerSimilarity` class.
- Do not do a new search with the expanded query terms; only re-rank the originally retrieved list of documents. You might want to use a different value of K for computing the RM1 statistics than for the initial top-K retrieved list. For example, estimate probabilities using the top 10 or top 50 documents, but re-rank a larger retrieved list, e.g., 100 or 1,000 documents.
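For reference, a minimal Java sketch of how the similarity could be selected per ranking method; the `Similarities` helper and the parameter values (`k1`, `b`, `mu`) are illustrative, not part of the assignment spec:

```java
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.LMDirichletSimilarity;
import org.apache.lucene.search.similarities.Similarity;

final class Similarities {
    // Map the RANKING_METHOD argument to a Lucene Similarity.
    static Similarity forMethod(String rankingMethod) {
        switch (rankingMethod) {
            case "BM25": return new BM25Similarity(1.2f, 0.75f);  // k1, b (Lucene defaults)
            case "LM":   return new LMDirichletSimilarity(2000f); // Dirichlet mu
            // RM1/RM3 do their initial retrieval with an LM similarity:
            default:     return new LMDirichletSimilarity(2000f);
        }
    }
}
```

The chosen similarity is then applied with `IndexSearcher.setSimilarity(...)` before searching.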
Each file in the dataset contains multiple documents, with boundaries marked by `<DOC>` and `</DOC>`. You can index the files using:

```
java IndexFiles [INDEX_PATH] [DOCS_PATH]
```
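For illustration, a hypothetical splitter (not the repo's actual `IndexFiles` code) that isolates each `<DOC>` block before indexing:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TrecDocSplitter {
    // Each <DOC>...</DOC> block becomes one indexable document.
    private static final Pattern DOC = Pattern.compile("<DOC>(.*?)</DOC>", Pattern.DOTALL);

    static List<String> split(Path trecFile) throws IOException {
        List<String> docs = new ArrayList<>();
        Matcher m = DOC.matcher(Files.readString(trecFile));
        while (m.find()) {
            docs.add(m.group(1)); // raw text of one document, incl. <DOCNO>, <TEXT>, ...
        }
        return docs;
    }
}
```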
To rank the data, run the program using:

```
java ProbabilisticRanker [RANKING_METHOD] [INDEX_PATH] [QUERIES_PATH] [OUTPUT_FILE_PATH]
```

or run the created jar file, `ProbabilisticRanker.jar`:

```
java -jar ProbabilisticRanker.jar [RANKING_METHOD] [INDEX_PATH] [QUERIES_PATH] [OUTPUT_FILE_PATH]
```
`RANKING_METHOD` can be one of `BM25` / `LM` / `RM1` / `RM3`.
The input query file contains the TREC-7 ad hoc and TREC-8 filtering topics (numbers 351-400). A copy can be found in `eval/topics.351-400`.
We need to parse and convert the "topics" into queries (a parsing sketch follows the list):
- Each topic is marked by `<top>` and `</top>`.
- We use the union of the `Title` and `Description` fields as the query for each topic.
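A sketch of the kind of parsing this requires; the regexes follow the standard TREC topic markup, and the `TopicParser` name is an assumption (the repo does the equivalent in `QueryBuilder.java`):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TopicParser {
    private static final Pattern TOP   = Pattern.compile("<top>(.*?)</top>", Pattern.DOTALL);
    private static final Pattern NUM   = Pattern.compile("<num>\\s*Number:\\s*(\\d+)");
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)<", Pattern.DOTALL);
    private static final Pattern DESC  = Pattern.compile("<desc>\\s*Description:(.*?)<", Pattern.DOTALL);

    // Returns topic number -> "Title Description" query string.
    static Map<String, String> parse(String topicsFileContents) {
        Map<String, String> queries = new LinkedHashMap<>();
        Matcher top = TOP.matcher(topicsFileContents);
        while (top.find()) {
            String t = top.group(1);
            Matcher num = NUM.matcher(t), title = TITLE.matcher(t), desc = DESC.matcher(t);
            if (num.find() && title.find() && desc.find()) {
                queries.put(num.group(1), title.group(1).trim() + " " + desc.group(1).trim());
            }
        }
        return queries;
    }
}
```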
For each topic, return up to 1,000 documents, one result per line, tab-delimited, in the following format:

```
topic_id \t Q0 \t document_id \t rank \t score \t your_id
```

- `topic_id` is the number specified for each topic in the input query file
- `Q0` is a constant
- `document_id` is the value of the corresponding "DocId" field for the document in the index
- `rank` is the rank of the document for the query
- `score` is the score of the document for the query
- `your_id` is just an identifier for you
Example result:

```
351 Q0 FT944-4100 386 12.699199 arjlal
```
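A minimal sketch of emitting these lines from Lucene hits; the `ResultWriter` helper is hypothetical, and `DocId` is assumed to be the stored field name used at indexing time:

```java
import java.io.PrintWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;

final class ResultWriter {
    // One tab-delimited result line per hit; ranks start at 1.
    static void write(PrintWriter out, IndexSearcher searcher,
                      String topicId, ScoreDoc[] hits, String yourId) throws Exception {
        for (int rank = 1; rank <= hits.length; rank++) {
            ScoreDoc hit = hits[rank - 1];
            Document doc = searcher.doc(hit.doc); // fetch stored fields for this hit
            out.printf("%s\tQ0\t%s\t%d\t%f\t%s%n",
                    topicId, doc.get("DocId"), rank, hit.score, yourId);
        }
    }
}
```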
An automatic evaluation program from TREC is available in `eval/`. The output file must be in the format specified above.
Example: assume we entered:

```
java -jar ProbabilisticRanker.jar BM25 "./index" "eval/topics.351-400" "results/BM25results.txt"
```

The program uses BM25 to search the index, retrieves the top 1,000 documents for each of the 50 queries, and stores the concatenated results in `results/BM25results.txt` as follows:
```
351 Q0 FT934-4848 1 28.310038 arjlal
351 Q0 FT921-6603 2 26.861269 arjlal
351 Q0 FT941-9999 3 26.669258 arjlal
...
351 Q0 FT941-7250 999 9.0101385 arjlal
351 Q0 FT942-11134 1000 9.006053 arjlal   # up to this point, the results of the first query
352 Q0 FT943-904 1 21.337904 arjlal
352 Q0 FT934-11803 2 16.779995 arjlal
352 Q0 LA120290-0163 3 16.102392 arjlal
...
400 Q0 LA040690-0150 999 9.383813 arjlal
400 Q0 FT934-13571 1000 9.378035 arjlal   # end of query results
```
You can evaluate using:

```
eval/trec_eval.linux eval/qrels.trec7 results/BM25results.txt
```
The source code can be found under `src/`, and the Maven dependencies in `pom.xml`. The Java application was developed with SDK `corretto-18`.
- `smoothingValue` - `0.5`, alpha value for Laplace smoothing
- `k` - `50`, number of docs from which terms are weighted
- `n` - `10`, number of terms with the largest weights to be added to the expanded query
- `lambda` - `0.5`, lambda value used to score docs with the RM3 method
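As a sketch, these defaults could live as constants (the `Defaults` class and field names are illustrative, not the repo's actual declarations):

```java
final class Defaults {
    static final double SMOOTHING_VALUE = 0.5; // Laplace smoothing alpha
    static final int    K               = 50;  // top docs used to weight terms
    static final int    N               = 10;  // highest-weight terms added to the expanded query
    static final double LAMBDA          = 0.5; // interpolation weight in the RM3 score
}
```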
- `IndexFiles.java` - used for indexing a set of documents
- `IndexUtils.java` - contains utility functions for the pipeline.
  METHODS:
  - `writeScoreDocToOutput()` writes the document scores and ranks from a `ScoreDoc[]` variable to the output file
  - `writeDocMapToOutput()` writes the document scores and ranks from a map of `<Document ID, Document Score>` to the output file
  - `sortByValue()` sorts a map by comparing its values (see the sketch after this list)
  - `streamToString()` converts a stream object to a string
  - `stringPreProcessor()` removes stop words from a string
  - `removeBlacklistedWords()` removes blacklisted words from a string
  - `loadStopWords()` returns a list of stop words, taken from the Galago Project: https://sourceforge.net/p/lemur/galago/ci/bed290b74b78bfeba69d1cc24b469459e87fc717/tree/core/src/main/resources/stopwords/rmstop
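One plausible shape for `sortByValue()`, sketched here with a descending comparator (the repo's version may differ):

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

final class MapUtils {
    // Sort a <Document ID, Document Score> map by descending score,
    // preserving the order in a LinkedHashMap.
    static Map<String, Double> sortByValue(Map<String, Double> scores) {
        Map<String, Double> sorted = new LinkedHashMap<>();
        scores.entrySet().stream()
              .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
              .forEachOrdered(e -> sorted.put(e.getKey(), e.getValue()));
        return sorted;
    }
}
```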
- `ProbabilisticRanker.java` - controller for the assignment; parses arguments, does the initial retrieval for each query based on the ranking method, re-ranks the results if the method is RM1 or RM3, and writes the final ranks and scores to the output file.

  ```
  ProbabilisticRanker [RANKING_METHOD] [INDEX_PATH] [QUERIES_PATH] [OUTPUT_FILE_PATH]
  ```

  `RANKING_METHOD` can be one of `BM25` / `LM` / `RM1` / `RM3`.
- `QueryBuilder.java` - creates a map of queries from the query file, with the query number as key and Title + Description as value.
  METHODS:
  - `buildQueries()` does the above-mentioned task
- `RM1.java` - re-ranks a set of scored documents using the RM1 ranking method (see the sketch after this list).
  METHODS:
  - `reRank()` re-ranks a set of documents and returns a map of `<Document ID, Document Score>`. Steps done:
    - remove stop words from the query
    - get the document query likelihoods (taken from `LMDirichletSimilarity` QLEs)
    - calculate term weights over the top 50 documents
    - create the expanded query from the query + top 10 words
    - get new document scores for the expanded query
  - `getDocWiseTermCounts()` for each term in the set of top 50 documents, finds the document-wise frequency; ignores stop words and query words
  - `getDocWiseTermWeights()` for each term in the set of top 50 documents, finds the document-wise weight
  - `getTotalTermWeights()` for each term in the set of top 50 documents, gets its total weight (the sum of the term's weight from each doc)
  - `getDocIDs()` creates a map of document IDs to document names
  - `createDocTermVectors()` creates a map of document IDs to their term vectors
  - `getDocumentQueryLikelihoods()` creates a map from document ID to query likelihood (taken from `LMDirichletSimilarity` QLEs)
  - `getFinalDocScores()` re-scores documents based on the term weights and document query likelihoods
  - `getExpandedQuery()` creates the expanded query of old query + top 10 terms (ignoring terms already in the query)
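The heart of the RM1 weighting, as in Lavrenko & Croft (2001), is P(w|R) ∝ Σ_D P(w|D) · P(Q|D) over the top documents. A minimal sketch, with the map shapes assumed for illustration:

```java
import java.util.HashMap;
import java.util.Map;

final class Rm1Sketch {
    // termProbs:        docID -> (term -> P(w|D))
    // queryLikelihoods: docID -> P(Q|D)
    // Returns term -> summed weight; the N highest-weight terms expand the query.
    static Map<String, Double> termWeights(Map<String, Map<String, Double>> termProbs,
                                           Map<String, Double> queryLikelihoods) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> doc : termProbs.entrySet()) {
            double pQGivenD = queryLikelihoods.getOrDefault(doc.getKey(), 0.0);
            for (Map.Entry<String, Double> term : doc.getValue().entrySet()) {
                weights.merge(term.getKey(), term.getValue() * pQGivenD, Double::sum);
            }
        }
        return weights;
    }
}
```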
- `RM3.java` - re-ranks a set of documents using the RM3 ranking method (see the sketch after this list).
  METHODS:
  - `reRank()` re-ranks a set of documents and returns a map of `<Document ID, Document Score>`. Steps done:
    - gets the RM1 document ranks
    - generates new scores as `(lambda * RM1score) + ((1 - lambda) * QLE)`; the `lambda` value is `0.2`
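The interpolation itself is a one-liner per document; a sketch with the map shapes assumed for illustration:

```java
import java.util.HashMap;
import java.util.Map;

final class Rm3Sketch {
    // RM3 score = lambda * RM1 score + (1 - lambda) * query likelihood (QLE).
    static Map<String, Double> interpolate(Map<String, Double> rm1Scores,
                                           Map<String, Double> queryLikelihoods,
                                           double lambda) {
        Map<String, Double> scores = new HashMap<>();
        rm1Scores.forEach((docId, rm1) ->
                scores.put(docId, lambda * rm1
                        + (1 - lambda) * queryLikelihoods.getOrDefault(docId, 0.0)));
        return scores;
    }
}
```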
- `SearchFiles.java` - searches the index and returns a `ScoreDoc[]` object (see the sketch after this list).
  METHODS:
  - `searchIndex()` searches the index and returns a `ScoreDoc[]` object
  - `createIndexReader()` creates a Lucene `IndexReader` for the index
  - `createSearcher()` creates a Lucene `IndexSearcher` for the index after setting a similarity based on the ranking method
  - `createQuery()` creates a Lucene `Query` object for the query string
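A condensed sketch of what these methods plausibly do, combined into one method for brevity (standard Lucene calls; the `SearchSketch` class is an assumption):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.FSDirectory;

final class SearchSketch {
    // Open the index, apply the chosen Similarity, and return the top-n hits.
    static ScoreDoc[] searchIndex(String indexPath, Query query,
                                  Similarity similarity, int n) throws Exception {
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            searcher.setSimilarity(similarity);
            return searcher.search(query, n).scoreDocs; // ScoreDoc = (docID, score)
        }
    }
}
```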
Results obtained from this program are available in `results/`.
For additional information on the relevance models implemented, see `relevance_models`.
- Lavrenko, V., & Croft, W. B. (2001). Relevance-based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '01). ACM. https://doi.org/10.1145/383952.383972