Information Retrieval Evaluation Metrics and Plots Computation

A Python script that computes common Information Retrieval metrics and plots a Precision-Recall curve for a given set of results.

Metrics computed by this script:

  • Set of Precision values for each query.
  • Set of Recall values for each query.
  • Average Precision (AvP) for each query.
  • Mean Average Precision (MAP) for all queries.

Plots:

  • Precision-Recall curve for one or more queries (non-interpolated).

Table of Contents

  1. Installation Requirements
  2. Usage Examples
    1. Example 1: Single Query
    2. Example 2: Multiple Queries
  3. Evaluation in Information Retrieval Briefly Explained
    1. Precision and Recall
    2. Average Precision
    3. Mean Average Precision
    4. Precision-Recall Curves
    5. Practical Example
    6. Further Reading

Installation Requirements

To run this script from the terminal, you need:

  • Python
  • pip, to install the Python packages the script depends on (scikit-learn and matplotlib):

python -m pip install -U pip
pip install -U scikit-learn
pip install -U matplotlib

Usage Examples

From the terminal, run the script with two arguments:

  1. The ordered set of results, written as a string of R (relevant) and N (non-relevant) characters in ranked order.
  2. The total number of relevant documents in the collection for the information need.
python computeMetrics.py <orderedSet> <totalNumberRelevant>

You can also evaluate more than one query by passing comma-separated values for both arguments:

python computeMetrics.py <orderedSet1,orderedSet2,...,orderedSet7> <totalNumberRelevant1,totalNumberRelevant2,...,totalNumberRelevant7>
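The comma-separated form implies that the i-th ordered set is paired with the i-th total. A minimal sketch of that pairing, assuming a hypothetical helper `parse_queries` (the actual argument handling inside computeMetrics.py is not shown here and may differ):

```python
def parse_queries(ordered_sets: str, totals: str) -> list[tuple[str, int]]:
    """Pair each ranked R/N string with its total number of relevant documents.

    Hypothetical helper for illustration; not taken from computeMetrics.py.
    """
    sets = ordered_sets.split(",")
    counts = [int(t) for t in totals.split(",")]
    if len(sets) != len(counts):
        raise ValueError("need one total per ordered set")
    for s in sets:
        # Reject empty strings and any character other than R or N.
        if not s or set(s) - {"R", "N"}:
            raise ValueError(f"ordered set must contain only R and N: {s!r}")
    return list(zip(sets, counts))


queries = parse_queries("RNRNNRNNRR,NRNNRNRNNN", "6,6")
print(queries)  # [('RNRNNRNNRR', 6), ('NRNNRNRNNN', 6)]
```

A single query is simply the one-element case, for example `parse_queries("RNRRRRNNNR", "6")`.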

Example 1: Single Query

To compute the metrics for a single query, run in the terminal:

python computeMetrics.py RNRRRRNNNR 6

Here, 6 is the total number of documents in the collection that are relevant to the evaluated information need.

The script prints:

SET: RNRRRRNNNR

PRECISION: [1.0, 0.5, 0.6666666666666666, 0.75, 0.8, 0.8333333333333334, 0.7142857142857143, 0.625, 0.5555555555555556, 0.6]

RECALL: [0.16666666666666666, 0.16666666666666666, 0.3333333333333333, 0.5, 0.6666666666666666, 0.8333333333333334, 0.8333333333333334, 0.8333333333333334, 0.8333333333333334, 1.0]

AVERAGE PRECISION: 0.7749999999999999

MAP: 0.7749999999999999

It also creates the Precision-Recall curve plot (figure: P-R curve for example 1).
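The PRECISION and RECALL lists can be reproduced by walking the ranking once and counting relevant results seen so far. This is a minimal sketch under that assumption, using a hypothetical function name; it is not the script's own code:

```python
def precision_recall_at_ranks(ranking: str, total_relevant: int):
    """Return (precision, recall) lists with one value per rank position."""
    precision, recall = [], []
    hits = 0  # relevant results seen so far
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
        precision.append(hits / rank)           # relevant so far / retrieved so far
        recall.append(hits / total_relevant)    # relevant so far / all relevant
    return precision, recall


precision, recall = precision_recall_at_ranks("RNRRRRNNNR", 6)
print([round(p, 4) for p in precision])
# [1.0, 0.5, 0.6667, 0.75, 0.8, 0.8333, 0.7143, 0.625, 0.5556, 0.6]
print([round(r, 4) for r in recall])
# [0.1667, 0.1667, 0.3333, 0.5, 0.6667, 0.8333, 0.8333, 0.8333, 0.8333, 1.0]
```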

Example 2: Multiple Queries

To compute the metrics for multiple queries, run in the terminal:

python computeMetrics.py RNRNNRNNRR,NRNNRNRNNN 6,6

The script prints:

SET: RNRNNRNNRR

PRECISION: [1.0, 0.5, 0.6666666666666666, 0.5, 0.4, 0.5, 0.42857142857142855, 0.375, 0.4444444444444444, 0.5]

RECALL: [0.16666666666666666, 0.16666666666666666, 0.3333333333333333, 0.3333333333333333, 0.3333333333333333, 0.5, 0.5, 0.5, 0.6666666666666666, 0.8333333333333334]

AVERAGE PRECISION: 0.6222222222222221


SET: NRNNRNRNNN

PRECISION: [0.0, 0.5, 0.3333333333333333, 0.25, 0.4, 0.3333333333333333, 0.42857142857142855, 0.375, 0.3333333333333333, 0.3]

RECALL: [0.0, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.3333333333333333, 0.3333333333333333, 0.5, 0.5, 0.5, 0.5]

AVERAGE PRECISION: 0.44285714285714284

MAP: 0.5325396825396824

It also creates the Precision-Recall curve plot (figure: P-R curve for example 2).
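The AVERAGE PRECISION and MAP values above are consistent with averaging precision over the relevant results that were actually retrieved (5 and 3 here), rather than over all 6 relevant documents in the collection. The sketch below makes that assumption explicit; it is inferred from the printed numbers, not from the script's source. Some formulations instead divide by the total number of relevant documents, which gives lower values whenever not every relevant document is retrieved.

```python
def average_precision(ranking: str) -> float:
    """Mean of the precision values at the ranks where a relevant result appears.

    Assumption inferred from the printed output: the divisor is the number of
    relevant results retrieved, not the total number relevant in the collection.
    """
    hits = 0
    precisions_at_hits = []
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            precisions_at_hits.append(hits / rank)
    return sum(precisions_at_hits) / hits if hits else 0.0


rankings = ["RNRNNRNNRR", "NRNNRNRNNN"]
ap_values = [average_precision(r) for r in rankings]
mean_ap = sum(ap_values) / len(ap_values)  # MAP over both queries
print([round(ap, 4) for ap in ap_values], round(mean_ap, 4))
# [0.6222, 0.4429] 0.5325
```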

Evaluation in Information Retrieval Briefly Explained

A standard approach to evaluating an information retrieval system uses the notion of relevant and non-relevant documents for an information need.

The user performs a search in order to answer an information need. The results retrieved by the system can be classified as relevant or non-relevant, depending on whether they address the stated information need.

Precision and Recall

The most frequently used metrics in Information Retrieval evaluation are Precision and Recall.

Precision is the fraction of retrieved documents that are relevant.

Precision = #(relevant items retrieved) / #(total retrieved items)

Recall is the fraction of relevant documents that are retrieved.

Recall = #(relevant items retrieved) / #(total relevant items in the collection)
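As a minimal sketch, both formulas reduce to a set intersection. The function name and the document IDs below are hypothetical, invented purely for illustration:

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall for one query."""
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, not taken from any real collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7", "d8", "d9"}
print(precision_recall(retrieved, relevant))  # (0.5, 0.4)
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were retrieved (recall 0.4).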

Average Precision

Average Precision (AvP) provides a single-figure measure of quality across recall levels for a single query. It is the average of the precision values obtained after each relevant document is retrieved in the results list.
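The definition can be written compactly. The symbols are notation introduced here for illustration: $H$ is the set of rank positions in the results list that hold a relevant document, and $P(k)$ is the precision over the top $k$ results.

```latex
\mathrm{AvP} = \frac{1}{|H|} \sum_{k \in H} P(k),
\qquad
P(k) = \frac{\#(\text{relevant items in the top } k)}{k}
```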

Mean Average Precision

Mean Average Precision (MAP) is the mean of the AvP values over a given set of queries. It summarizes a system's ranking quality across those queries in a single number.
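In formula form, with $Q$ denoting the set of evaluated queries (a symbol introduced here for illustration) and $\mathrm{AvP}(q)$ the average precision of query $q$:

```latex
\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AvP}(q)
```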

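Precision-Recall Curves

A Precision-Recall curve plots the trade-off between precision and recall at different thresholds; for a ranked results list, each threshold is a cutoff position in the ranking.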
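A rough sketch of how such a curve could be drawn for one ranked list follows. The function names and the output file name are hypothetical, and the plotting details are an assumption rather than the script's actual implementation; matplotlib is imported only when the plot function is called.

```python
def pr_points(ranking: str, total_relevant: int) -> list[tuple[float, float]]:
    """(recall, precision) after each rank cutoff k = 1..len(ranking)."""
    points, hits = [], 0
    for k, label in enumerate(ranking, start=1):
        hits += label == "R"  # True counts as 1
        points.append((hits / total_relevant, hits / k))
    return points


def plot_pr_curve(ranking: str, total_relevant: int, path: str = "pr_curve.png") -> None:
    """Draw the non-interpolated curve and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    recall, precision = zip(*pr_points(ranking, total_relevant))
    plt.plot(recall, precision, marker="o")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.xlim(0, 1)
    plt.ylim(0, 1.05)
    plt.savefig(path)
    plt.close()


points = pr_points("RNRRRRNNNR", 6)
print(points[0], points[-1])  # (0.16666666666666666, 1.0) (1.0, 0.6)
```

Calling `plot_pr_curve("RNRRRRNNNR", 6)` would then write the curve for the first usage example to `pr_curve.png`.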

Practical Example

A user wants to know whether eating fruit can improve their immune system (information need). The user searches for [fruits improve immune system] (query).

Let's assume that the system returned 5 results.

From first to last, they were judged Relevant, Relevant, Non-Relevant, Relevant, Non-Relevant with respect to the user's information need.

This can be written as the ordered set RRNRN.

Let's also assume that the indexed collection contains a total of 8 documents relevant to this information need. Over all 5 results:

Precision = (3 / 5) = 0.6
Recall = (3 / 8) = 0.375

We can compute the precision and recall value at each position of the 5 results.

| Results   | R   | R   | N   | R   | N   |
|-----------|-----|-----|-----|-----|-----|
| Recall    | 1/8 | 2/8 | 2/8 | 3/8 | 3/8 |
| Precision | 1/1 | 2/2 | 2/3 | 3/4 | 3/5 |
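The per-position values can be generated mechanically. This short sketch uses `fractions.Fraction` to keep the values exact; the variable names are illustrative only:

```python
from fractions import Fraction

ranking = "RRNRN"
total_relevant = 8  # relevant documents in the whole collection

rows = []
hits = 0  # relevant results seen so far
for k, label in enumerate(ranking, start=1):
    if label == "R":
        hits += 1
    # Fraction stores the reduced value (2/8 becomes 1/4), so the unreduced
    # form shown in the walkthrough is printed from the raw counts instead.
    rows.append((k, label, Fraction(hits, total_relevant), Fraction(hits, k)))
    print(f"rank {k}: {label}  recall={hits}/{total_relevant}  precision={hits}/{k}")
```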

The Precision-Recall curve is built from these recall and precision values, one point per position.

We can now compute the average precision (AvP) using the precision values at the positions where a relevant document appears.

AvP = (1/1 + 2/2 + 3/4) / 3 ≈ 0.9167

Let's assume that the user ran a second search for a different information need, for which the system achieved an average precision of 0.82.

We can compute the system's Mean Average Precision (MAP) over both information needs as:

MAP = (0.9167 + 0.82) / 2 ≈ 0.868
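The arithmetic can be checked in a few lines, carrying full precision through to the end instead of rounding the intermediate AvP. The variable names are illustrative, and 0.82 is the value assumed in the text for the second information need:

```python
ranking = "RRNRN"

hits = 0
precisions_at_hits = []
for k, label in enumerate(ranking, start=1):
    if label == "R":
        hits += 1
        precisions_at_hits.append(hits / k)  # 1/1, 2/2, 3/4

avp_first = sum(precisions_at_hits) / len(precisions_at_hits)
avp_second = 0.82  # given in the text for the second information need
mean_ap = (avp_first + avp_second) / 2

print(f"AvP = {avp_first:.4f}, MAP = {mean_ap:.4f}")
# AvP = 0.9167, MAP = 0.8683
```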

Further Reading

Manning, C., Raghavan, P., Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press.