Reduce documentation duplication for learned sparse models (#1765)
Fold reproduction guides into regression documentation for MS MARCO v1 learned sparse models:
DeepImpact, uniCOIL+TILDE, SPLADEv2
lintool authored Feb 11, 2022
1 parent 6d8f494 commit c7614d2
Showing 14 changed files with 282 additions and 320 deletions.
13 changes: 6 additions & 7 deletions README.md
@@ -38,12 +38,15 @@ cd tools/eval/ndeval && make && cd ../../..

With that, you should be ready to go!

## Regression Experiments
## Regression Experiments (+ Reproduction Guides)

Anserini is designed to support experiments on various standard IR test collections out of the box.
The following experiments are backed by [rigorous end-to-end regression tests](docs/regressions.md) with [`run_regression.py`](src/main/python/run_regression.py) and [the Anserini reproducibility promise](docs/regressions.md).
For the most part, these runs are based on [_default_ parameter settings](https://github.com/castorini/Anserini/blob/master/src/main/java/io/anserini/search/SearchArgs.java).

These pages can also serve as guides to reproduce our results.
See individual pages for details!

+ Regressions for [Disks 1 & 2 (TREC 1-3)](docs/regressions-disk12.md), [Disks 4 & 5 (TREC 7-8, Robust04)](docs/regressions-disk45.md), [AQUAINT (Robust05)](docs/regressions-robust05.md)
+ Regressions for [the New York Times Corpus (Core17)](docs/regressions-core17.md), [the Washington Post Corpus (Core18)](docs/regressions-core18.md)
+ Regressions for [Wt10g](docs/regressions-wt10g.md), [Gov2](docs/regressions-gov2.md)
@@ -92,7 +95,7 @@ For the most part, these runs are based on [_default_ parameter settings](https:
+ Regressions for FIRE 2012: [Monolingual Bengali](docs/regressions-fire12-bn.md), [Monolingual Hindi](docs/regressions-fire12-hi.md), [Monolingual English](docs/regressions-fire12-en.md)
+ Regressions for Mr. TyDi (v1.1): [ar](docs/regressions-mrtydi-v1.1-ar.md), [bn](docs/regressions-mrtydi-v1.1-bn.md), [en](docs/regressions-mrtydi-v1.1-en.md), [fi](docs/regressions-mrtydi-v1.1-fi.md), [id](docs/regressions-mrtydi-v1.1-id.md), [ja](docs/regressions-mrtydi-v1.1-ja.md), [ko](docs/regressions-mrtydi-v1.1-ko.md), [ru](docs/regressions-mrtydi-v1.1-ru.md), [sw](docs/regressions-mrtydi-v1.1-sw.md), [te](docs/regressions-mrtydi-v1.1-te.md), [th](docs/regressions-mrtydi-v1.1-th.md)

## Reproduction Guides
## Additional Documentation

The experiments described below are not associated with rigorous end-to-end regression testing and thus provide a lower standard of reproducibility.
For the most part, manual copying and pasting of commands into a shell is required to reproduce our results.
@@ -105,10 +108,6 @@ For the most part, manual copying and pasting of commands into a shell is requir
+ Reproducing [doc2query results](docs/experiments-doc2query.md) (MS MARCO Passage Ranking and TREC-CAR)
+ Reproducing [docTTTTTquery results](docs/experiments-docTTTTTquery.md) (MS MARCO Passage and Document Ranking)
+ Notes about reproduction issues with [MS MARCO Document Ranking w/ docTTTTTquery](docs/experiments-msmarco-doc-doc2query-details.md)
+ Reproducing experiments with sparse learned models for MS MARCO Passage Ranking:
+ [DeepImpact](docs/experiments-msmarco-passage-deepimpact.md), [uniCOIL with doc2query-T5](docs/experiments-msmarco-unicoil.md), [uniCOIL with TILDE](docs/experiments-msmarco-passage-unicoil-tilde-expansion.md), [SPLADEv2](docs/experiments-msmarco-passage-splade-v2.md)
+ Reproducing experiments with sparse learned models for MS MARCO Document Ranking:
+ [uniCOIL with doc2query-T5](docs/experiments-msmarco-unicoil.md)

### MS MARCO (V2)

@@ -132,7 +131,7 @@ For the most part, manual copying and pasting of commands into a shell is requir
+ Runbook for [ECIR 2019 paper on axiomatic semantic term matching](docs/runbook-ecir2019-axiomatic.md)
+ Runbook for [ECIR 2019 paper on cross-collection relevance feedback](docs/runbook-ecir2019-ccrf.md)

## Additional Documentation
### Other Features

+ Use Anserini in Python via [Pyserini](http://pyserini.io/)
+ Anserini integrates with SolrCloud via [Solrini](docs/solrini.md)
76 changes: 2 additions & 74 deletions docs/experiments-msmarco-passage-deepimpact.md
@@ -1,79 +1,7 @@
# Anserini: DeepImpact for MS MARCO V1 Passage Ranking

This page describes how to reproduce the DeepImpact experiments in the following paper:
This page previously hosted a guide on how to reproduce the DeepImpact experiments in the following paper:

> Antonio Mallia, Omar Khattab, Nicola Tonellotto, and Torsten Suel. [Learning Passage Impacts for Inverted Indexes.](https://dl.acm.org/doi/10.1145/3404835.3463030) _SIGIR 2021_.

Here, we start with a version of the MS MARCO passage corpus that has already been processed with DeepImpact, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved.

Note that Pyserini provides [a comparable reproduction guide](https://github.com/castorini/pyserini/blob/master/docs/experiments-deepimpact.md), so if you don't like Java, you can get _exactly_ the same results from Python.

## Data Prep

```bash
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/57AE5aAjzw2ox2n/download -O collections/msmarco-passage-deepimpact-b8.tar

tar xvf collections/msmarco-passage-deepimpact-b8.tar -C collections/
```

To confirm, `msmarco-passage-deepimpact-b8.tar` is ~3.6 GB and has MD5 checksum `3c317cb4f9f9bcd3bbec60f05047561a`.
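
The checksum can also be checked programmatically rather than by eye. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` are hypothetical and are not part of Anserini.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the expected checksum (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()
```

For the tarball above, the call would be `verify_download("collections/msmarco-passage-deepimpact-b8.tar", "3c317cb4f9f9bcd3bbec60f05047561a")`, assuming the download was saved under that path.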

## Indexing

We can now index these docs as a `JsonVectorCollection` using Anserini:

```bash
sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \
-input collections/msmarco-passage-deepimpact-b8/ \
-index indexes/lucene-index.msmarco-passage.deepimpact-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 18 -storeRaw
```

The important indexing options here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (encoding them is the default behavior), and the second tells it not to apply any additional tokenization to the DeepImpact tokens.

Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 15 minutes.

## Retrieval

To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
The queries are already stored in the repo, so we can run retrieval directly:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage.deepimpact-b8 \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \
-output runs/run.msmarco-passage.deepimpact-b8.tsv -format msmarco \
-impact -pretokenized
```

Note that, mirroring the indexing options, we also specify `-impact -pretokenized` here.
Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 30 minutes (on a single thread).

With `-format msmarco`, runs are already in the MS MARCO output format, so we can evaluate directly:

```bash
python tools/scripts/msmarco/msmarco_passage_eval.py \
collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.deepimpact-b8.tsv
```

The results should be as follows:

```
#####################
MRR @10: 0.3252764133351524
QueriesRanked: 6980
#####################
```

The final evaluation metric is very close to the one reported in the paper (0.326).
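
For readers unfamiliar with the metric, MRR@10 is the mean over queries of the reciprocal rank of the first relevant passage within the top 10 results, with queries that have no relevant hit contributing zero. The sketch below illustrates the computation; it is not the code of `msmarco_passage_eval.py`. The function name `mrr_at_k`, the dictionary inputs, and averaging over every query in the qrels are assumptions made for illustration.

```python
def mrr_at_k(qrels, run, k=10):
    """Compute MRR@k.

    qrels: dict mapping query id -> set of relevant passage ids.
    run:   dict mapping query id -> list of passage ids in ranked order.
    Assumption: the mean is taken over all queries present in qrels.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])[:k]
        for rank, docid in enumerate(ranked, start=1):
            if docid in relevant:
                # Only the first relevant hit counts toward the reciprocal rank.
                total += 1.0 / rank
                break
    return total / len(qrels)
```

With two queries, one whose first relevant passage is at rank 2 and one with no relevant passage retrieved, the score is (1/2 + 0) / 2.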


## Reproduction Log[*](reproducibility.md)

+ Results reproduced by [@MXueguang](https://github.com/MXueguang) on 2021-06-17 (commit [`ff618db`](https://github.com/castorini/anserini/commit/ff618dbf87feee0ad75dc42c72a361c05984097d))
+ Results reproduced by [@JMMackenzie](https://github.com/jmmackenzie) on 2021-06-22 (commit [`490434`](https://github.com/castorini/anserini/commit/490434172a035b6eade8c17771aed83cc7f5d996))
+ Results reproduced by [@amyxie361](https://github.com/amyxie361) on 2021-06-22 (commit [`6f9352`](https://github.com/castorini/anserini/commit/6f9352fc5d6a4938fadc2bda9d0c428056eec5f0))
The guide has been integrated in [Anserini's regression framework](regressions-msmarco-passage-deepimpact.md), and this page has been reduced to a redirect stub.
78 changes: 2 additions & 76 deletions docs/experiments-msmarco-passage-splade-v2.md
@@ -1,81 +1,7 @@
# Anserini: SPLADEv2 for MS MARCO V1 Passage Ranking

This page describes how to reproduce the SPLADEv2 results with the DistilSPLADE-max model from the following paper:
This page previously hosted a guide on how to reproduce the SPLADEv2 results with the DistilSPLADE-max model from the following paper:

> Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant. [SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.](https://arxiv.org/abs/2109.10086) _arXiv:2109.10086_.

Here, we start with a version of the MS MARCO passage corpus that has already been processed with the model, i.e., gone through document expansion and term reweighting.
Thus, no neural inference is involved. Because the model weights are provided in fp16, they have been converted to integers by rounding `weight * 100` to the nearest integer.
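
The conversion from floating-point weights to integer impacts can be sketched as follows. This is an illustration under stated assumptions rather than the script used to build the released corpus: the function name `quantize_weights` and the dictionary representation are hypothetical, and the treatment of exact ties relies on Python's built-in `round`, which may differ from the original conversion.

```python
def quantize_weights(term_weights, scale=100):
    """Map each term's float weight to an integer impact via round(weight * scale).

    term_weights: dict mapping token -> float weight (hypothetical representation).
    Assumption: Python's built-in round() is used; tie-breaking in the original
    conversion is not documented here.
    """
    return {term: int(round(weight * scale)) for term, weight in term_weights.items()}
```

For example, a term with weight 1.234 would be stored with an integer impact of 123.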

Note that Pyserini provides [a comparable reproduction guide](https://github.com/castorini/pyserini/blob/master/docs/experiments-spladev2.md), so if you don't like Java, you can get _exactly_ the same results from Python.

## Data Prep

We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with SPLADE processing:

```bash
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msmarco-passage-distill-splade-max.tar

tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/
```

To confirm, `msmarco-passage-distill-splade-max.tar` is ~9.8 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`.

## Indexing

We can now index these docs as a `JsonVectorCollection` using Anserini:

```bash
sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \
-input collections/msmarco-passage-distill-splade-max \
-index indexes/lucene-index.msmarco-passage.distill-splade-max \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
```

The important indexing options here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (encoding them is the default behavior), and the second tells it not to apply any additional tokenization to the SPLADEv2 tokens.

Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 30 minutes.

## Retrieval

To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
The queries are already stored in the repo, so we can run retrieval directly:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage.distill-splade-max \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \
-output runs/run.msmarco-passage.distill-splade-max.tsv -format msmarco \
-impact -pretokenized
```

Note that, mirroring the indexing options, we also specify `-impact -pretokenized` here.
Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 4 hours (on a single thread).
No, this isn't a mistake!
This model suffers from very slow queries with Lucene due to an as-yet-unknown issue.
We're looking into it.

With `-format msmarco`, runs are already in the MS MARCO output format, so we can evaluate directly:

```bash
python tools/scripts/msmarco/msmarco_passage_eval.py \
tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.distill-splade-max.tsv
```

The results should be as follows:

```
#####################
MRR @10: 0.36852691363078205
QueriesRanked: 6980
#####################
```

This corresponds to the effectiveness reported in the paper.

## Reproduction Log[*](reproducibility.md)
+ Results reproduced by [@jmmackenzie](https://github.com/jmmackenzie) on 2021-10-15 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c))
The guide has been integrated in [Anserini's regression framework](regressions-msmarco-passage-distill-splade-max.md), and this page has been reduced to a redirect stub.
87 changes: 2 additions & 85 deletions docs/experiments-msmarco-passage-unicoil-tilde-expansion.md
@@ -1,90 +1,7 @@
# Anserini: uniCOIL w/ TILDE for MS MARCO V1 Passage Ranking

This page describes how to reproduce experiments using uniCOIL with TILDE document expansion on the MS MARCO passage corpus, as described in the following paper:
This page previously hosted a guide on how to reproduce the uniCOIL + TILDE results on the MS MARCO passage corpus, as described in the following paper:

> Shengyao Zhuang and Guido Zuccon. [Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion.](https://arxiv.org/pdf/2108.08513) _arXiv:2108.08513_.

The original uniCOIL model is described here:

> Jimmy Lin and Xueguang Ma. [A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.](https://arxiv.org/abs/2106.14807) _arXiv:2106.14807_.

In the original uniCOIL paper, doc2query-T5 is used to perform document expansion, which is slow and expensive.
As an alternative, Zhuang and Zuccon proposed to use the TILDE model to expand the documents instead, resulting in a faster and cheaper process that is just as effective.
For details of how to use TILDE to expand documents, please refer to the [TILDE repo](https://github.com/ielab/TILDE).
For additional details on the original uniCOIL design (with doc2query-T5 expansion), please refer to the [COIL repo](https://github.com/luyug/COIL/tree/main/uniCOIL).

In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL + TILDE, i.e., gone through document expansion and term re-weighting.
Thus, no neural inference is involved.

Note that Pyserini provides [a comparable reproduction guide](https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil-tilde-expansion.md), so if you don't like Java, you can get _exactly_ the same results from Python.

## Data Prep

We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:

```bash
# Alternate mirrors of the same data, pick one:
wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-tilde-expansion-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/6LECmLdiaBoPwrL/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar

tar xvf collections/msmarco-passage-unicoil-tilde-expansion-b8.tar -C collections/
```

To confirm, `msmarco-passage-unicoil-tilde-expansion-b8.tar` is ~3.9 GB and has MD5 checksum `be0a786033140ebb7a984a3e155c19ae`.

## Indexing

We can now index these docs as a `JsonVectorCollection` using Anserini:

```bash
sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \
-input collections/msmarco-passage-unicoil-tilde-expansion-b8/ \
-index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
```

The important indexing options here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 document lengths into Lucene's norms (encoding them is the default behavior), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.

Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 20 minutes.

## Retrieval

To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
The queries are already stored in the repo, so we can run retrieval directly:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \
-output runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv -format msmarco \
-impact -pretokenized
```

Note that, mirroring the indexing options, we also specify `-impact -pretokenized` here.
Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 30 minutes (on a single thread).

With `-format msmarco`, runs are already in the MS MARCO output format, so we can evaluate directly:

```bash
python tools/scripts/msmarco/msmarco_passage_eval.py \
tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv
```

The results should be as follows:

```
#####################
MRR @10: 0.34957184927457136
QueriesRanked: 6980
#####################
```

This corresponds to the effectiveness reported in the paper.

## Reproduction Log[*](reproducibility.md)

+ Results reproduced by [@MXueguang](https://github.com/MXueguang) on 2021-09-14 (commit [`a05fc52`](https://github.com/castorini/anserini/commit/a05fc5215a6d9de77bd5f4b8f874f608442024a3))
+ Results reproduced by [@jmmackenzie](https://github.com/jmmackenzie) on 2021-10-15 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c))

The guide has been integrated in [Anserini's regression framework](regressions-msmarco-passage-unicoil-tilde-expansion.md), and this page has been reduced to a redirect stub.