diff --git a/README.md b/README.md index 4cf869313b..7e5a979358 100644 --- a/README.md +++ b/README.md @@ -38,12 +38,15 @@ cd tools/eval/ndeval && make && cd ../../.. With that, you should be ready to go! -## Regression Experiments +## Regression Experiments (+ Reproduction Guides) Anserini is designed to support experiments on various standard IR test collections out of the box. The following experiments are backed by [rigorous end-to-end regression tests](docs/regressions.md) with [`run_regression.py`](src/main/python/run_regression.py) and [the Anserini reproducibility promise](docs/regressions.md). For the most part, these runs are based on [_default_ parameter settings](https://github.com/castorini/Anserini/blob/master/src/main/java/io/anserini/search/SearchArgs.java). +These pages can also serve as guides to reproduce our results. +See individual pages for details! + + Regressions for [Disks 1 & 2 (TREC 1-3)](docs/regressions-disk12.md), [Disks 4 & 5 (TREC 7-8, Robust04)](docs/regressions-disk45.md), [AQUAINT (Robust05)](docs/regressions-robust05.md) + Regressions for [the New York Times Corpus (Core17)](docs/regressions-core17.md), [the Washington Post Corpus (Core18)](docs/regressions-core18.md) + Regressions for [Wt10g](docs/regressions-wt10g.md), [Gov2](docs/regressions-gov2.md) @@ -92,7 +95,7 @@ For the most part, these runs are based on [_default_ parameter settings](https: + Regressions for FIRE 2012: [Monolingual Bengali](docs/regressions-fire12-bn.md), [Monolingual Hindi](docs/regressions-fire12-hi.md), [Monolingual English](docs/regressions-fire12-en.md) + Regressions for Mr. 
TyDi (v1.1): [ar](docs/regressions-mrtydi-v1.1-ar.md), [bn](docs/regressions-mrtydi-v1.1-bn.md), [en](docs/regressions-mrtydi-v1.1-en.md), [fi](docs/regressions-mrtydi-v1.1-fi.md), [id](docs/regressions-mrtydi-v1.1-id.md), [ja](docs/regressions-mrtydi-v1.1-ja.md), [ko](docs/regressions-mrtydi-v1.1-ko.md), [ru](docs/regressions-mrtydi-v1.1-ru.md), [sw](docs/regressions-mrtydi-v1.1-sw.md), [te](docs/regressions-mrtydi-v1.1-te.md), [th](docs/regressions-mrtydi-v1.1-th.md) -## Reproduction Guides +## Additional Documentation The experiments described below are not associated with rigorous end-to-end regression testing and thus provide a lower standard of reproducibility. For the most part, manual copying and pasting of commands into a shell is required to reproduce our results. @@ -105,10 +108,6 @@ For the most part, manual copying and pasting of commands into a shell is requir + Reproducing [doc2query results](docs/experiments-doc2query.md) (MS MARCO Passage Ranking and TREC-CAR) + Reproducing [docTTTTTquery results](docs/experiments-docTTTTTquery.md) (MS MARCO Passage and Document Ranking) + Notes about reproduction issues with [MS MARCO Document Ranking w/ docTTTTTquery](docs/experiments-msmarco-doc-doc2query-details.md) -+ Reproducing experiments with sparse learned models for MS MARCO Passage Ranking: - + [DeepImpact](docs/experiments-msmarco-passage-deepimpact.md), [uniCOIL with doc2query-T5](docs/experiments-msmarco-unicoil.md), [uniCOIL with TILDE](docs/experiments-msmarco-passage-unicoil-tilde-expansion.md), [SPLADEv2](docs/experiments-msmarco-passage-splade-v2.md) -+ Reproducing experiments with sparse learned models for MS MARCO Document Ranking: - + [uniCOIL with doc2query-T5](docs/experiments-msmarco-unicoil.md) ### MS MARCO (V2) @@ -132,7 +131,7 @@ For the most part, manual copying and pasting of commands into a shell is requir + Runbook for [ECIR 2019 paper on axiomatic semantic term matching](docs/runbook-ecir2019-axiomatic.md) + Runbook for [ECIR 2019 
paper on cross-collection relevance feedback](docs/runbook-ecir2019-ccrf.md) -## Additional Documentation +### Other Features + Use Anserini in Python via [Pyserini](http://pyserini.io/) + Anserini integrates with SolrCloud via [Solrini](docs/solrini.md) diff --git a/docs/experiments-msmarco-passage-deepimpact.md b/docs/experiments-msmarco-passage-deepimpact.md index 11aedb2410..39a7a3f73b 100644 --- a/docs/experiments-msmarco-passage-deepimpact.md +++ b/docs/experiments-msmarco-passage-deepimpact.md @@ -1,79 +1,7 @@ # Anserini: DeepImpact for MS MARCO V1 Passage Ranking -This page describes how to reproduce the DeepImpact experiments in the following paper: +This page previously hosted a guide on how to reproduce the DeepImpact experiments in the following paper: > Antonio Mallia, Omar Khattab, Nicola Tonellotto, and Torsten Suel. [Learning Passage Impacts for Inverted Indexes.](https://dl.acm.org/doi/10.1145/3404835.3463030) _SIGIR 2021_. -Here, we start with a version of the MS MARCO passage corpus that has already been processed with DeepImpact, i.e., gone through document expansion and term reweighting. -Thus, no neural inference is involved. - -Note that Pyserini provides [a comparable reproduction guide](https://github.com/castorini/pyserini/blob/master/docs/experiments-deepimpact.md), so if you don't like Java, you can get _exactly_ the same results from Python. - -## Data Prep - -```bash -# Alternate mirrors of the same data, pick one: -wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact-b8.tar -P collections/ -wget https://vault.cs.uwaterloo.ca/s/57AE5aAjzw2ox2n/download -O collections/msmarco-passage-deepimpact-b8.tar - -tar xvf collections/msmarco-passage-deepimpact-b8.tar -C collections/ -``` - -To confirm, `msmarco-passage-deepimpact-b8.tar` is ~3.6 GB and has MD5 checksum `3c317cb4f9f9bcd3bbec60f05047561a`. 
- -## Indexing - -We can now index these docs as a `JsonVectorCollection` using Anserini: - -```bash -sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \ - -input collections/msmarco-passage-deepimpact-b8/ \ - -index indexes/lucene-index.msmarco-passage.deepimpact-b8 \ - -generator DefaultLuceneDocumentGenerator -impact -pretokenized \ - -threads 18 -storeRaw -``` - -The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the DeepImpact tokens. - -Upon completion, we should have an index with 8,841,823 documents. -The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 15 minutes. - -## Retrieval - -To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries. -The queries are already stored in the repo, so we can run retrieval directly: - -```bash -target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage.deepimpact-b8 \ - -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \ - -output runs/run.msmarco-passage.deepimpact-b8.tsv -format msmarco \ - -impact -pretokenized -``` - -Note that, mirroring the indexing options, we also specify `-impact -pretokenized` here. -Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 30 minutes (on a single thread). 
- -With `-format msmarco`, runs are already in the MS MARCO output format, so we can evaluate directly: - -```bash -python tools/scripts/msmarco/msmarco_passage_eval.py \ - collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.deepimpact-b8.tsv -``` - -The results should be as follows: - -``` -##################### -MRR @10: 0.3252764133351524 -QueriesRanked: 6980 -##################### -``` - -The final evaluation metric is very close to the one reported in the paper (0.326). - - -## Reproduction Log[*](reproducibility.md) - -+ Results reproduced by [@MXueguang](https://github.com/MXueguang) on 2021-06-17 (commit [`ff618db`](https://github.com/castorini/anserini/commit/ff618dbf87feee0ad75dc42c72a361c05984097d)) -+ Results reproduced by [@JMMackenzie](https://github.com/jmmackenzie) on 2021-06-22 (commit [`490434`](https://github.com/castorini/anserini/commit/490434172a035b6eade8c17771aed83cc7f5d996)) -+ Results reproduced by [@amyxie361](https://github.com/amyxie361) on 2021-06-22 (commit [`6f9352`](https://github.com/castorini/anserini/commit/6f9352fc5d6a4938fadc2bda9d0c428056eec5f0)) +The guide has been integrated in [Anserini's regression framework](regressions-msmarco-passage-deepimpact.md), and this page has been reduced to a redirect stub. diff --git a/docs/experiments-msmarco-passage-splade-v2.md b/docs/experiments-msmarco-passage-splade-v2.md index df5ef3c5d1..350ca25ca6 100644 --- a/docs/experiments-msmarco-passage-splade-v2.md +++ b/docs/experiments-msmarco-passage-splade-v2.md @@ -1,81 +1,7 @@ # Anserini: SPLADEv2 for MS MARCO V1 Passage Ranking -This page describes how to reproduce the SPLADEv2 results with the DistilSPLADE-max model from the following paper: +This page previously hosted a guide on how to reproduce the SPLADEv2 results with the DistilSPLADE-max model from the following paper: > Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant. 
[SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.](https://arxiv.org/abs/2109.10086) _arXiv:2109.10086_. -Here, we start with a version of the MS MARCO passage corpus that has already been processed with the model, i.e., gone through document expansion and term reweighting. -Thus, no neural inference is involved. As the model weights are provided in fp16, they have been converted to integers by taking the round of weight*100. - -Note that Pyserini provides [a comparable reproduction guide](https://github.com/castorini/pyserini/blob/master/docs/experiments-spladev2.md), so if you don't like Java, you can get _exactly_ the same results from Python. - -## Data Prep - -We're going to use the repository's root directory as the working directory. -First, we need to download and extract the MS MARCO passage dataset with SPLADE processing: - -```bash -# Alternate mirrors of the same data, pick one: -wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/ -wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msmarco-passage-distill-splade-max.tar - -tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/ -``` - -To confirm, `msmarco-passage-distill-splade-max.tar` is ~9.8 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`. 
- -## Indexing - -We can now index these docs as a `JsonVectorCollection` using Anserini: - -```bash -sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \ - -input collections/msmarco-passage-distill-splade-max \ - -index indexes/lucene-index.msmarco-passage.distill-splade-max \ - -generator DefaultLuceneDocumentGenerator -impact -pretokenized \ - -threads 12 -``` - -The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the SPLADEv2 tokens. - -Upon completion, we should have an index with 8,841,823 documents. -The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 30 minutes. - -## Retrieval - -To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries. -The queries are already stored in the repo, so we can run retrieval directly: - -```bash -target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage.distill-splade-max \ - -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \ - -output runs/run.msmarco-passage.distill-splade-max.tsv -format msmarco \ - -impact -pretokenized -``` - -Note that, mirroring the indexing options, we also specify `-impact -pretokenized` here. -Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 4 hours (on a single thread). -No, this isn't a mistake! -This model suffers from very slow queries with Lucene due to some yet unknown issue. -We're looking into it. 
- -With `-format msmarco`, runs are already in the MS MARCO output format, so we can evaluate directly: - -```bash -python tools/scripts/msmarco/msmarco_passage_eval.py \ - tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.distill-splade-max.tsv -``` - -The results should be as follows: - -``` -##################### -MRR @10: 0.36852691363078205 -QueriesRanked: 6980 -##################### -``` - -This corresponds to the effectiveness reported in the paper. - -## Reproduction Log[*](reproducibility.md) -+ Results reproduced by [@jmmackenzie](https://github.com/jmmackenzie) on 2021-10-15 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c)) +The guide has been integrated in [Anserini's regression framework](regressions-msmarco-passage-distill-splade-max.md), and this page has been reduced to a redirect stub. diff --git a/docs/experiments-msmarco-passage-unicoil-tilde-expansion.md b/docs/experiments-msmarco-passage-unicoil-tilde-expansion.md index 56c3650e05..bd5990bf1f 100644 --- a/docs/experiments-msmarco-passage-unicoil-tilde-expansion.md +++ b/docs/experiments-msmarco-passage-unicoil-tilde-expansion.md @@ -1,90 +1,7 @@ # Anserini: uniCOIL w/ TILDE for MS MARCO V1 Passage Ranking -This page describes how to reproduce experiments using uniCOIL with TILDE document expansion on the MS MARCO passage corpus, as described in the following paper: +This page previously hosted a guide on how to reproduce the uniCOIL + TILDE results on the MS MARCO passage corpus, as described in the following paper: > Shengyao Zhuang and Guido Zuccon. [Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion.](https://arxiv.org/pdf/2108.08513) _arXiv:2108.08513_. -The original uniCOIL model is described here: - -> Jimmy Lin and Xueguang Ma. 
[A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.](https://arxiv.org/abs/2106.14807) _arXiv:2106.14807_. - -In the original uniCOIL paper, doc2query-T5 is used to perform document expansion, which is slow and expensive. -As an alternative, Zhuang and Zuccon proposed to use the TILDE model to expand the documents instead, resulting in a faster and cheaper process that is just as effective. -For details of how to use TILDE to expand documents, please refer to the [TILDE repo](https://github.com/ielab/TILDE). -For additional details on the original uniCOIL design (with doc2query-T5 expansion), please refer to the [COIL repo](https://github.com/luyug/COIL/tree/main/uniCOIL). - -In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL + TILDE, i.e., gone through document expansion and term re-weighting. -Thus, no neural inference is involved. - -Note that Pyserini provides [a comparable reproduction guide](https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil-tilde-expansion.md), so if you don't like Java, you can get _exactly_ the same results from Python. - -## Data Prep - -We're going to use the repository's root directory as the working directory. -First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing: - -```bash -# Alternate mirrors of the same data, pick one: -wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-tilde-expansion-b8.tar -P collections/ -wget https://vault.cs.uwaterloo.ca/s/6LECmLdiaBoPwrL/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar - -tar xvf collections/msmarco-passage-unicoil-tilde-expansion-b8.tar -C collections/ -``` - -To confirm, `msmarco-passage-unicoil-tilde-expansion-b8.tar` is ~3.9 GB and has MD5 checksum `be0a786033140ebb7a984a3e155c19ae`. 
- -## Indexing - -We can now index these docs as a `JsonVectorCollection` using Anserini: - -```bash -sh target/appassembler/bin/IndexCollection -collection JsonVectorCollection \ - -input collections/msmarco-passage-unicoil-tilde-expansion-b8/ \ - -index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \ - -generator DefaultLuceneDocumentGenerator -impact -pretokenized \ - -threads 12 -``` - -The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens. - -Upon completion, we should have an index with 8,841,823 documents. -The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 20 minutes. - -## Retrieval - -To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries. -The queries are already stored in the repo, so we can run retrieval directly: - -```bash -target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \ - -topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \ - -output runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv -format msmarco \ - -impact -pretokenized -``` - -Note that, mirroring the indexing options, we also specify `-impact -pretokenized` here. -Query evaluation is much slower than with bag-of-words BM25; a complete run takes around 30 minutes (on a single thread). 
- -With `-format msmarco`, runs are already in the MS MARCO output format, so we can evaluate directly: - -```bash -python tools/scripts/msmarco/msmarco_passage_eval.py \ - tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv -``` - -The results should be as follows: - -``` -##################### -MRR @10: 0.34957184927457136 -QueriesRanked: 6980 -##################### -``` - -This corresponds to the effectiveness reported in the paper. - -## Reproduction Log[*](reproducibility.md) - -+ Results reproduced by [@MXueguang](https://github.com/MXueguang) on 2021-09-14 (commit [`a05fc52`](https://github.com/castorini/anserini/commit/a05fc5215a6d9de77bd5f4b8f874f608442024a3)) -+ Results reproduced by [@jmmackenzie](https://github.com/jmmackenzie) on 2021-10-15 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c)) - +The guide has been integrated in [Anserini's regression framework](regressions-msmarco-passage-unicoil-tilde-expansion.md), and this page has been reduced to a redirect stub. diff --git a/docs/regressions-msmarco-passage-deepimpact.md b/docs/regressions-msmarco-passage-deepimpact.md index 0cea4e8b53..27134dac62 100644 --- a/docs/regressions-msmarco-passage-deepimpact.md +++ b/docs/regressions-msmarco-passage-deepimpact.md @@ -1,12 +1,10 @@ -# Anserini: Regressions for DeepImpact on [MS MARCO Passage](https://github.com/microsoft/MSMARCO-Passage-Ranking) +# Anserini: Regressions on MS MARCO Passage with DeepImpact -This page documents regression experiments for DeepImpact on the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework. -DeepImpact is described in the following paper: +This page describes regression experiments, integrated into Anserini's regression testing framework, with DeepImpact on the [MS MARCO Passage Ranking Task](https://github.com/microsoft/MSMARCO-Passage-Ranking). 
+The DeepImpact model is described in the following paper: > Antonio Mallia, Omar Khattab, Nicola Tonellotto, and Torsten Suel. [Learning Passage Impacts for Inverted Indexes.](https://dl.acm.org/doi/10.1145/3404835.3463030) _SIGIR 2021_. -For more complete instructions on how to run end-to-end experiments, refer to [this page](experiments-msmarco-passage-deepimpact.md). - The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-passage-deepimpact.yaml). Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-passage-deepimpact.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. @@ -16,9 +14,33 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-deepimpact ``` +## Corpus + +We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). +Thus, no neural inference is involved. + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact.tar -P collections/ + +tar xvf collections/msmarco-passage-deepimpact.tar -C collections/ +``` + +To confirm, `msmarco-passage-deepimpact.tar` is 3.6 GB and has MD5 checksum `fe827eb13ca3270bebe26b3f6b99f550`. + +With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: + +``` +python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-deepimpact \ + --corpus-path collections/msmarco-passage-deepimpact +``` + +Alternatively, you can simply copy/paste from the commands below and obtain the same results. 
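+As a sanity check, the download can also be verified programmatically; a minimal sketch using only the Python standard library (the expected digest is the one listed above, and the chunked read keeps memory use constant for a multi-gigabyte tar):

```python
import hashlib

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file without reading it into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "fe827eb13ca3270bebe26b3f6b99f550"  # MD5 checksum listed above
# Uncomment once the tar has been downloaded:
# assert md5_of("collections/msmarco-passage-deepimpact.tar") == EXPECTED
```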
+ ## Indexing -Typical indexing command: +Sample indexing command: ``` target/appassembler/bin/IndexCollection \ @@ -30,8 +52,10 @@ target/appassembler/bin/IndexCollection \ >& logs/log.msmarco-passage-deepimpact & ``` -The directory `/path/to/msmarco-passage-deepimpact/` should be a directory containing the compressed `jsonl` files that comprise the corpus. -See [this page](experiments-msmarco-passage-deepimpact.md) for additional details. +The path `/path/to/msmarco-passage-deepimpact/` should point to the corpus downloaded above. + +The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the DeepImpact tokens. +Upon completion, we should have an index with 8,841,823 documents. For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -80,12 +104,12 @@ In order to reproduce results reported in the paper, we need to convert to MS MA ```bash python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \ - --input runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \ - --output runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv.gz.msmarco --quiet + --input runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.txt \ + --output runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv --quiet python tools/scripts/msmarco/msmarco_passage_eval.py \ tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ - runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv.gz.msmarco + runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv ``` The results should be as follows: @@ -98,3 +122,11 @@ 
QueriesRanked: 6980 ``` The final evaluation metric is very close to the one reported in the paper (0.326). + +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](../src/main/resources/docgen/templates/msmarco-passage-deepimpact.template) and run `bin/build.sh` to rebuild the documentation. + ++ Results reproduced by [@MXueguang](https://github.com/MXueguang) on 2021-06-17 (commit [`ff618db`](https://github.com/castorini/anserini/commit/ff618dbf87feee0ad75dc42c72a361c05984097d)) ++ Results reproduced by [@JMMackenzie](https://github.com/jmmackenzie) on 2021-06-22 (commit [`490434`](https://github.com/castorini/anserini/commit/490434172a035b6eade8c17771aed83cc7f5d996)) ++ Results reproduced by [@amyxie361](https://github.com/amyxie361) on 2021-06-22 (commit [`6f9352`](https://github.com/castorini/anserini/commit/6f9352fc5d6a4938fadc2bda9d0c428056eec5f0)) diff --git a/docs/regressions-msmarco-passage-distill-splade-max.md b/docs/regressions-msmarco-passage-distill-splade-max.md index 5e37c62e1f..33a9e9c366 100644 --- a/docs/regressions-msmarco-passage-distill-splade-max.md +++ b/docs/regressions-msmarco-passage-distill-splade-max.md @@ -1,12 +1,10 @@ -# Anserini: Regressions for SPLADEv2 on [MS MARCO Passage](https://github.com/microsoft/MSMARCO-Passage-Ranking) +# Anserini: Regressions on MS MARCO Passage with DistilSPLADE-max -This page documents regression experiments for the DistilSPLADE-max model from SPLADEv2 on the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework. -The model is described in the following paper: +This page describes regression experiments, integrated into Anserini's regression testing framework, with the DistilSPLADE-max model from SPLADEv2 on the [MS MARCO Passage Ranking Task](https://github.com/microsoft/MSMARCO-Passage-Ranking). 
+The DistilSPLADE-max model is described in the following paper: > Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant. [SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.](https://arxiv.org/abs/2109.10086) _arXiv:2109.10086_. -For more complete instructions on how to run end-to-end experiments, refer to [this page](experiments-msmarco-passage-splade-v2.md). - The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-passage-distill-splade-max.yaml). Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. @@ -16,9 +14,36 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-distill-splade-max ``` +## Corpus + +We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). +Thus, no neural inference is involved. + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/ + +# Alternate mirror: +# wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msmarco-passage-distill-splade-max.tar + +tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/ +``` + +To confirm, `msmarco-passage-distill-splade-max.tar` is 9.9 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`. 
+ +With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: + +``` +python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-distill-splade-max \ + --corpus-path collections/msmarco-passage-distill-splade-max +``` + +Alternatively, you can simply copy/paste from the commands below and obtain the same results. + ## Indexing -Typical indexing command: +Sample indexing command: ``` target/appassembler/bin/IndexCollection \ @@ -30,8 +55,10 @@ target/appassembler/bin/IndexCollection \ >& logs/log.msmarco-passage-distill-splade-max & ``` -The directory `/path/to/msmarco-passage-splade-v2/` should be a directory containing the compressed `jsonl` files that comprise the corpus. -See [this page](experiments-msmarco-passage-splade-v2.md) for additional details. +The path `/path/to/msmarco-passage-distill-splade-max/` should point to the corpus downloaded above. + +The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the SPLADEv2 tokens. +Upon completion, we should have an index with 8,841,823 documents. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
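+The corpus above was produced by quantizing the model's fp16 term weights into integer impacts; per the original guide, the integers are obtained by taking the round of weight*100. A sketch of that preprocessing step on a made-up passage (the `id`/`vector` field names follow Anserini's `JsonVectorCollection` layout; the weights are illustrative, not actual model output):

```python
import json

def quantize(weights: dict[str, float], scale: int = 100) -> dict[str, int]:
    """Round each fp16 term weight to an integer impact, dropping terms that quantize to zero."""
    quantized = {term: round(w * scale) for term, w in weights.items()}
    return {term: v for term, v in quantized.items() if v > 0}

# Illustrative SPLADE output for one passage (term -> fp16 weight):
weights = {"splade": 1.5, "sparse": 0.75, "the": 0.004}
doc = {"id": "0", "vector": quantize(weights)}
line = json.dumps(doc)  # one line of a JsonVectorCollection .jsonl file
```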
@@ -80,12 +107,12 @@ In order to reproduce results reported in the paper, we need to convert to MS MA ```bash python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \ - --input runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \ - --output runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz.msmarco --quiet + --input runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.txt \ + --output runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv --quiet python tools/scripts/msmarco/msmarco_passage_eval.py \ tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ - runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz.msmarco + runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv ``` The results should be as follows: @@ -97,4 +124,10 @@ QueriesRanked: 6980 ##################### ``` -This corresponds to the effectiveness reported in the paper. \ No newline at end of file +This corresponds to the effectiveness reported in the paper. + +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](../src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template) and run `bin/build.sh` to rebuild the documentation. 
+ ++ Results reproduced by [@jmmackenzie](https://github.com/jmmackenzie) on 2021-10-15 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c)) diff --git a/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md b/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md index 4a5b6310dc..411f4746d3 100644 --- a/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md +++ b/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md @@ -1,12 +1,10 @@ -# Anserini: Regressions for uniCOIL w/ TILDE on [MS MARCO Passage](https://github.com/microsoft/MSMARCO-Passage-Ranking) +# Anserini: Regressions on MS MARCO Passage with uniCOIL+TILDE -This page documents regression experiments for uniCOIL w/ TILDE document expansion on the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework. -The model is described in the following paper: +This page describes regression experiments, integrated into Anserini's regression testing framework, with uniCOIL+TILDE on the [MS MARCO Passage Ranking Task](https://github.com/microsoft/MSMARCO-Passage-Ranking). +The uniCOIL+TILDE model is described in the following paper: > Shengyao Zhuang and Guido Zuccon. [Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion.](https://arxiv.org/pdf/2108.08513) _arXiv:2108.08513_. -For more complete instructions on how to run end-to-end experiments, refer to [this page](experiments-msmarco-passage-unicoil-tilde-expansion.md). - The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml). Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. 
@@ -16,9 +14,33 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-unicoil-tilde-expansion ``` +## Corpus + +We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). +Thus, no neural inference is involved. + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-tilde-expansion.tar -P collections/ + +tar xvf collections/msmarco-passage-unicoil-tilde-expansion.tar -C collections/ +``` + +To confirm, `msmarco-passage-unicoil-tilde-expansion.tar` is 3.9 GB and has MD5 checksum `1685aee10071441987ad87f2e91f1706`. + +With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: + +``` +python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-unicoil-tilde-expansion \ + --corpus-path collections/msmarco-passage-unicoil-tilde-expansion +``` + +Alternatively, you can simply copy/paste from the commands below and obtain the same results. + ## Indexing -Typical indexing command: +Sample indexing command: ``` target/appassembler/bin/IndexCollection \ @@ -30,8 +52,10 @@ target/appassembler/bin/IndexCollection \ >& logs/log.msmarco-passage-unicoil-tilde-expansion & ``` -The directory `/path/to/msmarco-passage-unicoil-tilde-expansion/` should be a directory containing the compressed `jsonl` files that comprise the corpus. -See [this page](experiments-msmarco-passage-unicoil-tilde-expansion.md) for additional details. +The path `/path/to/msmarco-passage-unicoil-tilde-expansion/` should point to the corpus downloaded above. 
+ +The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL+TILDE tokens. +Upon completion, we should have an index with 8,841,823 documents. For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -80,12 +104,12 @@ In order to reproduce results reported in the paper, we need to convert to MS MA ```bash python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \ - --input runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \ - --output runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz.msmarco --quiet + --input runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.txt \ + --output runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv --quiet python tools/scripts/msmarco/msmarco_passage_eval.py \ tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ - runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz.msmarco + runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv ``` The results should be as follows: @@ -97,4 +121,11 @@ QueriesRanked: 6980 ##################### ``` -This corresponds to the effectiveness reported in the paper. \ No newline at end of file +This corresponds to the effectiveness reported in the paper. 
+ +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](../src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template) and run `bin/build.sh` to rebuild the documentation. + ++ Results reproduced by [@MXueguang](https://github.com/MXueguang) on 2021-09-14 (commit [`a05fc52`](https://github.com/castorini/anserini/commit/a05fc5215a6d9de77bd5f4b8f874f608442024a3)) ++ Results reproduced by [@jmmackenzie](https://github.com/jmmackenzie) on 2021-10-15 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c)) diff --git a/docs/regressions-msmarco-passage-unicoil.md b/docs/regressions-msmarco-passage-unicoil.md index 6759df6f8e..ce1be80c4c 100644 --- a/docs/regressions-msmarco-passage-unicoil.md +++ b/docs/regressions-msmarco-passage-unicoil.md @@ -105,12 +105,12 @@ In order to reproduce results reported in the paper, we need to convert to MS MA ```bash python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \ - --input runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv.gz \ - --output runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv.gz.msmarco --quiet + --input runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.txt \ + --output runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv --quiet python tools/scripts/msmarco/msmarco_passage_eval.py \ tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ - runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv.gz.msmarco + runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv ``` The results should be as follows: diff --git a/src/main/resources/docgen/templates/msmarco-passage-deepimpact.template b/src/main/resources/docgen/templates/msmarco-passage-deepimpact.template index 
7cb2a41ff0..aaf96d9bcf 100644 --- a/src/main/resources/docgen/templates/msmarco-passage-deepimpact.template +++ b/src/main/resources/docgen/templates/msmarco-passage-deepimpact.template @@ -1,12 +1,10 @@ -# Anserini: Regressions for DeepImpact on [MS MARCO Passage](https://github.com/microsoft/MSMARCO-Passage-Ranking) +# Anserini: Regressions on MS MARCO Passage with DeepImpact -This page documents regression experiments for DeepImpact on the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework. -DeepImpact is described in the following paper: +This page describes regression experiments, integrated into Anserini's regression testing framework, with DeepImpact on the [MS MARCO Passage Ranking Task](https://github.com/microsoft/MSMARCO-Passage-Ranking). +The DeepImpact model is described in the following paper: > Antonio Mallia, Omar Khattab, Nicola Tonellotto, and Torsten Suel. [Learning Passage Impacts for Inverted Indexes.](https://dl.acm.org/doi/10.1145/3404835.3463030) _SIGIR 2021_. -For more complete instructions on how to run end-to-end experiments, refer to [this page](experiments-msmarco-passage-deepimpact.md). - The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. @@ -16,16 +14,42 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` +## Corpus + +We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). +Thus, no neural inference is involved. 
+ +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact.tar -P collections/ + +tar xvf collections/msmarco-passage-deepimpact.tar -C collections/ +``` + +To confirm, `msmarco-passage-deepimpact.tar` is 3.6 GB and has MD5 checksum `fe827eb13ca3270bebe26b3f6b99f550`. + +With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: + +``` +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ + --corpus-path collections/${corpus} +``` + +Alternatively, you can simply copy/paste from the commands below and obtain the same results. + ## Indexing -Typical indexing command: +Sample indexing command: ``` ${index_cmds} ``` -The directory `/path/to/msmarco-passage-deepimpact/` should be a directory containing the compressed `jsonl` files that comprise the corpus. -See [this page](experiments-msmarco-passage-deepimpact.md) for additional details. +The path `/path/to/${corpus}/` should point to the corpus downloaded above. + +The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the DeepImpact tokens. +Upon completion, we should have an index with 8,841,823 documents. For additional details, see explanation of [common indexing options](common-indexing-options.md). 
@@ -57,12 +81,12 @@ In order to reproduce results reported in the paper, we need to convert to MS MA ```bash python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \ - --input runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \ - --output runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv.gz.msmarco --quiet + --input runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.txt \ + --output runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv --quiet python tools/scripts/msmarco/msmarco_passage_eval.py \ tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ - runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv.gz.msmarco + runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.tsv ``` The results should be as follows: @@ -75,3 +99,11 @@ QueriesRanked: 6980 ``` The final evaluation metric is very close to the one reported in the paper (0.326). + +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](${template}) and run `bin/build.sh` to rebuild the documentation. 
+ ++ Results reproduced by [@MXueguang](https://github.com/MXueguang) on 2021-06-17 (commit [`ff618db`](https://github.com/castorini/anserini/commit/ff618dbf87feee0ad75dc42c72a361c05984097d)) ++ Results reproduced by [@JMMackenzie](https://github.com/jmmackenzie) on 2021-06-22 (commit [`490434`](https://github.com/castorini/anserini/commit/490434172a035b6eade8c17771aed83cc7f5d996)) ++ Results reproduced by [@amyxie361](https://github.com/amyxie361) on 2021-06-22 (commit [`6f9352`](https://github.com/castorini/anserini/commit/6f9352fc5d6a4938fadc2bda9d0c428056eec5f0)) diff --git a/src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template b/src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template index 7a702aeaa2..5f519f5083 100644 --- a/src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template +++ b/src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template @@ -1,12 +1,10 @@ -# Anserini: Regressions for SPLADEv2 on [MS MARCO Passage](https://github.com/microsoft/MSMARCO-Passage-Ranking) +# Anserini: Regressions on MS MARCO Passage with DistilSPLADE-max -This page documents regression experiments for the DistilSPLADE-max model from SPLADEv2 on the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework. -The model is described in the following paper: +This page describes regression experiments, integrated into Anserini's regression testing framework, with the DistilSPLADE-max model from SPLADEv2 on the [MS MARCO Passage Ranking Task](https://github.com/microsoft/MSMARCO-Passage-Ranking). +The DistilSPLADE-max model is described in the following paper: > Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant. [SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.](https://arxiv.org/abs/2109.10086) _arXiv:2109.10086_. 
-For more complete instructions on how to run end-to-end experiments, refer to [this page](experiments-msmarco-passage-splade-v2.md). - The exact configurations for these regressions are stored in [this YAML file](${yaml}). Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. @@ -16,16 +14,45 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` +## Corpus + +We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). +Thus, no neural inference is involved. + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/ + +# Alternate mirror: +# wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msmarco-passage-distill-splade-max.tar + +tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/ +``` + +To confirm, `msmarco-passage-distill-splade-max.tar` is 9.9 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`. + +With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: + +``` +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ + --corpus-path collections/${corpus} +``` + +Alternatively, you can simply copy/paste from the commands below and obtain the same results. + ## Indexing -Typical indexing command: +Sample indexing command: ``` ${index_cmds} ``` -The directory `/path/to/msmarco-passage-splade-v2/` should be a directory containing the compressed `jsonl` files that comprise the corpus. 
-See [this page](experiments-msmarco-passage-splade-v2.md) for additional details. +The path `/path/to/${corpus}/` should point to the corpus downloaded above. + +The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the SPLADEv2 tokens. +Upon completion, we should have an index with 8,841,823 documents. For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -57,12 +84,12 @@ In order to reproduce results reported in the paper, we need to convert to MS MA ```bash python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \ - --input runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \ - --output runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz.msmarco --quiet + --input runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.txt \ + --output runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv --quiet python tools/scripts/msmarco/msmarco_passage_eval.py \ tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ - runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz.msmarco + runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.tsv ``` The results should be as follows: @@ -74,4 +101,10 @@ QueriesRanked: 6980 ##################### ``` -This corresponds to the effectiveness reported in the paper. \ No newline at end of file +This corresponds to the effectiveness reported in the paper. 
+ +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](${template}) and run `bin/build.sh` to rebuild the documentation. + ++ Results reproduced by [@jmmackenzie](https://github.com/jmmackenzie) on 2021-10-15 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c)) diff --git a/src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template b/src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template index d1293c07bc..cc55c2bfc9 100644 --- a/src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template +++ b/src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template @@ -1,12 +1,10 @@ -# Anserini: Regressions for uniCOIL w/ TILDE on [MS MARCO Passage](https://github.com/microsoft/MSMARCO-Passage-Ranking) +# Anserini: Regressions on MS MARCO Passage with uniCOIL+TILDE -This page documents regression experiments for uniCOIL w/ TILDE document expansion on the MS MARCO Passage Ranking Task, which is integrated into Anserini's regression testing framework. -The model is described in the following paper: +This page describes regression experiments, integrated into Anserini's regression testing framework, with uniCOIL+TILDE on the [MS MARCO Passage Ranking Task](https://github.com/microsoft/MSMARCO-Passage-Ranking). +The uniCOIL+TILDE model is described in the following paper: > Shengyao Zhuang and Guido Zuccon. [Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion.](https://arxiv.org/pdf/2108.08513) _arXiv:2108.08513_. -For more complete instructions on how to run end-to-end experiments, refer to [this page](experiments-msmarco-passage-unicoil-tilde-expansion.md). - The exact configurations for these regressions are stored in [this YAML file](${yaml}). 
Note that this page is automatically generated from [this template](${template}) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead. @@ -16,16 +14,42 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` +## Corpus + +We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). +Thus, no neural inference is involved. + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-tilde-expansion.tar -P collections/ + +tar xvf collections/msmarco-passage-unicoil-tilde-expansion.tar -C collections/ +``` + +To confirm, `msmarco-passage-unicoil-tilde-expansion.tar` is 3.9 GB and has MD5 checksum `1685aee10071441987ad87f2e91f1706`. + +With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: + +``` +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ + --corpus-path collections/${corpus} +``` + +Alternatively, you can simply copy/paste from the commands below and obtain the same results. + ## Indexing -Typical indexing command: +Sample indexing command: ``` ${index_cmds} ``` -The directory `/path/to/msmarco-passage-unicoil-tilde-expansion/` should be a directory containing the compressed `jsonl` files that comprise the corpus. -See [this page](experiments-msmarco-passage-unicoil-tilde-expansion.md) for additional details. +The path `/path/to/${corpus}/` should point to the corpus downloaded above. 
+ +The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doc lengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL+TILDE tokens. +Upon completion, we should have an index with 8,841,823 documents. For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -57,12 +81,12 @@ In order to reproduce results reported in the paper, we need to convert to MS MA ```bash python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \ - --input runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \ - --output runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz.msmarco --quiet + --input runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.txt \ + --output runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv --quiet python tools/scripts/msmarco/msmarco_passage_eval.py \ tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ - runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz.msmarco + runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv ``` The results should be as follows: @@ -74,4 +98,11 @@ QueriesRanked: 6980 ##################### ``` -This corresponds to the effectiveness reported in the paper. \ No newline at end of file +This corresponds to the effectiveness reported in the paper. 
+ +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](${template}) and run `bin/build.sh` to rebuild the documentation. + ++ Results reproduced by [@MXueguang](https://github.com/MXueguang) on 2021-09-14 (commit [`a05fc52`](https://github.com/castorini/anserini/commit/a05fc5215a6d9de77bd5f4b8f874f608442024a3)) ++ Results reproduced by [@jmmackenzie](https://github.com/jmmackenzie) on 2021-10-15 (commit [`52b76f6`](https://github.com/castorini/anserini/commit/52b76f63b163036e8fad1a6e1b10b431b4ddd06c)) diff --git a/src/main/resources/docgen/templates/msmarco-passage-unicoil.template b/src/main/resources/docgen/templates/msmarco-passage-unicoil.template index 571b2b68d5..f89b5fedb8 100644 --- a/src/main/resources/docgen/templates/msmarco-passage-unicoil.template +++ b/src/main/resources/docgen/templates/msmarco-passage-unicoil.template @@ -82,12 +82,12 @@ In order to reproduce results reported in the paper, we need to convert to MS MA ```bash python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \ - --input runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv.gz \ - --output runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv.gz.msmarco --quiet + --input runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.txt \ + --output runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv --quiet python tools/scripts/msmarco/msmarco_passage_eval.py \ tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ - runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv.gz.msmarco + runs/run.msmarco-passage-unicoil.unicoil.topics.msmarco-passage.dev-subset.unicoil.tsv ``` The results should be as follows: diff --git a/src/main/resources/regression/msmarco-passage-deepimpact.yaml b/src/main/resources/regression/msmarco-passage-deepimpact.yaml index 
04a9fafc6b..0ec831c0c0 100644 --- a/src/main/resources/regression/msmarco-passage-deepimpact.yaml +++ b/src/main/resources/regression/msmarco-passage-deepimpact.yaml @@ -1,6 +1,6 @@ --- corpus: msmarco-passage-deepimpact -corpus_path: collections/msmarco/msmarco-passage-deepimpact-b8/ +corpus_path: collections/msmarco/msmarco-passage-deepimpact/ index_path: indexes/lucene-index.msmarco-passage-deepimpact/ collection_class: JsonVectorCollection diff --git a/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml b/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml index 0e9ebbe91d..50490e7e7e 100644 --- a/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml +++ b/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml @@ -1,6 +1,6 @@ --- corpus: msmarco-passage-unicoil-tilde-expansion -corpus_path: collections/msmarco/msmarco-passage-unicoil-tilde-expansion-b8/ +corpus_path: collections/msmarco/msmarco-passage-unicoil-tilde-expansion/ index_path: indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion/ collection_class: JsonVectorCollection
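A note on the recurring run-file renames in these hunks: the TREC-format run files move from `.tsv.gz` to `.txt` and the converted MS MARCO-format files from `.tsv.gz.msmarco` to `.tsv`, with `convert_trec_to_msmarco_run.py` performing the conversion between the two formats. As a minimal sketch of what that conversion involves (an illustration under the standard format definitions, not the actual script): a six-column TREC run line `qid Q0 docid rank score tag` becomes a three-column tab-separated MS MARCO line `qid docid rank`.

```python
# Minimal sketch of the TREC -> MS MARCO run-format conversion referenced above.
# Illustrative only -- not the actual tools/scripts/msmarco/convert_trec_to_msmarco_run.py.

def convert_trec_to_msmarco(trec_lines):
    """Turn 'qid Q0 docid rank score tag' lines into 'qid<TAB>docid<TAB>rank' lines."""
    msmarco_lines = []
    for line in trec_lines:
        qid, _q0, docid, rank, _score, _tag = line.split()
        msmarco_lines.append(f"{qid}\t{docid}\t{rank}")
    return msmarco_lines

# Hypothetical example lines (docids and scores are made up):
trec_run = [
    "1048585 Q0 7187158 1 17.35 Anserini",
    "1048585 Q0 7187157 2 16.91 Anserini",
]
for line in convert_trec_to_msmarco(trec_run):
    print(line)
```

The score and tag columns are dropped because `msmarco_passage_eval.py` computes MRR@10 from rank positions alone.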