**Most of the code in this repository was copied from the original BERT repository.**
This repository contains the code to reproduce our entry to the MS MARCO passage ranking task, which placed first by a large margin over the second-place entry.
MSMARCO Passage Re-Ranking Leaderboard (Jan 8th 2019) | Eval MRR@10 | Dev MRR@10
--- | --- | ---
1st Place - BERT (this code) | 35.87 | 36.53
2nd Place - IRNet | 28.06 | 27.80
3rd Place - Conv-KNRM | 27.12 | 29.02
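For reference, MRR@10 (the metric in the table above) is the mean reciprocal rank of the highest-ranked relevant passage, counting only the top 10 results:

$$\mathrm{MRR@10} = \frac{1}{|Q|}\sum_{q \in Q} \begin{cases}1/\mathrm{rank}_q & \text{if } \mathrm{rank}_q \le 10\\ 0 & \text{otherwise}\end{cases}$$

where rank_q is the rank of the first relevant passage returned for query q (queries with no relevant passage in the top 10 contribute 0).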
The paper describing our implementation, [Passage Re-ranking with BERT](https://arxiv.org/abs/1901.04085), is available on arXiv; see the citation at the end of this README.
First, we need to download and extract MS MARCO and BERT files:
DATA_DIR=./data
mkdir ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.eval.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/qrels.dev.tsv -P ${DATA_DIR}
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip -P ${DATA_DIR}
tar -xvf ${DATA_DIR}/triples.train.small.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.dev.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.eval.tar.gz -C ${DATA_DIR}
unzip ${DATA_DIR}/uncased_L-24_H-1024_A-16.zip -d ${DATA_DIR}
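If you want to sanity-check the downloads before converting them, the extracted files are plain TSVs. The column layouts in the comments below are our understanding of the MS MARCO distributions, so verify them against your copies; a minimal sketch, assuming `DATA_DIR=./data` as above:

```python
# Quick sanity check of the extracted MS MARCO files (paths assume DATA_DIR=./data).
# Expected layouts (verify against your download):
#   triples.train.small.tsv: <query>\t<relevant passage>\t<non-relevant passage>
#   top1000.dev.tsv:         <query_id>\t<passage_id>\t<query>\t<passage>
#   qrels.dev.tsv:           <query_id>\t0\t<passage_id>\t1
DATA_DIR = "./data"

with open(f"{DATA_DIR}/triples.train.small.tsv", encoding="utf-8") as f:
    query, positive, negative = f.readline().rstrip("\n").split("\t")

print("query:   ", query)
print("positive:", positive[:80] + "...")
print("negative:", negative[:80] + "...")
```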
Next, we need to convert the MS MARCO train, dev, and eval files to tfrecord files, which will later be consumed by BERT.
mkdir ${DATA_DIR}/tfrecord
python convert_msmarco_to_tfrecord.py \
--tfrecord_folder=${DATA_DIR}/tfrecord \
--vocab_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/vocab.txt \
--train_dataset_path=${DATA_DIR}/triples.train.small.tsv \
--dev_dataset_path=${DATA_DIR}/top1000.dev.tsv \
--eval_dataset_path=${DATA_DIR}/top1000.eval.tsv \
--dev_qrels_path=${DATA_DIR}/qrels.dev.tsv \
--max_query_length=64 \
--max_seq_length=512 \
--num_eval_docs=1000
This conversion takes 30-40 hours. Alternatively, you can download the tfrecord files here (~23GB).
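For intuition, the conversion roughly amounts to tokenizing each query-passage pair with BERT's WordPiece vocabulary, truncating to `max_query_length` and `max_seq_length`, and serializing the token ids into `tf.train.Example` records. The sketch below illustrates that idea using the `tokenization` module from the original BERT repository; the feature names and truncation details in the real `convert_msmarco_to_tfrecord.py` may differ.

```python
# Illustrative sketch only (not the actual conversion script): tokenize a
# query-passage pair and pack the ids into a tf.train.Example.
import tensorflow as tf
import tokenization  # tokenization.py from the original BERT repository

tokenizer = tokenization.FullTokenizer(
    vocab_file="./data/uncased_L-24_H-1024_A-16/vocab.txt", do_lower_case=True)

def to_tf_example(query, passage, label, max_query_length=64, max_seq_length=512):
    query_ids = tokenizer.convert_tokens_to_ids(
        tokenizer.tokenize(query))[:max_query_length]
    passage_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(passage))
    # Keep room for the query; the real pipeline also reserves space for the
    # [CLS]/[SEP] tokens that are added when the pair is fed to BERT.
    passage_ids = passage_ids[:max_seq_length - len(query_ids) - 3]
    feature = {
        "query_ids": tf.train.Feature(
            int64_list=tf.train.Int64List(value=query_ids)),
        "doc_ids": tf.train.Feature(
            int64_list=tf.train.Int64List(value=passage_ids)),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
```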
We can now start training. We highly recommend using TPUs, which are available for free in Google Colab; a modern V100 GPU with 16GB of memory cannot fit even a small batch size of 2 when training a BERT Large model.
If you opt not to use Colab, here is the command line to start training:
python run.py \
--data_dir=${DATA_DIR}/tfrecord \
--bert_config_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_config.json \
--init_checkpoint=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_model.ckpt \
--output_dir=${DATA_DIR}/output \
--msmarco_output=True \
--do_train=True \
--do_eval=True \
--num_train_steps=400000 \
--num_warmup_steps=40000 \
--train_batch_size=32 \
--eval_batch_size=32 \
--learning_rate=1e-6
Training for 400k iterations takes approximately 70 hours on a TPU v2. Alternatively, you can download the trained model used in our submission here (~3.4GB).
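Once the dev run finishes, `--msmarco_output=True` writes an MS MARCO-style run file whose lines we understand to be `<query_id>\t<passage_id>\t<rank>`. You can score it with the official MS MARCO evaluation script; the sketch below computes MRR@10 the same way under those format assumptions (the output file name `msmarco_predictions_dev.tsv` is an assumption, so adjust it to whatever appears in your output directory).

```python
# Hedged sketch: compute MRR@10 from an MS MARCO-style run file plus qrels.dev.tsv.
# Assumed formats: run lines are <query_id>\t<passage_id>\t<rank>,
# qrels lines are <query_id>\t0\t<passage_id>\t1.
import collections

DATA_DIR = "./data"
RUN_PATH = f"{DATA_DIR}/output/msmarco_predictions_dev.tsv"  # assumed file name
QRELS_PATH = f"{DATA_DIR}/qrels.dev.tsv"

relevant = collections.defaultdict(set)
with open(QRELS_PATH) as f:
    for line in f:
        qid, _, pid, _ = line.rstrip("\n").split("\t")
        relevant[qid].add(pid)

reciprocal_ranks = {}
run_queries = set()
with open(RUN_PATH) as f:
    for line in f:
        qid, pid, rank = line.rstrip("\n").split("\t")
        run_queries.add(qid)
        if qid not in reciprocal_ranks and int(rank) <= 10 and pid in relevant[qid]:
            reciprocal_ranks[qid] = 1.0 / int(rank)

# Note: this averages over the queries present in the run file; the official
# script averages over the judged queries, so pair the run with matching qrels.
mrr10 = sum(reciprocal_ranks.values()) / max(len(run_queries), 1)
print(f"MRR@10: {mrr10:.4f} over {len(run_queries)} queries")
```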
You can download our BERT Large model trained on TREC-CAR here. We also made available a BERT Large model pretrained on the training set of TREC-CAR. This pretraining was necessary because the official pre-trained BERT models were pre-trained on the full Wikipedia, and have therefore seen, although in an unsupervised way, Wikipedia documents that are used in the test set of TREC-CAR. To avoid this leakage of test data into training, we pre-trained the BERT re-ranker only on the half of Wikipedia used by TREC-CAR’s training set.
To cite this work, please use:

@article{nogueira2019passage,
  title={Passage Re-ranking with BERT},
  author={Nogueira, Rodrigo and Cho, Kyunghyun},
  journal={arXiv preprint arXiv:1901.04085},
  year={2019}
}