**Most of the code in this repository was copied from the original BERT repository.**
This repository contains the code to reproduce our entry to the MS MARCO passage ranking task, which placed first by a large margin over the second-place entry.
MSMARCO Passage Re-Ranking Leaderboard (Jan 8th 2019) | Eval MRR@10 | Dev MRR@10
--- | --- | ---
1st Place - BERT (this code) | 35.87 | 36.53
2nd Place - IRNet | 28.06 | 27.80
3rd Place - Conv-KNRM | 27.12 | 29.02
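For reference, MRR@10 (the metric in the table above) is the mean reciprocal rank of the highest-ranked relevant passage, counting only the top 10 results:

$$\mathrm{MRR@10} = \frac{1}{|Q|}\sum_{q \in Q} \begin{cases}1/\mathrm{rank}_q & \text{if } \mathrm{rank}_q \le 10\\ 0 & \text{otherwise}\end{cases}$$

where rank_q is the rank of the first relevant passage returned for query q (queries with no relevant passage in the top 10 contribute 0).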
The paper describing our implementation, [Passage Re-ranking with BERT](https://arxiv.org/abs/1901.04085), is available on arXiv; see the citation at the end of this README.
First, we need to download and extract MS MARCO and BERT files:
DATA_DIR=./data
mkdir ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.eval.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/qrels.dev.tsv -P ${DATA_DIR}
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip -P ${DATA_DIR}
tar -xvf ${DATA_DIR}/triples.train.small.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.dev.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.eval.tar.gz -C ${DATA_DIR}
unzip ${DATA_DIR}/uncased_L-24_H-1024_A-16.zip -d ${DATA_DIR}
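If you want to sanity-check the downloads before converting them, the extracted files are plain TSVs. The column layouts in the comments below are our understanding of the MS MARCO distributions, so verify them against your copies; a minimal sketch, assuming `DATA_DIR=./data` as above:

```python
# Quick sanity check of the extracted MS MARCO files (paths assume DATA_DIR=./data).
# Expected layouts (verify against your download):
#   triples.train.small.tsv: <query>\t<relevant passage>\t<non-relevant passage>
#   top1000.dev.tsv:         <query_id>\t<passage_id>\t<query>\t<passage>
#   qrels.dev.tsv:           <query_id>\t0\t<passage_id>\t1
DATA_DIR = "./data"

with open(f"{DATA_DIR}/triples.train.small.tsv", encoding="utf-8") as f:
    query, positive, negative = f.readline().rstrip("\n").split("\t")

print("query:   ", query)
print("positive:", positive[:80] + "...")
print("negative:", negative[:80] + "...")
```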
Next, we need to convert the MS MARCO train, dev, and eval files to tfrecord files, which will later be consumed by BERT.
mkdir ${DATA_DIR}/tfrecord
python convert_msmarco_to_tfrecord.py \
--tfrecord_folder=${DATA_DIR}/tfrecord \
--vocab_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/vocab.txt \
--train_dataset_path=${DATA_DIR}/triples.train.small.tsv \
--dev_dataset_path=${DATA_DIR}/top1000.dev.tsv \
--eval_dataset_path=${DATA_DIR}/top1000.eval.tsv \
--dev_qrels_path=${DATA_DIR}/qrels.dev.tsv \
--max_query_length=64 \
--max_seq_length=512 \
--num_eval_docs=1000
This conversion takes 30-40 hours. Alternatively, you can download the tfrecord files here (~23GB).
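For intuition, the conversion roughly amounts to tokenizing each query-passage pair with BERT's WordPiece vocabulary, truncating to `max_query_length` and `max_seq_length`, and serializing the token ids into `tf.train.Example` records. The sketch below illustrates that idea using the `tokenization` module from the original BERT repository; the feature names and truncation details in the real `convert_msmarco_to_tfrecord.py` may differ.

```python
# Illustrative sketch only (not the actual conversion script): tokenize a
# query-passage pair and pack the ids into a tf.train.Example.
import tensorflow as tf
import tokenization  # tokenization.py from the original BERT repository

tokenizer = tokenization.FullTokenizer(
    vocab_file="./data/uncased_L-24_H-1024_A-16/vocab.txt", do_lower_case=True)

def to_tf_example(query, passage, label, max_query_length=64, max_seq_length=512):
    query_ids = tokenizer.convert_tokens_to_ids(
        tokenizer.tokenize(query))[:max_query_length]
    passage_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(passage))
    # Keep room for the query; the real pipeline also reserves space for the
    # [CLS]/[SEP] tokens that are added when the pair is fed to BERT.
    passage_ids = passage_ids[:max_seq_length - len(query_ids) - 3]
    feature = {
        "query_ids": tf.train.Feature(
            int64_list=tf.train.Int64List(value=query_ids)),
        "doc_ids": tf.train.Feature(
            int64_list=tf.train.Int64List(value=passage_ids)),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
```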
We can now start training. We highly recommend using TPUs, which are available for free in Google Colab; a modern V100 GPU with 16GB of memory cannot fit even a small batch size of 2 when training a BERT Large model.
If you opt not to use Colab, here is the command line to start training:
python run.py \
--data_dir=${DATA_DIR}/tfrecord \
--bert_config_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_config.json \
--init_checkpoint=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_model.ckpt \
--output_dir=${DATA_DIR}/output \
--msmarco_output=True \
--do_train=True \
--do_eval=True \
--num_train_steps=400000 \
--num_warmup_steps=40000 \
--train_batch_size=32 \
--eval_batch_size=32 \
--learning_rate=1e-6
Training for 400k iterations takes approximately 70 hours on a TPU v2. Alternatively, you can download the trained model used in our submission here (~3.4GB).
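Once the dev run finishes, `--msmarco_output=True` writes an MS MARCO-style run file whose lines we understand to be `<query_id>\t<passage_id>\t<rank>`. You can score it with the official MS MARCO evaluation script; the sketch below computes MRR@10 the same way under those format assumptions (the output file name `msmarco_predictions_dev.tsv` is an assumption, so adjust it to whatever appears in your output directory).

```python
# Hedged sketch: compute MRR@10 from an MS MARCO-style run file plus qrels.dev.tsv.
# Assumed formats: run lines are <query_id>\t<passage_id>\t<rank>,
# qrels lines are <query_id>\t0\t<passage_id>\t1.
import collections

DATA_DIR = "./data"
RUN_PATH = f"{DATA_DIR}/output/msmarco_predictions_dev.tsv"  # assumed file name
QRELS_PATH = f"{DATA_DIR}/qrels.dev.tsv"

relevant = collections.defaultdict(set)
with open(QRELS_PATH) as f:
    for line in f:
        qid, _, pid, _ = line.rstrip("\n").split("\t")
        relevant[qid].add(pid)

reciprocal_ranks = {}
run_queries = set()
with open(RUN_PATH) as f:
    for line in f:
        qid, pid, rank = line.rstrip("\n").split("\t")
        run_queries.add(qid)
        if qid not in reciprocal_ranks and int(rank) <= 10 and pid in relevant[qid]:
            reciprocal_ranks[qid] = 1.0 / int(rank)

# Note: this averages over the queries present in the run file; the official
# script averages over the judged queries, so pair the run with matching qrels.
mrr10 = sum(reciprocal_ranks.values()) / max(len(run_queries), 1)
print(f"MRR@10: {mrr10:.4f} over {len(run_queries)} queries")
```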
You can download our BERT Large model trained on TREC-CAR here. We also made available a BERT Large model pretrained on the training set of TREC-CAR. This pretraining was necessary because the official pre-trained BERT models were pre-trained on the full Wikipedia, and have therefore seen, although in an unsupervised way, Wikipedia documents that are used in the test set of TREC-CAR. To avoid this leakage of test data into training, we pre-trained the BERT re-ranker only on the half of Wikipedia used by TREC-CAR’s training set.
To cite this work, please use:

@article{nogueira2019passage,
  title={Passage Re-ranking with BERT},
  author={Nogueira, Rodrigo and Cho, Kyunghyun},
  journal={arXiv preprint arXiv:1901.04085},
  year={2019}
}