This repository contains the code to reproduce the results in the paper: Comparing Score Aggregation Approaches for Pretrained Neural Language Models.
Note that because we rely heavily on the ad-hoc retrieval framework Capreolus, the modules in this repo are mostly extensions of that framework (under ./capreolus_extensions) and do not contain the data processing and training logic.
Please see the framework's GitHub repository if you are interested in those details.
The hyperparameters are listed under ./optimal_configs/maxp.txt, with one config_key=config_value pair per line.
Feel free to try other settings using this format. Note that lines starting with # are treated as comments and ignored by the program.
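To make the file format concrete, here is a minimal sketch of a reader for it, assuming only what is described above: one config_key=config_value pair per line, with lines starting with # ignored. The function name `parse_config_lines` and the keys in the example are hypothetical; this is not the parser the program itself uses.

```python
from typing import Dict, Iterable


def parse_config_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse config_key=config_value lines, skipping blank lines and # comments."""
    config: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected config_key=config_value, got: {line!r}")
        config[key.strip()] = value.strip()
    return config


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    example = ["# a comment", "some.key=1e-5", "", "another.key=max"]
    print(parse_config_lines(example))
```

Values are kept as strings here; converting them to numbers or booleans is left to whatever consumes the config.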
For the config key format and acceptable values, please find more details here.
Note: We found a bug in our code that slightly affected the evaluation results. The bug is now fixed, but as a result the replicated scores drop by around 0.01. We will correct the numbers in the tables before publication.
# install capreolus
pip install git+
# designate the directory under which the cache and results will be stored
If you use wandb, simply install wandb and log in the standard way:
pip install wandb
wandb login
The results can then be synced to your project by adding the argument --project_name your_wandb_project_name
You should see all configs and the values of the metrics mAP, P@20, and nDCG@20.
The code is written in TensorFlow 2.3 and supports GPU and TPU v2/v3 (through Capreolus).
This section provides the code to replicate all the experiments listed in the paper, which can also be found under ./scripts.
The following script is used to train the experiments with the sampled dataset.
fold=s1 # s1-s5 for robust04, s1-s3 for gov2
dataset=rob04 # supports rob04 or gov2
# gov2 is not a public dataset, so the collection and the built index need to be available locally; see the "Misc" section below
do_train=True # if False, training will be skipped. Acceptable if training results are already available
do_eval=True # if False, evaluation will be skipped. You can review it later using "--do_train=False --do_eval=True"
query_type=title # title or desc
aggregation=max # max, avg, first or sum
python \
--task bertology \
--dataset $dataset \
--query_type $query_type \
--fold $fold \
--aggregation $aggregation \
--train $do_train --eval $do_eval
Once the results for all folds are available, the program also reports cross-validated results during the evaluation stage.
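The four values accepted by the aggregation option in the script above (max, avg, first, sum) can be illustrated with a small sketch. It assumes each document has already been scored as a non-empty list of passage-level relevance scores, in passage order; the names `AGGREGATORS` and `aggregate` are hypothetical and are not taken from the repository or from Capreolus.

```python
from typing import Callable, Dict, Sequence

# Map each aggregation name to a function over a document's passage scores.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "max": max,                            # best-scoring passage
    "avg": lambda s: sum(s) / len(s),      # mean over all passages
    "first": lambda s: s[0],               # score of the first passage only
    "sum": sum,                            # total over all passages
}


def aggregate(passage_scores: Sequence[float], method: str) -> float:
    """Combine passage-level scores into a single document-level score."""
    if not passage_scores:
        raise ValueError("a document needs at least one passage score")
    if method not in AGGREGATORS:
        raise ValueError(f"unknown aggregation {method!r}; use one of {sorted(AGGREGATORS)}")
    return AGGREGATORS[method](passage_scores)


if __name__ == "__main__":
    scores = [0.25, 0.5, 0.75]
    for name in ("max", "avg", "first", "sum"):
        print(name, aggregate(scores, name))
```

Note that sum grows with the number of passages while avg does not, which is one reason the choice of aggregation can change the ranking of long documents.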
fold=s1 # s1-s5 for robust04, s1-s3 for gov2
dataset=rob04 # supports rob04 or gov2
# gov2 is not a public dataset, so the collection and the built index need to be available locally; see the "Misc" section below
do_train=True # if False, training will be skipped. Acceptable if training results are already available
do_eval=True # if False, evaluation will be skipped. You can review it later using "--do_train=False --do_eval=True"
query_type=title # title or desc
pretrained=bert-base-uncased # one of electra-base-msmarco google/electra-base-discriminator bert-base-uncased albert-base-v2 roberta-base bert-large-uncased
aggregation=max # max, avg, first or sum
python \
--task bertology \
--dataset $dataset \
--pretrained $pretrained \
--query_type $query_type \
--fold $fold \
--train $do_train --eval $do_eval
fold=s1 # s1-s5 for Robust04, s1 to s3 for GOV2
dataset=rob04 # supports rob04 or gov2
# gov2 is not a public dataset, so the collection and the built index need to be available locally; see the "Misc" section below
do_train=True # if False, training will be skipped. Acceptable if training results are already available
do_eval=True # if False, evaluation will be skipped. You can review it later using "--do_train=False --do_eval=True"
rate=1.0 # supports (0.0, 1.0], where 1.0 (default) means no sampling will be done
python \
--task sampling \
--dataset $dataset \
--sampling_rate $rate \
--fold $fold \
--train $do_train --eval $do_eval
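The accepted range of the sampling rate, (0.0, 1.0] with 1.0 meaning no sampling, can be sketched as follows. What exactly the sampling task subsamples is not stated here, so this example simply assumes a uniform random subsample of a list of training items; the function `subsample` and its `seed` parameter are hypothetical and do not reflect the repository's implementation.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(items: Sequence[T], rate: float = 1.0, seed: int = 0) -> List[T]:
    """Return a uniform random subsample containing roughly rate * len(items) items."""
    if not 0.0 < rate <= 1.0:
        raise ValueError(f"rate must be in (0.0, 1.0], got {rate}")
    if rate == 1.0:
        return list(items)  # 1.0 means no sampling is done
    # Keep at least one item for non-empty input, never more than is available.
    k = min(len(items), max(1, round(len(items) * rate)))
    return random.Random(seed).sample(list(items), k)


if __name__ == "__main__":
    print(len(subsample(list(range(100)), rate=0.3)))
```

Fixing the seed keeps the subsample identical across runs, which matters when results from different folds are compared.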
If TPU is available, append the following arguments to the above scripts to run the experiments on TPU:
--tpu your_tpu_name --tpuzone your_tpu_zone (e.g. us-central1-f) --gs_bucket gs://your_gs_bucket_path
As GOV2 is not a public dataset, users need to prepare the collection themselves before running experiments on GOV2.
Once the collection is prepared, specify the GOV2 directory through --gov2_path /path/to/GOV2, where ls /path/to/GOV2 should list the subdirectories GX000 to GX272.
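Before launching a GOV2 run, the expected directory layout can be checked with a short script. This is a minimal sketch that assumes only what is described above, namely that the GOV2 directory holds subdirectories named GX000 through GX272; the helper `missing_gov2_subdirs` is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List


def missing_gov2_subdirs(gov2_path: str, first: int = 0, last: int = 272) -> List[str]:
    """Return the expected GXnnn subdirectory names that are absent under gov2_path."""
    root = Path(gov2_path)
    expected = [f"GX{i:03d}" for i in range(first, last + 1)]
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_gov2_subdirs("/path/to/GOV2")
    if missing:
        print(f"{len(missing)} subdirectories missing, e.g. {missing[0]}")
    else:
        print("GOV2 directory layout looks complete")
```

The check only looks at directory names; it does not verify the contents of each subdirectory or the built index.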