The FEVER2.0 shared task requires builders to submit Docker images (via Docker Hub) as part of the competition to allow for adversarial evaluation. Images must contain a single script that makes predictions on a given input file using the model, and must host a web server (by installing the fever-api pip package) to allow for interactive evaluation during the breaker phase of the competition.
This repository contains an example submission based on an AllenNLP implementation of the system (see fever-allennlp). It can be run with the following commands. The first command creates a dummy container holding the shared FEVER data that is used by the submission.
```bash
# Set up the data container (run once, on first use)
docker create --name fever-common feverai/common

# Start a server for interactive querying of the FEVER system via the web API on port 5000
docker run --rm --volumes-from fever-common:ro -e CUDA_DEVICE=-1 -p 5000:5000 feverai/sample

# Alternatively, make predictions on a batch file and write them to `/out/predictions.jsonl`
# (set CUDA_DEVICE as appropriate)
docker run --rm --volumes-from fever-common:ro -e CUDA_DEVICE=-1 -v $(pwd):/out feverai/sample \
    ./predict.sh /local/fever-common/data/fever-data/paper_dev.jsonl /out/predictions.jsonl
```
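If a GPU is available to the container, the batch command can target it; this variant is a sketch assuming a recent Docker with the NVIDIA container toolkit installed:

```bash
# Hypothetical GPU variant (assumes the NVIDIA container toolkit is installed)
docker run --rm --gpus all --volumes-from fever-common:ro -e CUDA_DEVICE=0 \
    -v $(pwd):/out feverai/sample ./predict.sh \
    /local/fever-common/data/fever-data/paper_dev.jsonl /out/predictions.jsonl
```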
The FEVER2.0 submissions will be run in a shared environment where resources will be moderated. We urge participants to ensure that these shared resources are respected. TensorFlow users are asked to implement per-process GPU memory limits (see this post). We will set an environment variable $TF_GPU_MEMORY_FRACTION that will be tweaked for all systems in phase 2 of the shared task.
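For TensorFlow 1.x users, a minimal sketch of honouring this variable might look as follows; only the $TF_GPU_MEMORY_FRACTION name comes from this document, the rest is illustrative:

```python
import os
import tensorflow as tf

# Read the fraction set by the organisers; fall back to the full GPU if unset
# (the default value here is an assumption, not part of the task description).
fraction = float(os.environ.get("TF_GPU_MEMORY_FRACTION", "1.0"))

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = fraction
session = tf.Session(config=config)
```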
The prediction script should take two parameters as input: the path to the input file to be predicted and the path to the output file to be scored. An optional CUDA_DEVICE environment variable may be set:
```bash
#!/usr/bin/env bash
default_cuda_device=0
root_dir=/local/fever-common

# Step 1: evidence retrieval - select candidate pages and sentences with the TF-IDF index
python -m fever.evidence.retrieve \
    --index $root_dir/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz \
    --database $root_dir/data/fever/fever.db \
    --in-file $1 \
    --out-file /tmp/ir.$(basename $1) \
    --max-page 5 \
    --max-sent 5

# Step 2: claim verification - label each claim against the retrieved evidence
python -m allennlp.run predict \
    https://jamesthorne.co.uk/fever/fever-da.tar.gz \
    /tmp/ir.$(basename $1) \
    --output-file /tmp/labels.$(basename $1) \
    --predictor fever \
    --include-package fever.reader \
    --cuda-device ${CUDA_DEVICE:-$default_cuda_device} \
    --silent

# Step 3: merge the predicted labels and evidence into the submission format
python -m fever.submission.prepare \
    --predicted_labels /tmp/labels.$(basename $1) \
    --predicted_evidence /tmp/ir.$(basename $1) \
    --out_file $2
```
The submission must run a Flask web server to allow for interactive evaluation. In our application, the entrypoint is a function called `my_sample_fever` in the module `sample_application` (see sample_application.py). The `my_sample_fever` function is a factory that returns a `fever_web_api` object.
```python
from fever.api.web_server import fever_web_api

def my_sample_fever(*args):
    # Set up and initialize model
    ...

    # A prediction function that is called by the API
    def baseline_predict(instances):
        predictions = []
        for instance in instances:
            predictions.append(...)  # prediction for instance
        return predictions

    return fever_web_api(baseline_predict)
```
Your Dockerfile can then use the waitress-serve command as the entrypoint. This starts a WSGI server that calls your factory method:
CMD ["waitress-serve", "--host=0.0.0.0", "--port=5000", "--call", "sample_application:my_sample_fever"]
The web server is managed by the fever-api package. No setup or modification is required by participants. We use the default Flask port of 5000 and host a single endpoint on /predict. We recommend using a client such as Postman to test your application:
```http
POST /predict HTTP/1.1
Host: localhost:5000
Content-Type: application/json

{
    "instances": [
        {"id": 0, "claim": "this is a test claim"},
        {"id": 1, "claim": "this is another test claim"}
    ]
}
```
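If you prefer the command line to Postman, an equivalent request can be sent with curl (the claim text here is illustrative):

```bash
curl -X POST http://localhost:5000/predict \
    -H "Content-Type: application/json" \
    -d '{"instances": [{"id": 0, "claim": "this is a test claim"}]}'
```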
In our sample submission, we present a simple `baseline_predict` method.
```python
def baseline_predict(instances):
    predictions = []
    for instance in instances:
        # ... make a prediction for the instance ...
        predictions.append({"predicted_label": "SUPPORTS",
                            "predicted_evidence": [("Paris", 0), ("Paris", 5)]})
    return predictions
```
Inputs:
- `instances` - a list of dictionaries, each containing a `claim`

Outputs:
- a list of dictionaries containing `predicted_label` (a string: SUPPORTS, REFUTES, or NOT ENOUGH INFO) and `predicted_evidence` (a list of `(page_name, line_number)` pairs, as defined in fever-scorer)
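For concreteness, a single prediction serialised as a JSON line might look like this; the values are illustrative:

```json
{"predicted_label": "SUPPORTS", "predicted_evidence": [["Paris", 0], ["Paris", 5]]}
```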
We provide common data (the Wikipedia parse and the preprocessed data associated with the first FEVER challenge) that will be mounted in /local/fever-common. It contains the following files (see fever.ai/resources.html for more info):
```text
# Dataset
/local/fever-common/data/fever-data/train.jsonl
/local/fever-common/data/fever-data/paper_dev.jsonl
/local/fever-common/data/fever-data/paper_test.jsonl
/local/fever-common/data/fever-data/shared_task_dev.jsonl
/local/fever-common/data/fever-data/shared_task_test.jsonl

# Preprocessed Wikipedia Dump
/local/fever-common/data/fever/fever.db

# Wikipedia TF-IDF Index
/local/fever-common/data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz

# Preprocessed Wikipedia Pages (Alternative Format)
/local/fever-common/data/wiki-pages/wiki-000.jsonl
...
/local/fever-common/data/wiki-pages/wiki-109.jsonl
```
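As a quick sanity check that the common volume is mounted, the JSONL datasets can be iterated line by line; a minimal sketch, assuming the standard FEVER `id` and `claim` fields:

```python
import json

# Iterate over the claims in the mounted paper_dev split
with open("/local/fever-common/data/fever-data/paper_dev.jsonl") as f:
    for line in f:
        instance = json.loads(line)
        print(instance["id"], instance["claim"])
```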