In this repository you will find the code for the TensorFlow 2.0 Question Answering Challenge. My team finished as 2nd / 1239 with a micro F1-score of 0.71
on the private test set. The challenge was to develop better algorithms for natural question answering. You can can find more details about the task and the dataset in these resources:
- https://www.kaggle.com/c/tensorflow2-question-answering/overview
- https://www.kaggle.com/c/tensorflow2-question-answering/discussion/127333
- https://storage.googleapis.com/pub-tools-public-publication-data/pdf/1f7b46b5378d757553d3e92ead36bda2e4254244.pdf
- https://arxiv.org/abs/1901.08634
Important: This solution uses cloud instances. Don't forget to stop / delete them if you don't need these instances anymore. Unwanted costs might occur otherwise.
-
GUI: To stop / delete your VMs go to: https://console.cloud.google.com ->
Compute Engine
->VM instances
. To stop / delete your TPU instances go to https://console.cloud.google.com ->Compute Engine
->TPUs
. -
CLI: To stop your VMs:
sudo shutdown -h now
. To stop your TPU instances:gcloud compute tpus stop nq2 --zone europe-west4-a
You can always check your running instances via: gcloud compute instances list
or using the console GUI.
From a machine that has gcloud
installed:
# Download the script to start the instances
wget https://raw.githubusercontent.com/see--/natural-question-answering/master/start_instance.py
python3 start_instance.py
This will create a VM named "nq2" with Python-3.6.9
and all the required libraries installed. It will also create a "v3-8" TPU in the zone "europe-west4-a" with the same name.
For the following commands, please ssh into your VM. You can get the command from https://console.cloud.google.com -> Compute Engine
-> VM instances
-> Connect
-> View gcloud command
.
It should be similar to:
gcloud beta compute --project "MYPROJECT" ssh --zone "europe-west4-a" "nq2"
sudo pip install --upgrade kaggle
mkdir .kaggle
# Replace "MYUSER" and "MYKEY" with your credentials. You can create them on:
# `https://www.kaggle.com` -> `My Account` -> `Create New API Token`
echo '{"username":"MYUSER","key":"MYKEY"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
kaggle competitions download -c tensorflow2-question-answering
for f in *.zip; do unzip $f; done
rm *.zip
# get the code
git clone https://github.com/see--/natural-question-answering.git
# install `transformers`
cd natural-question-answering/transformers_repo
sudo python3 setup.py install
cd ..
# move the data to the root directory
mv ../*.jsonl .
# run the training and evaluation
python3 nq_to_squad.py; python3 train_eval.py
# run inference on the test set
python3 nq_to_squad.py --fn simplified-nq-test.jsonl; python3 train_eval.py --do_not_train --predict_fn nq-test-v1.0.1.json
The training and evaluation should finish within 5 hours and you should get a local validation score of ~0.72
and public LB of ~0.73
. The weights are stored in nq_bert_uncased_0/checkpoint-015400/weights.h5
. For inference please check my Kaggle Notebook.
The trained model can be loaded from TensorFlow Hub. A simple usage example is given in the the demo script:
import os
import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer
def main():
os.system('wget https://github.com/see--/natural-question-answering/releases/download/v0.0.1/tokenizer_tf2_qa.zip')
os.system('unzip tokenizer_tf2_qa.zip')
tokenizer = BertTokenizer.from_pretrained('tokenizer_tf2_qa')
model = hub.load(
'https://github.com/see--/natural-question-answering/releases/download/v0.0.1/model.tar.gz')
questions = [
'How long did it take to find the answer?',
'What\'s the answer to the great question?',
'What\'s the name of the computer?']
paragraph = '''<p>The computer is named Deep Thought.</p>.
<p>After 46 million years of training it found the answer.</p>
<p>However, nobody was amazed. The answer was 42.</p>'''
for question in questions:
question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)
tokens = ['[CLS]'] + question_tokens + ['[SEP]'] + paragraph_tokens + ['[SEP]']
input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_word_ids)
input_type_ids = [0] * (1 + len(question_tokens) + 1) + [1] * (len(paragraph_tokens) + 1)
input_word_ids, input_mask, input_type_ids = map(lambda t: tf.expand_dims(
tf.convert_to_tensor(t, dtype=tf.int32), 0), (input_word_ids, input_mask, input_type_ids))
outputs = model([input_word_ids, input_mask, input_type_ids])
# using `[1:]` will enforce an answer. `outputs[:][0][0]` is the ignored '[CLS]' token logit.
short_start = tf.argmax(outputs[0][0][1:]) + 1
short_end = tf.argmax(outputs[1][0][1:]) + 1
answer_tokens = tokens[short_start: short_end + 1]
answer = tokenizer.convert_tokens_to_string(answer_tokens)
print(f'Question: {question}')
print(f'Answer: {answer}')
if __name__ == '__main__':
main()
Parts of this solution are copied or modified from the following repositories:
- https://github.com/huggingface/transformers
- https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/fast-and-lean-data-science
Thanks to the authors!