This repository contains the code implementation for SPL, as described in:
Localizing Open-Ontology QA Semantic Parsers in a Day Using Machine Translation
Mehrad Moradshahi, Giovanni Campagna, Sina J. Semnani, Silei Xu, Monica S. Lam
Semantic Parser Localizer (SPL) is a toolkit that leverages Neural Machine Translation (NMT) systems to localize a semantic parser for a new language. Our methodology is to (1) generate training data automatically in the target language by augmenting machine-translated datasets with local entities scraped from public websites, (2) add a few-shot boost of human-translated sentences and train a novel XLMR-LSTM semantic parser, and (3) test the model on natural utterances curated using human translators.
We assess the effectiveness of our approach by extending the current capabilities of Schema2QA, a system for English Question Answering (QA) on the open web, to 10 new languages for the restaurants and hotels domains. We show our approach outperforms the previous state-of-the-art methodology by more than 30% for hotels and 40% for restaurants with localized ontologies.
Our methodology enables any software developer to add a new language capability to a QA system for a new domain, leveraging machine translation, in less than 24 hours.
This plot shows the data generation pipeline used to produce train and validation splits in a new language such as Italian. Given an input sentence in English and its annotation in the formal ThingTalk query language, SPL generates multiple examples in the target language with localized entities.
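As a toy illustration of the final step of this pipeline, the sketch below expands one translated sentence and its annotation into several examples by substituting local entities. It is not SPL's implementation: the `localize` helper, the `CUISINE_0` placeholder convention, the simplified annotation string, and the sample values are all hypothetical.

```python
from typing import List, Tuple


def localize(sentence: str, program: str, slot: str,
             entities: List[str]) -> List[Tuple[str, str]]:
    """Return one (sentence, program) pair per local entity.

    `slot` is a hypothetical placeholder that marks the same entity in both
    the translated sentence and its annotation.
    """
    if slot not in sentence or slot not in program:
        raise ValueError(f"placeholder {slot!r} must appear in both inputs")
    return [(sentence.replace(slot, e), program.replace(slot, e))
            for e in entities]


# Illustrative inputs only; the annotation is a simplified stand-in for ThingTalk.
examples = localize(
    sentence="mostrami ristoranti che servono CUISINE_0",
    program='filter servesCuisine =~ " CUISINE_0 "',
    slot="CUISINE_0",
    entities=["pizza", "risotto"],
)
for localized_sentence, localized_program in examples:
    print(localized_sentence, "|", localized_program)
```

Keeping the sentence and the annotation in one substitution step means the two stay in sync, which is the property the generated training pairs rely on.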
Exact match accuracy for 10 languages (Restaurants domain)
- Clone the following repositories into your desired folder ($SRC_DIR):
cd ${SRC_DIR}
git clone https://github.com/Mehrad0711/SPL.git
git clone https://github.com/stanford-oval/genienlp.git
git clone https://github.com/stanford-oval/genie-toolkit.git
- Follow the installation guide in the genie-toolkit repo to install the dependencies.
- For each repository, the following commit hashes should be used to reproduce the results:
- genienlp (c6ffb08742fed0c414d6ffc5eeae679cabdb20ff)
- genie-toolkit (7a74010f8c51c8b0dc1c7f5e604a8af742b00a29)
- genie-k8s (fb6a27a8945a3a43a563702f7809669f802a071f)
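One way to pin the repositories to these commits is sketched below. It assumes every repository, including genie-k8s, has been cloned directly under the same source directory; the helper names `checkout_commands` and `pin_all` are made up for this example.

```python
import subprocess
from pathlib import Path
from typing import Dict, List

# Commit hashes copied from the list above.
PINNED_COMMITS: Dict[str, str] = {
    "genienlp": "c6ffb08742fed0c414d6ffc5eeae679cabdb20ff",
    "genie-toolkit": "7a74010f8c51c8b0dc1c7f5e604a8af742b00a29",
    "genie-k8s": "fb6a27a8945a3a43a563702f7809669f802a071f",
}


def checkout_commands(src_dir: str) -> List[List[str]]:
    """Build one `git checkout` command per pinned repository."""
    root = Path(src_dir)
    return [
        ["git", "-C", str(root / repo), "checkout", commit]
        for repo, commit in PINNED_COMMITS.items()
    ]


def pin_all(src_dir: str) -> None:
    """Run the checkouts; raises CalledProcessError if any of them fails."""
    for command in checkout_commands(src_dir):
        subprocess.run(command, check=True)


# Usage (assumes the clones already exist under this directory):
# pin_all("/path/to/src")
```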
- If you want to translate into a new language, you need access to a Google Cloud account with permission to use the Cloud Translation API. Please look here for a quick setup. You need to create a project and keep your credential file for authentication.
- Navigate to the SPL directory:
cd $SRC_DIR/SPL
- Open project_config.mk and set the paths to the locations where you have stored the repositories and files. You also need to provide the project_id of the Google Cloud project you set up above. This file contains the main configuration for running the experiments.
- The dataset directory should contain the splits (train/eval/test) of the original English dataset for each domain.
- Clean the current directory:
make deepclean
- Collect parameters and entities for your desired language: in project.config, set $(experiment)_$(language)_init_url, $(experiment)_$(language)_base_url, and $(experiment)_$(language)_url_pattern to the websites you wish to crawl for their schemas. Then run the following commands, which produce the raw schema in JSON format. (Run them for both English and the foreign language.)
make experiment=${experiment} crawl_target_size=100 schema_crawl_en
make experiment=${experiment} crawl_target_size=100 schema_crawl_${language}
- Create parameter-datasets for each experiment:
make experiment=${experiment} subset_languages=en ${experiment}/parameter-datasets.tsv
make -B experiment=${experiment} subset_languages=${language} ${experiment}/parameter-datasets.tsv
- Process English dataset and prepare data splits to be translated:
make -B process_data
This creates an en folder, performs multiple transformations on the splits of the original dataset, and stores the output files in the corresponding subfolders.
- You now need to upload your dataset to a Google Cloud Storage bucket. You can create one in your Google Cloud console if you don't have one already. You may set your project_id, project_number, and credential_file as defaults in scripts/translate_v3.py so that you don't have to pass them every time you call translate_v3.py.
Please follow these guidelines for bucket naming.
make input_bucket=${my_bucket} experiment=${experiment} ${experiment}/upload_data
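Before uploading, it can help to sanity-check the bucket name locally. The sketch below covers only a subset of the Cloud Storage naming rules as we understand them (lowercase letters, digits, dashes, underscores, and dots; 3 to 63 characters; a letter or digit at both ends; no "goog" prefix or "google" substring). The function name is hypothetical, and the official guidelines remain the authority, for example on dotted names, which may be longer than this check allows.

```python
import re

# 1 leading + 1..61 middle + 1 trailing character = 3..63 characters in total.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Apply a conservative subset of the bucket naming rules."""
    if _BUCKET_NAME.fullmatch(name) is None:
        return False
    if name.startswith("goog") or "google" in name:
        return False
    return True


for candidate in ["spl-translation-data", "My_Bucket", "ab", "goog-data"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```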
- If you want to use a glossary for translation, either set glossary_type to manual and put your glossary file in $(dataset_folder)/extras/, or set glossary_type to default. Default mode creates a glossary file automatically from your input data by extracting the entities and using the same token for all languages.
make glossary_type=${glossary_type} all_languages='${languages}' experiment=${experiment} ${experiment}/upload_glossary
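To make the idea behind the default mode concrete, here is a minimal sketch of an identity glossary: every extracted entity token maps to itself in each language, so the translator leaves it untouched. This is an illustration, not the logic in the repository's scripts. The entity pattern (upper-case placeholders such as `QUOTED_STRING_0`), the CSV layout with a header row of language codes, and the function name are assumptions.

```python
import csv
import io
import re
from typing import Iterable, List

# Hypothetical entity pattern: upper-case placeholders such as QUOTED_STRING_0.
ENTITY_RE = re.compile(r"\b[A-Z][A-Z_]*_\d+\b")


def build_identity_glossary(sentences: Iterable[str], languages: List[str]) -> str:
    """Return CSV text in which each entity token is repeated for every language."""
    entities = sorted({token for s in sentences for token in ENTITY_RE.findall(s)})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(languages)  # header row: one column per language code
    for entity in entities:
        writer.writerow([entity] * len(languages))
    return buffer.getvalue()


glossary_csv = build_identity_glossary(
    ["show me restaurants near LOCATION_0", "is QUOTED_STRING_0 open at TIME_0 ?"],
    ["en", "it"],
)
print(glossary_csv)
```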
- Now you are ready for translation:
make experiment=${experiment} input_bucket=${my_bucket} batch_translate_with_glossary_${language}
This creates the ${experiment}/translated/${lang} folder, performs several transformations on the splits of the input dataset, and stores them in the corresponding subfolders.
These instructions are meant to be used with genie-k8s to train and evaluate models on Kubernetes. You may adapt them or use your own scripts for training and evaluation.
- Training:
You can run multiple models on multiple datasets in parallel. First, create a text file containing the hyperparameter values for each experiment.
Put that file in $(multilingualdir)/extras/ and set restaurants_train_args_name to the file name.
Also specify your model name prefix with restaurants_model_prefix and the datasets you want to use for training with restaurants_train_datasets. Finally, you can train each model on all training datasets by running:
make train-all
This command will use genie-k8s repo to run the experiments.
- Evaluation:
After training is done, you can evaluate those models on your desired dev/test datasets by setting restaurants_eval_datasets and running:
make eval-all
This command will run evaluation on both test and dev sets. If you want to run evaluation on only one split, you can do so by running:
make eval-all-${split_set}
- Once all the evaluations are done, you can retrieve the results by running:
make print-eval-results
This will print all the results to the print-eval-results file.
You can download our datasets and pretrained models for both domains by running:
make download_release_files
This will download and unzip the datasets and models to the spl-release directory. After unzipping, you will have one subfolder for each language.
The English dataset contains train and eval splits. The train set contains both synthetic sentences and crowdsourced paraphrases. The eval split contains only crowdsourced examples.
The other languages contain train, eval, eval_mt, and eval_comb splits. Test splits are reserved for benchmarking. The train and eval_mt splits contain only machine-translated sentences from the corresponding splits of the English dataset. We chose ~2/3 of the English eval examples and collected human translations for them. The eval split contains these human-translated examples as well as machine-translated examples for the remaining 1/3 (mainly to keep the size of the eval set the same across experiments and avoid bias). We then combine eval_mt and eval to form eval_comb (this split is used for validation in the paper's experiments). The test dataset is entirely human translated.
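The relationship between the three evaluation splits can be sketched as follows. The function name and the placeholder strings are hypothetical, and the sketch assumes examples are keyed by an id shared between the machine-translated and human-translated sets.

```python
from typing import Dict, List


def build_eval_splits(machine: Dict[str, str],
                      human: Dict[str, str]) -> Dict[str, List[str]]:
    """Compose eval_mt, eval, and eval_comb from per-example translations.

    `machine` holds a machine translation for every eval example;
    `human` holds human translations for the subset that was collected.
    """
    eval_mt = [machine[example_id] for example_id in machine]
    # Prefer the human translation; fall back to MT for the remaining examples.
    eval_split = [human.get(example_id, machine[example_id]) for example_id in machine]
    return {"eval_mt": eval_mt, "eval": eval_split, "eval_comb": eval_mt + eval_split}


splits = build_eval_splits(
    machine={"1": "mt-1", "2": "mt-2", "3": "mt-3"},
    human={"1": "ht-1", "2": "ht-2"},  # roughly 2/3 of the examples
)
print(splits["eval"])
print(len(splits["eval_comb"]))
```

With this construction, eval keeps the same size as eval_mt, and eval_comb is twice that size.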
Please refer to our paper for more details on the dataset and experiments.
When running the code, I get Error: Cannot find module '/Users/Mehrad/Documents/genie-toolkit/tool/genie.js'
- Change 'geniedir' in project.config to point to the directory where you've downloaded the genie-toolkit library.
Google translation process is taking longer than expected.
- You can query its status via HTTP calls. Please see link
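A minimal sketch of such a status query is shown below. It assumes the batch job is a Cloud Translation v3 long-running operation addressed as `projects/.../locations/.../operations/...`, and that you already hold an OAuth access token for your project. The helper names and sample identifiers are hypothetical.

```python
import json
import urllib.request


def operation_status_url(project: str, location: str, operation_id: str) -> str:
    """Build the REST URL for one long-running translation operation."""
    name = f"projects/{project}/locations/{location}/operations/{operation_id}"
    return f"https://translation.googleapis.com/v3/{name}"


def fetch_operation_status(url: str, access_token: str) -> dict:
    """GET the operation resource; needs network access and a valid token."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder identifiers; substitute your own project, location, and operation id.
print(operation_status_url("my-project", "us-central1", "12345"))
```

The returned JSON is expected to carry a `done` field once the operation has finished, so polling `fetch_operation_status` at intervals is one way to wait for the translation.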
The schema samples provided in schema_data_sampled were taken from third-party websites and are copyrighted by their respective authors.
We're releasing this data for non-profit, educational purposes only. Copyright owners can contact us to have it taken down if they so wish.
If you use the software in this repository, please cite:
@inproceedings{moradshahi-etal-2020-localizing,
title = "Localizing Open-Ontology {QA} Semantic Parsers in a Day Using Machine Translation",
author = "Moradshahi, Mehrad and Campagna, Giovanni and Semnani, Sina and Xu, Silei and Lam, Monica",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = November,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.481",
pages = "5970--5983",
}