The repository contains the code needed to reproduce the experiments presented in the EMNLP 2021 paper "Rethinking data augmentation for low-resource neural machine translation: a multi-task learning approach".
Create a Python virtualenv and activate it:
virtualenv -p python3.6 ~/envs/mtl-da
source ~/envs/mtl-da/bin/activate
Clone and init submodules:
git clone https://github.com/transducens/mtl-da-emnlp.git
cd mtl-da-emnlp
git submodule update --init --recursive
Install dependencies:
pip install -r requirements.txt
You can download all the corpora we used in our experiments as follows:
wget http://www.dlsi.ua.es/~vmsanchez/emnlp2021-data.tar.gz
tar xvzf emnlp2021-data.tar.gz
If you are going to add the "replace" or "mono" auxiliary tasks, you will need to install MGIZA++ as follows. You can skip this section if you are not going to produce synthetic data with these auxiliary tasks.
git clone https://github.com/moses-smt/mgiza.git
cd mgiza/mgizapp
mkdir build && cd build
cmake ..
make
ln -s $PWD/../scripts/merge_alignment.py $PWD/bin/merge_alignment.py
cd ../../..
Once finished, export the Bash environment variable MTLDA_MGIZAPP with the path to the MGIZA++ installation directory. The training scripts make use of this environment variable.
export MTLDA_MGIZAPP=$PWD/mgiza/mgizapp/build/bin/
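Before launching a long training run, it can be worth checking that the exported directory really contains the merge_alignment.py link created above. The following is only a minimal sketch: the function name check_mgizapp_dir is hypothetical and is not part of the repository.

```shell
# Hypothetical helper (not part of the repository): checks that a directory
# looks like the MGIZA++ bin directory prepared in the steps above.
# Usage: check_mgizapp_dir "$MTLDA_MGIZAPP" && echo "MGIZA++ looks fine"
check_mgizapp_dir() {
  dir="$1"
  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "Not a directory: '$dir'" >&2
    return 1
  fi
  # -e follows symbolic links, so a dangling link is reported as missing.
  if [ ! -e "$dir/merge_alignment.py" ]; then
    echo "merge_alignment.py not found in $dir" >&2
    return 1
  fi
  return 0
}
```

The function returns a non-zero status and prints a message to standard error when the directory or the link is missing.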
In order to train a baseline system, run the script shown below, where the Bash variables have the following meaning:
- $L1 and $L2: source and target language codes. Use en for English, de for German, he for Hebrew and vi for Vietnamese.
- $PAIR: language pair. We always consider English as the first language of the pair, regardless of whether it acts as the source or the target language. Possible values are en-de, en-he, and en-vi.
- $DIR: path to the directory that will be created during the training process and will contain files with the intermediate steps and results.
- $bpe: number of BPE merge operations. We used 10000 in all the experiments reported in the paper.
- $TRAINSET: training data to use. iwslt contains IWSLT training parallel data, while iwsltbackt also includes backtranslated monolingual English sentences extracted from TED Talks.
./train-baseline.sh $L1 $L2 $DIR $bpe data/$TRAINSET-$PAIR/train data/$TRAINSET-$PAIR/dev data/$TRAINSET-$PAIR/test
You can find the resulting BLEU and chrF++ scores in the file $DIR/eval/report-train.
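As an illustration, the variables could be set as follows for a German-to-English baseline. This is a sketch under assumptions: the output directory name and the pair_for helper are hypothetical, and the command is printed rather than executed so that it can be inspected first.

```shell
# Hypothetical helper: English is always the first language of the pair,
# whether it is the source or the target.
pair_for() {
  if [ "$1" = en ]; then
    echo "en-$2"
  else
    echo "en-$1"
  fi
}

L1=de                            # source language
L2=en                            # target language
PAIR=$(pair_for "$L1" "$L2")     # en-de, even though English is the target
DIR=experiments/baseline-de-en   # hypothetical output directory
bpe=10000                        # value used in the paper
TRAINSET=iwslt

# Print the command instead of running it; drop "echo" to launch training.
echo ./train-baseline.sh "$L1" "$L2" "$DIR" "$bpe" \
  "data/$TRAINSET-$PAIR/train" "data/$TRAINSET-$PAIR/dev" "data/$TRAINSET-$PAIR/test"
```

Note that the data directory is data/iwslt-en-de in both translation directions; only the order of $L1 and $L2 changes.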
By default, GPU 0, as shown by the nvidia-smi command, will be used to train the system. If you want to use another GPU, prepend the string CUDA_VISIBLE_DEVICES=NUM_GPU to the training command, as in the following example:
CUDA_VISIBLE_DEVICES=2 ./train-baseline.sh $L1 $L2 $DIR $bpe data/$TRAINSET-$PAIR/train data/$TRAINSET-$PAIR/dev data/$TRAINSET-$PAIR/test
To train a system with the "reverse" or "source" auxiliary task, run the script shown below. The Bash variables have the same meaning as in the baseline section, plus a new one:
- $AUXTASK: use rev for training with the "reverse" auxiliary task and src for training with the "source" auxiliary task.
./train-mtl1tasks.sh $L1 $L2 $DIR $bpe data/$TRAINSET-$PAIR/train data/$TRAINSET-$PAIR/dev data/$TRAINSET-$PAIR/test $AUXTASK
You can find the resulting BLEU and chrF++ scores in the file $DIR/eval/report-tune. If that file does not exist because BLEU on the development set did not improve during finetuning, the scores can be found in the file $DIR/eval/report-train.
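The lookup described above can be sketched as a small shell function. The name report_for is hypothetical, and the sketch assumes that the fallback report is report-train, the file named in the baseline section.

```shell
# Hypothetical helper: prints the path of the report to read for a run.
# Assumption: when report-tune is missing, the scores are in report-train,
# the report named in the baseline section.
# Usage: cat "$(report_for "$DIR")"
report_for() {
  dir="$1"
  if [ -f "$dir/eval/report-tune" ]; then
    echo "$dir/eval/report-tune"
  elif [ -f "$dir/eval/report-train" ]; then
    echo "$dir/eval/report-train"
  else
    echo "No report found in $dir/eval" >&2
    return 1
  fi
}
```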
The "token" and "swap" auxiliary tasks require an alpha parameter that controls the proportion of the sentence which is modified. This is the meaning of the Bash variables used in the script below:
- $AUXTASK: use wrdp for training with the "token" auxiliary task and swap for training with the "swap" auxiliary task.
- $ALPHA: proportion of the tokens in the target sentence that are modified. The best values are reported in the appendix of the paper.
./train-mtl1tasks.sh $L1 $L2 $DIR $bpe data/$TRAINSET-$PAIR/train data/$TRAINSET-$PAIR/dev data/$TRAINSET-$PAIR/test $AUXTASK $ALPHA
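If you want to try several values of alpha, one option is to generate one command per value, each with its own output directory. The sketch below only prints the commands. The grid of alpha values and the directory naming scheme are hypothetical; the values actually used are reported in the appendix of the paper.

```shell
L1=en; L2=vi; PAIR=en-vi; bpe=10000; TRAINSET=iwslt
AUXTASK=swap
# Hypothetical grid of values, chosen only for illustration.
ALPHAS="0.1 0.3 0.5"

# Prints one training command per alpha value, each with a separate $DIR
# so that runs do not overwrite each other.
sweep_commands() {
  for ALPHA in $ALPHAS; do
    DIR="experiments/$AUXTASK-$L1-$L2-alpha$ALPHA"   # hypothetical naming scheme
    echo ./train-mtl1tasks.sh "$L1" "$L2" "$DIR" "$bpe" \
      "data/$TRAINSET-$PAIR/train" "data/$TRAINSET-$PAIR/dev" \
      "data/$TRAINSET-$PAIR/test" "$AUXTASK" "$ALPHA"
  done
}

sweep_commands
```

Removing echo inside the loop would launch the runs one after another instead of printing them.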
The "replace" task requires word-aligning the training data and extracting a bilingual lexicon from it. In addition to MGIZA++, we will need a working installation of Moses. Please follow the Moses official installation instructions. Once installed, export the Bash environment variable MTLDA_MOSES with the path to the Moses root directory, as in the following example:
export MTLDA_MOSES=/home/myuser/software/mosesdecoder
Once the environment variables MTLDA_MOSES and MTLDA_MGIZAPP have been exported, you can train a system by issuing a command similar to the ones shown for the other auxiliary tasks:
./train-mtl1tasks.sh $L1 $L2 $DIR $bpe data/$TRAINSET-$PAIR/train data/$TRAINSET-$PAIR/dev data/$TRAINSET-$PAIR/test replace $ALPHA
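Since the training scripts read both variables from the environment, a variable that is set but not exported will not reach them. A possible guard is sketched below; require_exported_dirs is a hypothetical name and not part of the repository.

```shell
# Hypothetical helper: fails unless every named variable is exported and
# points to an existing directory.
# Usage: require_exported_dirs MTLDA_MOSES MTLDA_MGIZAPP && ./train-mtl1tasks.sh ...
require_exported_dirs() {
  for name in "$@"; do
    # printenv only sees exported variables, which is what child scripts inherit.
    value=$(printenv "$name") || { echo "$name is not exported" >&2; return 1; }
    [ -d "$value" ] || { echo "$name=$value is not a directory" >&2; return 1; }
  done
}
```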
Instructions for training with the "mono" auxiliary task are coming soon.
To combine several auxiliary tasks, run the same script as in the previous examples and let the variable AUXTASK contain the names of the tasks joined by +. For instance, if you want to train on the combination of the "reverse" and "replace" auxiliary tasks, define AUXTASK as rev+replace, as follows:
./train-mtl1tasks.sh $L1 $L2 $DIR $bpe data/$TRAINSET-$PAIR/train data/$TRAINSET-$PAIR/dev data/$TRAINSET-$PAIR/test rev+replace $ALPHA
Note that, currently, the value of the alpha parameter passed as argument is used for all the tasks. Hence, if you want to combine several tasks that have an alpha parameter (e.g. swap and replace) you cannot define a different value for each task.
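When the list of tasks is built programmatically, the value of AUXTASK can be assembled by joining the task names with +. The join_tasks function below is a hypothetical convenience, not something the repository provides.

```shell
# Hypothetical helper: joins its arguments with "+".
join_tasks() {
  # "$*" joins the arguments with the first character of IFS; the subshell
  # keeps the IFS change local to this function call.
  ( IFS=+; echo "$*" )
}

AUXTASK=$(join_tasks rev replace)
echo "$AUXTASK"
```

The resulting value can then be passed to train-mtl1tasks.sh in place of the literal rev+replace, followed by the single $ALPHA shared by all the tasks.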