This document describes the rough procedure to train an SLTUnet model.
- Get the phoenix2014T dataset from here or using
  `wget https://www-i6.informatik.rwth-aachen.de/ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz`
- Get the MuST-C En-De dataset from FBK; note we used the data in v1.0

We applied tokenization and subword modeling to these datasets. See `preprocess_phoenix.sh` for reference.
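The exact preprocessing is defined by `preprocess_phoenix.sh`. Purely as an illustration of what a tokenization/subword step looks like, the sketch below trains and applies a joint subword model with SentencePiece; the choice of SentencePiece, the file names, and the vocabulary size are assumptions for this sketch, not necessarily what the script uses.

```python
# Illustrative only: a joint BPE subword model over the sign-language translations
# and the MuST-C text. File names and vocab_size are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.de,mustc.de",    # comma-separated training corpora (placeholders)
    model_prefix="subword",       # writes subword.model / subword.vocab
    vocab_size=8000,              # illustrative size, not the paper's setting
    model_type="bpe",
    character_coverage=1.0,
)

# Apply the trained model to one sentence.
sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("am tag regnet es im norden", out_type=str))
```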
We adopt the SMKD method to pretrain sign embeddings and further adapt it for sign language translation; `smkd` contains the adapted source code.

To pretrain SMKD embeddings:
- preprocess the dataset
  `python preprocess/dataset_preprocess.py --dataset phoenix2014 --dataset-root PHOENIX-2014-T-release-v3/PHOENIX-2014-T/`
- launch training
  `python main.py --work-dir exp/resnet34 --config baseline.yaml --device 0,1`
- checkpoint averaging (optional)
  Among all saved checkpoints, select the top-K (e.g. 5) checkpoints and put their absolute paths into a file named `checkpoint` under `exp/resnet34`, then run (a sketch of the averaging step follows this list):
  `python ckpt_avg.py --path exp/resnet34 --checkpoints 5 --output avg`
- extract sign features
  `python main.py --load-weights avg/average.pt --phase features --device 0 --num-feature-aug 10 --work-dir exp/resnet34 --config baseline.yaml`
  Then combine the different training features:
  `python sign_feature_cmb.py train*h5`
  At the end, you will have train/dev/test.h5 files as the sign feature inputs (a feature-inspection sketch follows this list).
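For reference, checkpoint averaging boils down to an element-wise mean of the saved weights. The sketch below is a minimal stand-in that assumes PyTorch checkpoints holding a plain state dict of tensors; `ckpt_avg.py` (and later `checkpoint_averaging.py`) in the repository are the authoritative implementations and may store checkpoints differently.

```python
# Minimal checkpoint averaging, assuming each file is a PyTorch state dict
# (name -> tensor). Real checkpoints may wrap the weights under another key.
import torch

def average_checkpoints(paths, output_path):
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    avg = {k: v / len(paths) for k, v in avg.items()}
    torch.save(avg, output_path)

# The paths would come from the `checkpoint` file listing the top-K checkpoints.
average_checkpoints(["best1.pt", "best2.pt", "best3.pt"], "avg/average.pt")
```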
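To sanity-check the resulting feature files, something like the following can be used. It assumes each `.h5` file maps a sample name to a `[num_frames, feature_dim]` array; the actual layout is whatever the feature-extraction and `sign_feature_cmb.py` scripts produce.

```python
# Quick inspection of the extracted sign features (layout assumed, see above).
import h5py

with h5py.File("train.h5", "r") as f:
    print(f"{len(f)} samples")
    for name in list(f.keys())[:3]:
        feats = f[name][:]            # load one feature matrix into memory
        print(name, feats.shape, feats.dtype)
```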
To train the SLTUnet model, see the given running script `train.sh` for reference.
- We saved the top-10 checkpoints based on dev set performance and averaged them before final evaluation:
  `python checkpoint_averaging.py --path path-to-best-ckpt-dir --checkpoints 10 --output avg --gpu 0`
- See the given running script `test.sh` for decoding.
- Regarding evaluation, please check out `eval/metrics.py` for details. For future evaluation and dataset construction, we suggest retaining punctuation and adopting detokenized BLEU (see the BLEU sketch after this list), e.g.
  `python eval/metrics.py -t slt -hyp model-output-file -ref gold-reference-file`
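For context, detokenized BLEU over untokenized, punctuation-preserving outputs can be computed with sacrebleu as in the sketch below; this only illustrates the idea, `eval/metrics.py` remains the reference implementation, and the sentences shown are made-up placeholders.

```python
# Detokenized BLEU with sacrebleu on plain (untokenized) sentences.
import sacrebleu

hyps = ["am tag regnet es im norden.", "morgen wird es sonnig."]   # model outputs
refs = ["am tag regnet es im norden.", "morgen scheint die sonne."]  # gold references

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")
```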