This project demonstrates fine-tuning a multilingual Transformer (e.g., XLM-RoBERTa) on a subset of SQuAD v1.1 (English), then evaluating zero-shot performance on the XQuAD dataset (Spanish). By training only on English data, we test how well the model transfers its knowledge to another language without further fine-tuning.
- **Load and Subset SQuAD**
  - SQuAD v1.1 is a large English QA dataset.
  - For demonstration and to keep training time low, we use a small subset (e.g., 200 examples for training, 50 for validation).
- **Load and Subset XQuAD**
  - XQuAD is a multilingual variant of SQuAD, translated into multiple languages.
  - We focus on the Spanish split and again use a small subset (e.g., 50 examples) to reduce runtime.
- **Preprocess**
  - Tokenize `(context, question)` pairs and compute start/end positions for the answers.
  - Keep crucial fields (`offset_mapping`, `example_id`, `context_text`) to reconstruct answers during post-processing.
- **Train**
  - We use Hugging Face's `Trainer` API to fine-tune `xlm-roberta-base` for a few epochs on the English subset of SQuAD (see the sketch after this list).
- **Evaluate**
  - Evaluate the model on (a) the English SQuAD validation subset and (b) zero-shot on the Spanish XQuAD subset.
  - Post-process predictions to convert start/end logits into text answers, and compute Exact Match (EM) and F1 with the official SQuAD metric.
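The snippet below is a minimal sketch of these steps, not the literal contents of `main.py`: the variable names, hyperparameters, and sequence length are assumptions chosen for illustration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# 1) Load and subset SQuAD (English) and XQuAD (Spanish).
squad = load_dataset("squad")
train_dataset = squad["train"].select(range(200))
valid_dataset = squad["validation"].select(range(50))
xquad_es = load_dataset("xquad", "xquad.es")["validation"].select(range(50))

# 2) Preprocess: tokenize (question, context) pairs and map each character-level
#    answer span to token-level start/end positions via the offset mapping.
def preprocess(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        padding="max_length",
        return_offsets_mapping=True,
    )
    start_positions, end_positions = [], []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        answer = examples["answers"][i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = tokenized.sequence_ids(i)
        # Default to position 0 if the answer was truncated out of the window.
        start_tok = end_tok = 0
        for idx, (start, end) in enumerate(offsets):
            if sequence_ids[idx] != 1:  # only look at context tokens
                continue
            if start <= start_char < end:
                start_tok = idx
            if start < end_char <= end:
                end_tok = idx
        start_positions.append(start_tok)
        end_positions.append(end_tok)
    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

train_features = train_dataset.map(
    preprocess, batched=True, remove_columns=train_dataset.column_names
)
valid_features = valid_dataset.map(
    preprocess, batched=True, remove_columns=valid_dataset.column_names
)

# 3) Fine-tune with the Trainer API (hyperparameters are illustrative).
args = TrainingArguments(
    output_dir="xlmr-squad-subset",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_features,
    eval_dataset=valid_features,
)
trainer.train()
```

Evaluation works the same way for English and Spanish: preprocess the evaluation split (keeping `offset_mapping` and the example ids), run `trainer.predict(...)`, map the predicted start/end logits back to text spans in the original context, and score with the SQuAD metric (e.g., `evaluate.load("squad")`). That post-processing step is omitted here for brevity; `main.py` contains the full logic.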
A typical structure might look like this:

```
.
├── main.py            <- Main script
├── train.py
├── test.py
├── consts.py
├── README.md          <- (this file)
└── requirements.txt   <- optional, if you want to list exact dependencies
```
- `main.py`: The main code script with all the data loading, preprocessing, training, and evaluation logic.
- `README.md`: This documentation.
- `requirements.txt`: (Optional) A list of Python dependencies (e.g., `transformers`, `datasets`, `torch`); see the example after this list.
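If you do add a `requirements.txt`, a minimal, unpinned version might look like the following; the optional `evaluate` entry is an assumption, included only because it provides the SQuAD EM/F1 metric, and any version pins are up to you.

```text
transformers
datasets
torch
evaluate
```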
1. Create and activate a virtual environment (recommended, but optional):

   ```bash
   python -m venv env
   source env/bin/activate
   ```

2. Install the requirements:

   ```bash
   pip install -r requirements.txt
   ```

3. Run the script:

   ```bash
   python main.py
   ```
4. Expected output:
   - The script will load a small subset of SQuAD (200 training examples, 50 validation examples).
   - It will tokenize the examples and compute start/end positions, then train for a small number of epochs (e.g., 1 epoch).
   - It will evaluate on the English validation subset, printing Exact Match and F1 scores.
   - Finally, it will predict on the Spanish XQuAD subset (50 examples) and compute the same metrics zero-shot.
- **Training Data Size**
  - In `train.py`, you'll see something like:

    ```python
    train_dataset = squad_dataset["train"].select(range(200))
    valid_dataset = squad_dataset["validation"].select(range(50))
    ```

    Increase (or decrease) these values for more (or fewer) examples.
- **Number of Epochs**
  - Adjust `num_train_epochs` in `TrainingArguments` to train longer on the small subset (or for fewer epochs if you just want a quick test); see the sketch after this list.
- **Batch Size**
  - Change `per_device_train_batch_size` and `per_device_eval_batch_size` to accommodate your GPU memory.
- **Different Languages**
  - The script loads Spanish from XQuAD by specifying `"es"`.
  - You can try `"ar"`, `"hi"`, `"zh"`, etc., if available in XQuAD or other multilingual QA datasets.
- **Larger Models**
  - Swap `model_name = "xlm-roberta-base"` for `"xlm-roberta-large"` or `"bert-base-multilingual-cased"` if you have sufficient resources.
- **Using the Full Dataset**
  - Remove or adjust the `.select(...)` lines if you want to train on all of SQuAD, but note that training time will increase significantly.
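Most of these knobs live in a handful of lines. The fragment below is a hedged illustration of where they sit; the variable names and default values are assumptions, not the literal contents of the scripts.

```python
from datasets import load_dataset
from transformers import TrainingArguments

# Model and language choices (swap as described above).
model_name = "xlm-roberta-base"   # or "xlm-roberta-large", "bert-base-multilingual-cased"
xquad_lang = "es"                 # or "ar", "hi", "zh", ... (XQuAD configs are named "xquad.<lang>")
xquad = load_dataset("xquad", f"xquad.{xquad_lang}")

# Training hyperparameters (illustrative values).
args = TrainingArguments(
    output_dir="outputs",
    num_train_epochs=1,              # raise for longer training on the subset
    per_device_train_batch_size=8,   # adjust for your GPU memory
    per_device_eval_batch_size=8,
)
```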
- Hugging Face Transformers
- Hugging Face Datasets (SQuAD)
- Hugging Face Datasets (XQuAD)
- SQuAD v1.1 Paper
- XQuAD Paper
This code references open-source datasets (SQuAD, XQuAD) and uses the Apache 2.0 license from Hugging Face Transformers. Ensure compliance with any dataset-specific licenses if you plan to use them for production or commercial research.
Feel free to open issues or discussions if you have questions about multilingual QA, zero-shot transfer, or fine-tuning larger language models. Contributions and ideas for improvement are always welcome!