This project focuses on correcting the FLORES evaluation dataset (dev and devtest) for four African languages: Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu. The original dataset, though groundbreaking in its coverage of low-resource languages, contained several inconsistencies and inaccuracies in these languages that could degrade the quality of evaluation in Natural Language Processing (NLP) tasks, especially machine translation.
Native speakers meticulously reviewed and corrected the dataset for each language to improve its accuracy and reliability, with the goal of strengthening the integrity of downstream NLP tasks that rely on this data.
- Reviewed and Corrected Errors: Identified and implemented corrections to translation inconsistencies and inaccuracies in the dataset.
- Statistical Analysis: Conducted statistical comparisons between the original and corrected datasets, highlighting the differences and improvements made.
- Improved Dataset Quality: Enhanced linguistic accuracy and reliability, ensuring more reliable evaluation of NLP systems for these languages.
- Hausa: The Hausa translations revealed numerous inconsistencies, particularly unclear or incoherent phrasing, suggesting that a significant portion may have been automatically generated. Comparisons with Google Translate showed that many of the incorrect lexical choices aligned with Google's outputs, raising further concerns about the quality of the original translations. Additional issues included improperly translated named entities and the frequent omission of the special characters of the standardized Hausa alphabet (such as ɓ, ɗ, and ƙ).
- Northern Sotho (Sepedi): The Northern Sotho translations needed improvement in vocabulary consistency, syntax, and the accurate rendering of technical terms. While most of the text was accurately translated, minor corrections were necessary to enhance clarity, including adjustments to borrowed words and proper spacing. Notably, some translations omitted important terms, such as "scientific" when referring to scientific tools, which affected the overall meaning.
- Xitsonga: The Xitsonga translations contained several vocabulary accuracy issues and improper uses of borrowed terms that led to misunderstandings. Errors included incorrect translations of phrases such as "Type 1 diabetes" and uniform translations lacking contextual variation, which hindered clarity. Spelling errors and inappropriate borrowing significantly reduced translation quality, underscoring the need for proper native-language usage.
- isiZulu: The isiZulu translations suffered from vocabulary inconsistencies, syntax errors, and difficulties in expressing technical terms, compounded by the language's agglutinative structure. Key problems included incorrect grammatical structures for time expressions and the unnecessary borrowing of English terms, which disrupted the linguistic flow. Terminology was standardized throughout the corrections to ensure grammatical accuracy and clarity.
| lang. | split | #corr. (%) | #tokens (orig.) | #tokens (corr.) | Δ tokens | % div. |
|-------|---------|------------|-----------------|-----------------|----------|--------|
| hau | dev | 632 (63.4) | 17,948 | 18,073 | 125 | 24.7 |
| hau | devtest | 70 (6.9) | 2,006 | 1,978 | 28 | 49.2 |
| nso | dev | 67 (6.7) | 2,226 | 2,271 | 45 | 28.9 |
| nso | devtest | 62 (6.1) | 2,082 | 2,105 | 23 | 28.0 |
| tso | dev | - | - | - | - | - |
| tso | devtest | 83 (6.1) | 2,919 | 2,947 | 28 | 27.4 |
| zul | dev | 190 (19.1) | 3,605 | 3,588 | 17 | 23.7 |
| zul | devtest | 226 (22.3) | 4,414 | 4,396 | 18 | 31.8 |
Table: Data statistics for the dev (997 sentences) and devtest (1,012 sentences) sets; #corr. (%) → number of sentences requiring at least one correction (percentage of the split); #tokens (orig.) → original token count; #tokens (corr.) → corrected token count; Δ tokens → absolute difference between the original and corrected token counts; % div. → percentage of token divergence.
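As a rough guide, the sentence- and token-level counts above could be recomputed along the following lines. This is a minimal sketch: the file names are hypothetical, whitespace tokenization is an assumption that may differ from the tokenizer used in the paper, and the % div. column depends on the paper's token alignment, so it is not reproduced here.

```python
# Sketch for recomputing the sentence- and token-level statistics above.
# File names are hypothetical; tokenization is plain whitespace splitting.

def load(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

original = load("dev.hau.orig")        # hypothetical path
corrected = load("dev.hau.corrected")  # hypothetical path

# A sentence counts as corrected if it differs at all from the original.
n_corr = sum(o != c for o, c in zip(original, corrected))
tok_o = sum(len(s.split()) for s in original)
tok_c = sum(len(s.split()) for s in corrected)

print(f"#corr.: {n_corr} ({100 * n_corr / len(original):.1f}%)")
print(f"#tokens (orig.): {tok_o}, #tokens (corr.): {tok_c}, Δ tokens: {abs(tok_o - tok_c)}")
```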
| lang. | split | TER (score) | TER (#edits) | BLEU | COMET |
|-------|---------|-------------|--------------|------|-------|
| hau | dev | 19.2 | 3,107 | 72.0 | 54.1 |
| hau | devtest | 40.4 | 711 | 56.6 | 42.1 |
| nso | dev | 22.4 | 472 | 68.5 | 55.2 |
| nso | devtest | 21.2 | 409 | 71.8 | 55.9 |
| tso | dev | - | - | - | - |
| tso | devtest | 20.9 | 547 | 73.9 | 58.4 |
| zul | dev | 17.2 | 524 | 76.3 | 53.0 |
| zul | devtest | 23.6 | 879 | 70.6 | 53.0 |
Table: Similarity between the original and corrected FLORES evaluation data for the four African languages, computed with the original sentences as predictions and the corrected sentences as reference translations.
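Scores of this kind could be recomputed with the `sacrebleu` and `unbabel-comet` packages, as in the sketch below. The file names and the COMET checkpoint are assumptions (the paper may have used a different checkpoint, and COMET additionally needs the English source sentences); note that raw COMET outputs are on a 0-1 scale, so scaled values like those above imply a factor of 100.

```python
# pip install sacrebleu unbabel-comet
from sacrebleu.metrics import BLEU, TER
from comet import download_model, load_from_checkpoint

def load(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

# Hypothetical file names: English source, original and corrected Hausa devtest.
src = load("devtest.eng")
orig = load("devtest.hau.orig")
corr = load("devtest.hau.corrected")

# TER and BLEU: original sentences as hypotheses, corrected as the reference set.
ter = TER().corpus_score(orig, [corr])
bleu = BLEU().corpus_score(orig, [corr])
print(f"TER: {ter.score:.1f} (#edits: {ter.num_edits}), BLEU: {bleu.score:.1f}")

# COMET: this checkpoint is an assumption, not necessarily the one used in the paper.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": o, "ref": c} for s, o, c in zip(src, orig, corr)]
print(f"COMET: {model.predict(data, batch_size=8, gpus=0).system_score:.3f}")
```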
This repository contains the corrected versions of the FLORES dev and devtest sets for the four languages. You can use these corrected datasets for more reliable evaluation of machine translation and other NLP tasks for African languages, for example as shown below.
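For instance, a system's translations could be scored against the corrected references with sacreBLEU; the file names here are hypothetical placeholders.

```python
# Sketch: evaluating your own MT output against the corrected references.
import sacrebleu

with open("devtest.hau.corrected", encoding="utf-8") as f:
    refs = [line.strip() for line in f]
with open("my_system_output.hau", encoding="utf-8") as f:  # hypothetical system output
    hyps = [line.strip() for line in f]

print(sacrebleu.corpus_bleu(hyps, [refs]).score)
```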
We welcome contributions and suggestions to further enhance the dataset. If you would like to contribute, please submit a pull request or open an issue.
Special thanks to the native speaker annotators—university students and researchers—who volunteered to correct translations in their native languages. Their valuable contributions are crucial to the development and preservation of these low-resource languages in NLP.
If you use these corrections in your research, please cite our paper:
@misc{abdulmumin2024correctingfloresevaluationdataset,
title={Correcting FLORES Evaluation Dataset for Four African Languages},
author={Idris Abdulmumin and Sthembiso Mkhwanazi and Mahlatse S. Mbooi and Shamsuddeen Hassan Muhammad and Ibrahim Said Ahmad and Neo Putini and Miehleketo Mathebula and Matimba Shingange and Tajuddeen Gwadabe and Vukosi Marivate},
year={2024},
eprint={2409.00626},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.00626},
}
We hope these corrections will improve your NLP research and contribute to the growing body of work on African languages!