Give Feedback 📑: DSFSI Resource Feedback Form
We believe that PuoData is a valuable resource for the Setswana language community, and we hope it will be used to develop new and innovative applications that benefit Setswana speakers.
Dataset Name | Kind | Num. of Tokens |
---|---|---|
PuoData | ||
NCHLT Setswana \cite{eiselen2014developing} | Government Documents | 1,010,147 |
Nalibali Setswana | Children's Books | 57,654 |
Setswana Bible | Book(s) | 879,630 |
SA Constitution | Official Document | 56,194 |
Leipzig Setswana Corpus BW | Curated Dataset | 219,149 |
Leipzig Setswana Corpus ZA | Curated Dataset | 218,037 |
SABC Dikgang tsa Setswana FB (Facebook) | News Headlines | 167,119 |
SABC MotswedingFM FB | Online Content | 33,092 |
Leipzig Setswana Wiki | Online Content | 230,333 |
Setswana Wiki | Online Content | 183,168 |
Vukuzenzele Monolingual TSN | Government News | 157,798 |
gov-za Cabinet speeches TSN | Government Speeches | 591,920 |
Department Basic Education TSN | Education Material | 708,965 |
PuoData Total | 25MB on disk | 4,513,206 |
PuoData+JW300 | ||
JW300 Setswana | Book(s) | 19,782,122 |
PuoData+JW300 | 124MB on disk | 24,295,328 |
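
The totals in the table can be cross-checked with a few lines of Python. This is only a minimal sketch: the token counts are copied by hand from the rows above, and the variable names (`puodata_sources`, `JW300_TOKENS`) are illustrative and not part of any PuoData tooling.

```python
# Token counts copied by hand from the table above (illustrative names).
puodata_sources = {
    "NCHLT Setswana": 1_010_147,
    "Nalibali Setswana": 57_654,
    "Setswana Bible": 879_630,
    "SA Constitution": 56_194,
    "Leipzig Setswana Corpus BW": 219_149,
    "Leipzig Setswana Corpus ZA": 218_037,
    "SABC Dikgang tsa Setswana FB": 167_119,
    "SABC MotswedingFM FB": 33_092,
    "Leipzig Setswana Wiki": 230_333,
    "Setswana Wiki": 183_168,
    "Vukuzenzele Monolingual TSN": 157_798,
    "gov-za Cabinet speeches TSN": 591_920,
    "Department Basic Education TSN": 708_965,
}
JW300_TOKENS = 19_782_122

# Sum the individual sources, then add JW300 for the combined corpus.
puodata_total = sum(puodata_sources.values())
combined_total = puodata_total + JW300_TOKENS

print(f"PuoData total:       {puodata_total:,}")
print(f"PuoData+JW300 total: {combined_total:,}")
print(f"JW300 share of combined corpus: {JW300_TOKENS / combined_total:.1%}")
```

The two sums agree with the "PuoData Total" and "PuoData+JW300" rows of the table.
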
We used this corpus to train PuoBERTa, 🤗 https://huggingface.co/dsfsi/PuoBERTa. It is also part of the corpus used for PuoBERTaJW300.
BibTeX Reference
@inproceedings{marivate2023puoberta,
title = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
author = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
year = {2023},
booktitle= {SACAIR 2023 (To Appear)},
keywords = {NLP},
preprint_url = {https://arxiv.org/abs/2310.09141},
dataset_url = {https://github.com/dsfsi/PuoBERTa},
software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}
PuoData is licensed under CC-BY-SA-4.0. The monolingual data have different licenses, depending on the license of the source news website.
- License for Data - CC-BY-SA-4.0
For more details, reach out or check our website.
Email: vukosi.marivate@cs.up.ac.za
Enjoy exploring Setswana through AI!