Curated corpora for Setswana. Used to train PuoBERTa.

dsfsi/PuoData

PuoData: Curated corpora for Setswana

Give Feedback 📑: DSFSI Resource Feedback Form

We believe that PuoData is a valuable resource for the Setswana language community. We hope that PuoData will be used to develop new and innovative applications that benefit the Setswana-speaking community.

Dataset Curation

| Dataset Name | Kind | Num. of Tokens |
|---|---|---|
| PuoData | | |
| NCHLT Setswana \cite{eiselen2014developing} | Government Documents | 1,010,147 |
| Nalibali Setswana | Children's Books | 57,654 |
| Setswana Bible | Book(s) | 879,630 |
| SA Constitution | Official Document | 56,194 |
| Leipzig Setswana Corpus BW | Curated Dataset | 219,149 |
| Leipzig Setswana Corpus ZA | Curated Dataset | 218,037 |
| SABC Dikgang tsa Setswana FB (Facebook) | News Headlines | 167,119 |
| SABC MotswedingFM FB | Online Content | 33,092 |
| Leipzig Setswana Wiki | Online Content | 230,333 |
| Setswana Wiki | Online Content | 183,168 |
| Vukuzenzele Monolingual TSN | Government News | 157,798 |
| gov-za Cabinet speeches TSN | Government Speeches | 591,920 |
| Department Basic Education TSN | Education Material | 708,965 |
| PuoData Total | 25MB on disk | 4,513,206 |
| PuoData+JW300 | | |
| JW300 Setswana | Book(s) | 19,782,122 |
| PuoData+JW300 Total | 124MB on disk | 24,295,328 |
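
As a quick sanity check on the table above, the per-source token counts can be summed and compared with the reported totals. The sketch below is only illustrative: the counts are copied from the table, while the names `PUODATA_SOURCES`, `JW300_TOKENS`, and `token_shares` are hypothetical helpers introduced here and are not part of the repository.

```python
# Token counts copied from the table above (source name -> number of tokens).
# The dictionary and helper names are illustrative, not part of PuoData itself.
PUODATA_SOURCES = {
    "NCHLT Setswana": 1_010_147,
    "Nalibali Setswana": 57_654,
    "Setswana Bible": 879_630,
    "SA Constitution": 56_194,
    "Leipzig Setswana Corpus BW": 219_149,
    "Leipzig Setswana Corpus ZA": 218_037,
    "SABC Dikgang tsa Setswana FB": 167_119,
    "SABC MotswedingFM FB": 33_092,
    "Leipzig Setswana Wiki": 230_333,
    "Setswana Wiki": 183_168,
    "Vukuzenzele Monolingual TSN": 157_798,
    "gov-za Cabinet speeches TSN": 591_920,
    "Department Basic Education TSN": 708_965,
}
JW300_TOKENS = 19_782_122


def token_shares(sources):
    """Return each source's share of the total token count, in percent."""
    total = sum(sources.values())
    return {name: 100.0 * count / total for name, count in sources.items()}


puodata_total = sum(PUODATA_SOURCES.values())
combined_total = puodata_total + JW300_TOKENS

if __name__ == "__main__":
    print(f"PuoData total:       {puodata_total:,}")
    print(f"PuoData+JW300 total: {combined_total:,}")
    # List sources from largest to smallest share of PuoData.
    shares = token_shares(PUODATA_SOURCES)
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{share:5.1f}%  {name}")
```

The sums agree with the totals in the table (4,513,206 and 24,295,328 tokens). They also show that JW300 contributes roughly 81% of the combined PuoData+JW300 corpus.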

Dataset Uses

We used this corpus to train PuoBERTa (🤗 https://huggingface.co/dsfsi/PuoBERTa). It also forms part of the corpus used to train PuoBERTaJW300.

Citation Information

Bibtex Reference

@inproceedings{marivate2023puoberta,
  title   = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author  = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year    = {2023},
  booktitle= {SACAIR 2023 (To Appear)},
  keywords = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}

License

PuoData is licensed under CC-BY-SA-4.0. The monolingual data have different licenses, depending on the license of the source news website.

Dataset Contact

For more details, reach out or check our website.

Email: vukosi.marivate@cs.up.ac.za

Enjoy exploring Setswana through AI!