Update README.md

techiaith · Oct 28, 2021 · b3770c2
1 parent 892a73d
commit b3770c2
Showing 1 changed file with 4 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -22,4 +22,8 @@ The texts were collected from various sources including out-of-copyright texts a
 
 Many of these texts were collected for input into Common Voice, a project by Mozilla that collects open data to create synthetic voices for world languages. This file therefore contains many of the same sentences found at https://github.com/techiaith/brawddegau-adnabod-lleferydd. In addition to those, this corpus also contains many sentences that were too long for the needs of Common Voice, or that contained characters or other content unsuitable for the recording prompts.
 
+## October 2021 Addition
+We have added to this corpus a subset of over 100k Welsh sentences selected from the CoVost Facebook corpus of machine-translated English Common Voice sentences. This subset (originally intended to serve as recording prompts) was created by filtering out sentences that exceeded 15 words, contained digits, acronyms or abbreviations, or contained words not found in the Bangor Welsh Lexicon (with some exceptions). See https://github.com/techiaith/brawddegau-adnabod-lleferydd/blob/master/data/covost/README.md for more details. As these sentences were not originally written in Welsh, we have kept them separate in a second file, cy_covost_subset.txt, so that you may decide whether to use them depending on your specific aims. Although these sentences are machine translated, a sample of the texts was reviewed by human editors, who found that fewer than 5% of the sentences were problematic (a figure that compares well with original Welsh texts found on the web). We have found these sentences useful because they cover a selection of topics, grammatical tenses and persons that are otherwise difficult to find in freely licensed texts. As a result, whilst we do not recommend using the cy_covost_subset.txt texts for cultural and sociolinguistic analysis of the Welsh language, we believe that they are valuable for training monolingual Welsh language models where there would otherwise be insufficient original Welsh text available.
+
+## Contributing
 You can help us increase the size of this corpus by donating any texts that you may own to us under the CC0 license so that they may be freely available. To do so, please contact techiaith@bangor.ac.uk.
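
The filtering criteria described in the October 2021 addition (at most 15 words, no digits, no acronyms, every word present in a lexicon) can be illustrated with a minimal Python sketch. This is not the script used to build cy_covost_subset.txt: the function name `keep_sentence`, the toy lexicon, and the all-uppercase acronym heuristic are assumptions made for illustration, and the abbreviation check and the lexicon exceptions mentioned in the README are left out.

```python
import string

# Characters stripped from the edges of each whitespace-separated token.
# Internal apostrophes (as in "i'r") are left in place.
EDGE_PUNCTUATION = string.punctuation + "’‘“”"

MAX_WORDS = 15  # length limit stated in the README


def keep_sentence(sentence, lexicon, max_words=MAX_WORDS):
    """Return True if the sentence passes all of the (assumed) filters."""
    # Reject any sentence that contains a digit.
    if any(ch.isdigit() for ch in sentence):
        return False

    # Tokenise on whitespace and drop surrounding punctuation.
    tokens = [tok.strip(EDGE_PUNCTUATION) for tok in sentence.split()]
    tokens = [tok for tok in tokens if tok]

    # Reject sentences longer than the word limit.
    if len(tokens) > max_words:
        return False

    # Heuristic (assumption): a token of two or more letters written
    # entirely in upper case is treated as an acronym.
    if any(len(tok) > 1 and tok.isupper() for tok in tokens):
        return False

    # Every remaining word must appear in the lexicon (case-insensitive).
    return all(tok.lower() in lexicon for tok in tokens)


# Hypothetical toy lexicon standing in for the Bangor Welsh Lexicon.
toy_lexicon = {"mae", "hi", "yn", "braf", "heddiw"}

candidates = [
    "Mae hi yn braf heddiw.",
    "Mae hi yn braf yn 2021.",
    "Mae BBC yn braf.",
]
kept = [s for s in candidates if keep_sentence(s, toy_lexicon)]
```

With a real lexicon loaded into a `set`, each lookup stays constant time, so the same function could be applied line by line to a large sentence file.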