Code-switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation
Ching-Ting Chang, Shun-Po Chuang, Hung-Yi Lee
Code-switching is the practice of alternating between languages in speech or text. It is partially speaker-dependent and domain-related, so explaining the phenomenon completely by linguistic rules is challenging. Compared to most monolingual tasks, insufficient data is an issue for code-switching. To mitigate this issue without expensive human annotation, we propose an unsupervised method for code-switching data augmentation. By utilizing a generative adversarial network, we generate intra-sentential code-switching sentences from monolingual sentences. We applied the proposed method to two corpora, and the results show that the generated code-switching sentences improve the performance of code-switching language models.
- Introduction
- Methodology
- Experimental setup
- Corpora
- Model Setup
- Results
- Code-switching Point Prediction
- Generated Text Quality
- Language Modeling
- Examples
- Conclusion
- LectureSS: The recording of the "Signal and System" (SS) course taught by a Taiwanese instructor at National Taiwan University in 2006.
- SEAME: South East Asia Mandarin-English, a conversational speech corpus recorded from Singaporean and Malaysian speakers, with an almost balanced gender ratio, at Nanyang Technological University and Universiti Sains Malaysia.
- Python packages
- python 3
- keras 2
- numpy
- jieba
- h5py
- tqdm
- Data
- text files
- Training set
- corpus/XXX/text/train.mono.txt: Mono sentences in H
- corpus/XXX/text/train.cs.txt: CS sentences
- Development set
- corpus/XXX/text/dev.mono.txt: Mono sentences in H translated from CS sentences (aligned line-by-line with dev.cs.txt)
- corpus/XXX/text/dev.cs.txt: CS sentences
- Testing set
- corpus/XXX/text/test.mono.txt: Mono sentences in H
- Note
- Sentences should be segmented into words, separated by spaces.
- Words are based on the H language.
- If a word in the H language maps to a phrase in the G language, we connect the phrase with dashes into one word (e.g., at-any-time), as illustrated below.
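To make the dash convention concrete, here is a minimal Python sketch; the `translation` mapping is a hypothetical stand-in for the actual table in local/XXX/translator.txt:

```python
# Hypothetical H-word -> G-phrase mapping, for illustration only;
# the real mapping comes from local/XXX/translator.txt.
translation = {"在任意時間": "at any time"}

def to_token(h_word):
    # A multi-word G phrase is joined with dashes into a single token.
    return translation[h_word].replace(" ", "-")

print(to_token("在任意時間"))  # -> at-any-time
```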
- Training set
- local/XXX/translator.txt: Translation table from the H language to the G language (a loading sketch follows this list)
- local/XXX/dict.txt: Word list for training word embeddings
- local/postag.txt: POS tag list for training POS embeddings
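A rough illustration of loading the translation table; the tab-separated, one-entry-per-line format assumed here is hypothetical, since the actual file layout is not documented:

```python
def load_translator(path):
    """Load an H-word -> G-translation table.

    Assumes (hypothetically) one entry per line, with the H word and its
    G translation separated by a tab; adjust to the actual file format.
    """
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 2:
                table[fields[0]] = fields[1]
    return table

# translator = load_translator("local/XXX/translator.txt")
```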
- Example of the text files:
| Type | Example |
|---|---|
| CS | Causality 這個 也是 你 所 讀 過 的 就是 指 我 output at-any-time 只 depend-on input |
| Mono from CS in H | 因果性 這個 也是 你 所 讀 過 的 就是 指 我 輸出 在任意時間 只 取決於 輸入 |
- Note
- Mono: monolingual
- CS: code-switching
- H: host (language)
- G: guest (language)
- ASR: automatic speech recognition
- Use Jieba to obtain the part-of-speech (POS) tags of the text files for the proposed + POS model (see the tagging sketch after the path list below)
- Path:
- Training set
- corpus/XXX/pos/train.mono.txt: POS of Mono sentences of training set
- corpus/XXX/pos/train.cs.txt: POS of CS sentences of training set
- Development set
- corpus/XXX/pos/dev.mono.txt: POS of Mono sentences of development set
- Testing set
- corpus/XXX/pos/test.mono.txt: POS of Mono sentences of testing set
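A minimal sketch of POS tagging with Jieba; since the corpus sentences are already space-segmented, this sketch tags word by word (the sample sentence is illustrative only):

```python
import jieba.posseg as pseg

# pseg.cut yields (word, POS-flag) pairs for each input string.
sentence = "因果性 這個 也是 你 所 讀 過 的"
pos_tags = []
for word in sentence.split():
    for w, flag in pseg.cut(word):
        pos_tags.append(flag)
print(" ".join(pos_tags))
```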
- Training set
- Path:
- Baselines:
- ZH
- EN
- Random (a sketch follows this list)
- Noun
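As an example, the Random baseline can be sketched as switching each translatable H word with some probability; the switching rule and the probability `p` below are assumptions for illustration:

```python
import random

def random_baseline(words, translator, p=0.5):
    """Sketch of a Random baseline: each H word that has a G translation
    is switched with (hypothetical) probability p."""
    return [translator[w] if w in translator and random.random() < p else w
            for w in words]

# Example usage with a toy translation table:
# random_baseline("因果性 這個 也是".split(), {"因果性": "causality"})
```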
- Precision
- Recall
- F-measure (a computation sketch for these metrics follows this list)
- BLEU-1
- Word Error Rate (WER)
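A minimal sketch of computing precision, recall, and F-measure for code-switching point prediction, assuming the reference and predicted switching points are given as sets of token indices:

```python
def precision_recall_f(reference, predicted):
    """Precision/recall/F-measure over sets of code-switching point indices."""
    tp = len(reference & predicted)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)
    return precision, recall, f

print(precision_recall_f({1, 4, 7}, {1, 4, 5}))  # ~(0.67, 0.67, 0.67)
```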
- Installation
- N-gram model
- Recurrent Neural Networks based Language Model (RNNLM)
This is an extended experiment that is not shown in the paper.
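A rough sketch of such an RNNLM in Keras 2; all hyperparameters below are placeholders, not the settings used in the experiments:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 10000  # placeholder vocabulary size
embed_dim = 300     # placeholder word-embedding dimension
hidden_dim = 512    # placeholder LSTM size
seq_len = 20        # placeholder context length

# Predict the next word from the preceding seq_len words.
model = Sequential()
model.add(Embedding(vocab_size, embed_dim, input_length=seq_len))
model.add(LSTM(hidden_dim))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.summary()
```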