Normalize the silence duration of audio to make comma/silence trainable in DCTTS.
Because an audio clip sometimes contains multiple sentences, and the pauses between those sentences vary in length, the audio must be pre-processed before it can be used in DCTTS.
The scripts first split the audio at the silences so that all silence is removed, then insert a fixed duration of silence back between the split clips.
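The split-then-reinsert idea can be sketched on a plain list of samples. This is only an illustration, not the code in split.py or combine.py: the function names, the amplitude threshold, and the `min_silence` and `gap` parameters are all assumptions made for the example.

```python
def split_on_silence(samples, threshold=0.01, min_silence=3):
    """Split samples into non-silent chunks.

    A run of at least `min_silence` samples whose absolute value is below
    `threshold` counts as a pause and is dropped. Shorter quiet runs inside
    speech are kept. (All parameter names here are hypothetical.)
    """
    chunks, current, quiet_run = [], [], []
    for s in samples:
        if abs(s) < threshold:
            quiet_run.append(s)
            continue
        if len(quiet_run) >= min_silence:
            # Long pause: close the current chunk and discard the pause.
            if current:
                chunks.append(current)
            current = []
        elif current:
            # Short quiet run in the middle of speech: keep it.
            current.extend(quiet_run)
        quiet_run = []
        current.append(s)
    if current:
        chunks.append(current)  # trailing silence is discarded
    return chunks


def normalize_silence(samples, gap=4, threshold=0.01, min_silence=3):
    """Rejoin the chunks with exactly `gap` zero samples between them."""
    out = []
    for i, chunk in enumerate(split_on_silence(samples, threshold, min_silence)):
        if i > 0:
            out.extend([0.0] * gap)
        out.extend(chunk)
    return out
```

With this sketch, a pause of 10 samples and a pause of 3 samples both come out as a pause of exactly `gap` samples, which is the normalization the scripts aim for.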
Usage: X.py Geralt, where X.py is one of the scripts below and Geralt is the voice (character) name.
- Place the character's audio folder in the root directory and run split.py. A Geralt_output folder containing the split clips will be created.
- (optional) Select the audio clips that are very short (likely sighs and "hmm" sounds) and move them into a folder named {voice}_test, e.g. Geralt_test.
- (optional) Run transcribe.py. It transcribes all the clips and creates {voice}_transcription.csv.
- (optional) Run move.py. It reads the transcription CSV and moves every file that has a transcription from the test folder back to the output folder.
- (optional) Run rename.py. It renames the remaining clips in the test folder, using the sentence as the filename, for easier manual checking.
- (optional) After checking the clips and deleting the unwanted ones, run clean.py. It renames the clips that remain in the test folder back to their normal names and moves them back to the output folder.
- Run combine.py to merge the clips and insert a fixed silence between them. A {voice}_combined folder will be created.
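The final merge step can be sketched with Python's standard `wave` module. This is a minimal illustration under stated assumptions, not the actual combine.py: the function name `combine_wavs`, the `silence_sec` parameter, and the choice to join every clip into a single file are hypothetical, and the zero-byte gap is only true silence for signed PCM such as 16-bit WAV.

```python
import wave


def combine_wavs(paths, out_path, silence_sec=0.5):
    """Concatenate WAV clips, inserting `silence_sec` of silence between them.

    Assumes all clips share the same channel count, sample width and
    sample rate, and that the samples are signed PCM (e.g. 16-bit).
    """
    fmt = None
    frames = []
    for path in paths:
        with wave.open(path, "rb") as clip:
            clip_fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
            if fmt is None:
                fmt = clip_fmt
            elif clip_fmt != fmt:
                raise ValueError(f"{path} has a different audio format")
            frames.append(clip.readframes(clip.getnframes()))
    if fmt is None:
        raise ValueError("no clips to combine")

    nchannels, sampwidth, framerate = fmt
    gap_frames = int(round(silence_sec * framerate))
    # Zero bytes are silence for signed PCM samples.
    gap = b"\x00" * (gap_frames * sampwidth * nchannels)

    with wave.open(out_path, "wb") as out:
        out.setnchannels(nchannels)
        out.setsampwidth(sampwidth)
        out.setframerate(framerate)
        out.writeframes(gap.join(frames))
```

A call might look like `combine_wavs(sorted(clip_paths), "combined.wav", silence_sec=0.5)`, where `clip_paths` is a list of clips from the output folder; the output filename is a placeholder.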
convert_16k.sh converts all audio files to 16 kHz (required for DeepSpeech transcription).
convert_22k.sh converts all audio files to 22 kHz.
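For readers who prefer to resample from Python, the conversion can be sketched as a thin wrapper around ffmpeg. The contents of the shell scripts are not shown here, so this is an assumption about one way to do it: it presumes ffmpeg is installed and on the PATH, and the function names `resample_command` and `resample` are hypothetical.

```python
import subprocess


def resample_command(src, dst, rate=16000):
    """Build an ffmpeg command that resamples `src` to `rate` Hz."""
    # -y overwrites dst if it exists; -ar sets the audio sample rate.
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), dst]


def resample(src, dst, rate=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(resample_command(src, dst, rate), check=True)
```

Calling `resample("in.wav", "out.wav", rate=16000)` would produce a 16 kHz file suitable for the transcription step; the file names are placeholders.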