This project focuses on building a phoneme-driven Text-To-Speech (TTS) model. The model converts words to phonetic representations, a crucial step in speech synthesis: lexical orthographic symbols (words) are mapped to phoneme sequences.
The project is organized into three main components:
1. Word-to-Phoneme Model: utilizes the CMU Pronouncing Dictionary for word-to-phoneme conversion, and tokenizes words and phonemes for training.
2. Text-To-Speech Model: uses the Mozilla Common Voice dataset for audio data, generates phoneme sequences for each sentence with the previously trained word-to-phoneme model, and combines Convolutional Neural Networks (CNNs) with Long Short-Term Memory (LSTM) networks.
3. Phoneme-to-Word Model: converts phoneme sequences back to word pronunciations, employing LSTM networks for this task.
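As a minimal sketch of the data side of this pipeline, the snippet below shows word-to-phoneme lookup, phoneme tokenization, sentence-level phoneme sequence generation, and reverse phoneme-to-word mapping. A hypothetical three-entry dictionary stands in for the full CMU Pronouncing Dictionary, and the reverse mapping is an exact-match lookup rather than the trained LSTM model:

```python
# Tiny stand-in for the CMU Pronouncing Dictionary (hypothetical sample;
# real entries come from the cmudict data, e.g. via nltk or the cmudict package).
CMU_SAMPLE = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
    "speech": ["S", "P", "IY1", "CH"],
}

def word_to_phonemes(word):
    """Look up a word's phoneme sequence; None if out of vocabulary."""
    return CMU_SAMPLE.get(word.lower())

def sentence_to_phonemes(sentence):
    """Generate the phoneme sequence for a sentence, skipping OOV words."""
    phonemes = []
    for word in sentence.split():
        seq = word_to_phonemes(word)
        if seq:
            phonemes.extend(seq)
    return phonemes

# Phoneme tokenization for training: map each phoneme symbol to an integer id.
PHONEME_VOCAB = sorted({p for seq in CMU_SAMPLE.values() for p in seq})
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEME_VOCAB)}

def tokenize_phonemes(phonemes):
    return [PHONEME_TO_ID[p] for p in phonemes]

# Reverse lookup: phoneme sequence back to a word pronunciation
# (the LSTM model generalizes this beyond exact dictionary matches).
PHONEMES_TO_WORD = {tuple(seq): w for w, seq in CMU_SAMPLE.items()}

def phonemes_to_word(phonemes):
    return PHONEMES_TO_WORD.get(tuple(phonemes))

print(sentence_to_phonemes("hello world"))
# ['HH', 'AH0', 'L', 'OW1', 'W', 'ER1', 'L', 'D']
```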
This project is a demonstration and may require additional tuning for optimal performance. Feel free to experiment and adapt the models based on your specific requirements.
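As one starting point for experimentation, a CNN+LSTM acoustic model of the kind described above could be sketched as follows. PyTorch is an assumed framework choice, and all layer sizes and the mel-spectrogram output target are illustrative, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class PhonemeToSpectrogram(nn.Module):
    """Hypothetical CNN+LSTM acoustic model: phoneme ids -> spectrogram frames.
    Layer sizes are illustrative placeholders."""
    def __init__(self, n_phonemes=50, emb_dim=64, hidden=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, emb_dim)
        # 1-D convolution captures local phonetic context
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # LSTM models longer-range temporal structure
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_mels)

    def forward(self, phoneme_ids):
        x = self.embed(phoneme_ids)                       # (B, T, emb_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # convolve over time
        x, _ = self.lstm(x)                               # (B, T, 2 * hidden)
        return self.out(x)                                # (B, T, n_mels)

model = PhonemeToSpectrogram()
dummy = torch.randint(0, 50, (2, 17))  # batch of 2 phoneme-id sequences
print(model(dummy).shape)              # torch.Size([2, 17, 80])
```

A real model would also need an alignment or length-regulation mechanism, since phoneme sequences and audio frames differ in length; this sketch only maps each phoneme position to one output frame.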