Deepang Raval¹ | Vyom Pathak¹ | Muktan Patel¹ | Brijesh Bhatt¹
¹ Dharmsinh Desai University, Nadiad
We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN) layers, Bi-directional Long Short Term Memory (BiLSTM) layers, Dense layers, and Connectionist Temporal Classification (CTC) as a loss function. To improve the performance of the system given the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and a Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we propose different analysis methods. These insights help in understanding our ASR system for a particular language (Gujarati) and can guide ASR systems toward better performance for low-resource languages. We trained the model on the Microsoft Speech Corpus, and we observe a 5.11% decrease in Word Error Rate (WER) with respect to the base-model WER.
If you find this work useful, please cite this work using the following BibTeX:
@inproceedings{raval-etal-2020-end,
title = "End-to-End Automatic Speech Recognition for {G}ujarati",
author = "Raval, Deepang and
Pathak, Vyom and
Patel, Muktan and
Bhatt, Brijesh",
booktitle = "Proceedings of the 17th International Conference on Natural Language Processing (ICON)",
month = dec,
year = "2020",
address = "Indian Institute of Technology Patna, Patna, India",
publisher = "NLP Association of India (NLPAI)",
url = "https://aclanthology.org/2020.icon-main.56",
pages = "409--419",
abstract = "We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory (BiLSTM) layers, Dense layers, and Connectionist Temporal Classification (CTC) as a loss function. In order to improve the performance of the system with the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we proposed different analysis methods. These insights help to understand our ASR system based on a particular language (Gujarati) as well as can govern ASR systems{'} to improve the performance for low resource languages. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.11{\%} decrease in Word Error Rate (WER) with respect to base-model WER.",
}
- Linux OS
- Python-3.6
- TensorFlow-2.2.0
- CUDA-11.1
- CUDNN-7.6.5
git clone https://github.com/01-vyom/End_2_End_Automatic_Speech_Recognition_For_Gujarati.git
python -m venv asr_env
source $PWD/asr_env/bin/activate
Change directory to the root of the repository.
pip install --upgrade pip
pip install -r requirements.txt
Change directory to the root of the repository.
To train the model in the paper, run this command:
python ./Train/train.py
Note:
- If required, change the variables `PathDataAudios` and `PathDataTranscripts` in the `Train/feature_extractor.py` file to point to the audio files and the transcript file.
- If required, change the variable `currmodel` in the `Train/train.py` file to change the name under which the model is saved.
To run inference with the trained model, run:
python ./Eval/inference.py
Note:
- Change the variables `PathDataAudios` and `PathDataTranscripts` to point to the audio files and the transcript file used for testing.
- To use a different model for inference, change the `model` variable; to use a different file for testing, change the `test_data` variable.
- The output is a `.pickle` file of references and hypotheses, with a model-specific name, stored in the `./Eval/` folder.
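The internal layout of the output `.pickle` file is not described here, so it can help to inspect it before decoding. The following is a minimal sketch for doing that; the file name `demo_output.pickle` and the dictionary keys `references` and `hypotheses` are hypothetical stand-ins, not the names used by `Eval/inference.py`.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickle file and return the stored object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(obj, max_items=3):
    """Return a short, human-readable summary of a loaded object."""
    if isinstance(obj, dict):
        keys = list(obj)[:max_items]
        return f"dict with {len(obj)} keys, first keys: {keys}"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} with {len(obj)} items"
    return type(obj).__name__


if __name__ == "__main__":
    # Hypothetical stand-in for the inference output; the real file
    # produced by the repository may be structured differently.
    demo_path = Path("demo_output.pickle")
    with open(demo_path, "wb") as handle:
        pickle.dump({"references": ["a b"], "hypotheses": ["a c"]}, handle)
    print(describe(load_pickle(demo_path)))
    demo_path.unlink()
```

Only load pickle files that you produced yourself, since unpickling untrusted data can execute arbitrary code.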
To decode the inferred output, run:
python ./Eval/decode.py
Note:
- To select a model-specific `.pickle` file, change the `model` variable.
- The output is stored in `./Eval/` under a model-specific name and contains every type of decoding along with the actual text.
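For orientation, greedy decoding is the baseline against which the other decoding types are compared. The sketch below shows the standard CTC greedy rule (take the most likely symbol per frame, collapse repeats, drop blanks) in plain Python. It is an illustration under assumptions, not the code in `Eval/decode.py`: the function name, the toy alphabet, and the choice of index 0 for the blank symbol are all hypothetical.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank_index=0):
    """Greedy CTC decoding.

    frame_probs: list of per-frame probability lists, one entry per symbol.
    alphabet: list mapping symbol index to character; the entry at
              blank_index is a placeholder and is never emitted.
    """
    # Step 1: pick the most likely symbol index for every frame.
    best_path = [
        max(range(len(probs)), key=lambda i: probs[i]) for probs in frame_probs
    ]

    # Step 2: collapse consecutive repeats, then drop blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank_index:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


if __name__ == "__main__":
    # Toy alphabet (assumption): index 0 is the CTC blank.
    alphabet = ["-", "a", "b"]
    frames = [
        [0.1, 0.8, 0.1],    # a
        [0.2, 0.7, 0.1],    # a (repeat, collapsed)
        [0.9, 0.05, 0.05],  # blank (separates the two a's)
        [0.1, 0.6, 0.3],    # a
        [0.1, 0.2, 0.7],    # b
        [0.1, 0.1, 0.8],    # b (repeat, collapsed)
    ]
    print(ctc_greedy_decode(frames, alphabet))
```

The blank frame between the two runs of `a` is what allows a doubled character to survive the repeat-collapsing step.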
For post-processing the decoded output, follow the steps mentioned in this README.
To perform the system analysis, run:
python "./System Analysis/system_analysis.py"
Note:
- To select a model-specific decoding `.csv` file to analyze, change the `model` variable.
- To select a specific column (hypothesis type) to analyze, change the `type` variable. The output files are saved in `./System Analysis/`, specific to the model and the type of decoding.
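As an illustration of what selecting a hypothesis column means, here is a minimal sketch that pairs the actual text with one chosen hypothesis column from a decoding CSV. The column names `actual`, `greedy`, and `prefix` and the helper `load_pairs` are hypothetical; check the header of the real `.csv` file for the names it actually uses.

```python
import csv
import io


def load_pairs(handle, reference_column, hypothesis_column):
    """Return (reference, hypothesis) pairs from an open CSV file handle."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    for name in (reference_column, hypothesis_column):
        if name not in fieldnames:
            raise KeyError(f"column {name!r} not found; available: {fieldnames}")
    return [(row[reference_column], row[hypothesis_column]) for row in reader]


if __name__ == "__main__":
    # In-memory stand-in for a decoding CSV (column names are assumptions).
    sample = io.StringIO(
        "actual,greedy,prefix\n"
        "hello world,hello word,hello world\n"
    )
    print(load_pairs(sample, "actual", "prefix"))
```

For a file on disk, pass a handle opened with `open(path, newline="", encoding="utf-8")` so that Gujarati text is read correctly.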
Our algorithm achieves the following performance:
| Technique name | WER (%) reduction |
|---|---|
| Prefix with LMs | 2.42 |
| Prefix with LMs + Spell Corrector BERT | 5.11 |
Note:
- These WER reductions are measured against greedy decoding.
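For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation code of this repository, and it does not reproduce the numbers in the table; the function name `word_error_rate` is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so
    # far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,                      # deletion
                    current_row[j - 1] + 1,                   # insertion
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.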
The prefix decoding code is based on open-source implementations 1 and 2. The code for the BERT-based spell corrector is adapted from this open-source implementation.
Licensed under the MIT License.