MahaVed is a comprehensive collection of datasets aimed at fostering advancements in speech technology. Below are curated datasets specifically tailored for Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) applications, supporting a wide range of Indian languages.
The following datasets are ideal for developing and testing Text-to-Speech (TTS) systems:
Name | Status | Owner | Language | Size | Link |
---|---|---|---|---|---|
Kathbath | Ready to Use | AI4Bharat | Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu | 1684 hrs | Link |
Shrutilipi | Ready to Use | AI4Bharat | Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu | 6457 hrs | Link |
Indian Language Corpora | Ready to Use | IIT Bombay | Marathi | 109 hrs | Link |
Below is a collection of datasets suitable for Automatic Speech Recognition (ASR) system development, encompassing a variety of Indian languages and speech types:
Dataset | # Languages | Hours | # Speakers | # Domains | Type |
---|---|---|---|---|---|
CommonVoice[8] | 8 | - | 373 | 4 | Extempore |
FLEURS[17] | 13 | - | 163 | - | Read |
MSR[18] | 3 | 150 | 1286 | 1 | Conversation |
OpenSLR[19] | 3 | 618 | 1513 | - | Read |
CMS[20] | 6 | 35 | 243 | 1 | Conversation |
MUCS[21] | 3 | 351 | 158 | 4 | Extempore |
Kathbath[22] | 12 | 1684 | 1218 | 3 | Read |
Shrutilipi[23] | 12 | 6457 | - | - | Read |
Graamvaani | 1 | 1000 | 108 | - | Conversation |
IISc-Mile[24, 25] | 2 | 500 | 1446 | - | Read |
KDC | 1 | 1 | 43 | - | Conversation |
Vaksañcayah[26] | 1 | 78 | 27 | 8 | Extempore |
IIITH-ISD[27] | 7 | 11 | 35 | 1 | Read |
IITB-MSC[28] | 1 | 109 | 36 | 1 | Extempore |
SMC-MSC | 1 | 2 | 75 | 4 | - |
IITM | 3 | 690 | - | - | Read |
NPTEL[29] | 8 | 857 | - | 1 | Read |
IndicTTS[30] | 13 | 225 | 25 | 4 | Extempore |
Svarah[31] | 1 | 10 | 117 | 37 | Read/Extempore |
SPRING-INX[32] | 10 | 2005 | 7609 | 16 | Read/Extempore |
SPIRE-SIES[33] | 1 | 171 | 23 | 1607 | Conversation |
IndicVoices | 22 | 7348 | 1639 | 16237 | Read/Extempore/Conversation |
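
The table above can also be filtered programmatically when shortlisting datasets for a project. The following is a minimal sketch, not a tool shipped with MahaVed: it assumes a few rows have been copied from the table as pipe-delimited text, and the names `TABLE_EXCERPT`, `parse_rows`, and `select` are hypothetical helpers introduced only for illustration.

```python
# A few rows copied from the ASR table above.
# "-" marks a value the table does not report.
TABLE_EXCERPT = """\
MSR[18] | 3 | 150 | 1286 | 1 | Conversation |
Kathbath[22] | 12 | 1684 | 1218 | 3 | Read |
Shrutilipi[23] | 12 | 6457 | - | - | Read |
FLEURS[17] | 13 | - | 163 | - | Read |
"""


def parse_rows(text):
    """Turn pipe-delimited table rows into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        # Drop the trailing pipe, then split the row into its six cells.
        cells = [cell.strip() for cell in line.strip().rstrip("|").split("|")]
        name, languages, hours, speakers, domains, speech_type = cells
        rows.append({
            "name": name,
            "languages": int(languages),
            # Unreported hours are stored as None rather than guessed.
            "hours": None if hours == "-" else int(hours),
            "type": speech_type,
        })
    return rows


def select(rows, speech_type, min_hours):
    """Return names of datasets of the given type with at least min_hours."""
    return [
        row["name"]
        for row in rows
        if row["type"] == speech_type
        and row["hours"] is not None
        and row["hours"] >= min_hours
    ]


rows = parse_rows(TABLE_EXCERPT)
print(select(rows, "Read", 1000))
```

Rows whose hours are not reported are skipped by the filter instead of being treated as zero, so a missing value is never mistaken for a small dataset.
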
MahaVed is an open-source initiative that aims to advance speech technologies by providing a comprehensive catalogue of datasets. We welcome contributions from researchers, developers, and language enthusiasts worldwide. If you have a dataset that you'd like to share, or if you're interested in collaborating to expand the existing datasets, here's how you can contribute:
- Submit New Datasets: If you have access to speech datasets, especially in underrepresented languages, and would like to include them in the MahaVed collection, please reach out to us with details about the dataset.
- Documentation: Assist in documenting the datasets and contributing to tutorials or guides that help users understand and utilize the datasets effectively.
To get started with contributing to MahaVed, please follow these steps:
- Explore Existing Datasets: Familiarize yourself with the datasets currently listed in MahaVed to understand the variety and scope of contributions.
- Identify Gaps: Look for gaps in the collection where your contributions can make a significant impact, whether it's adding a new language, enhancing dataset quality, or providing technical tools.
- Contact Us: Reach out to us by creating an issue on our GitHub repository or by sending an email to hi@dubverse.ai. Please include details about your proposed contribution and how it aligns with the goals of MahaVed.
- Follow Contribution Guidelines: Please adhere to our contribution guidelines, which are detailed in the CONTRIBUTING.md file in our repository, to ensure a smooth collaboration process.
Together, we can build a rich resource that accelerates innovation and inclusivity in speech technologies across a multitude of languages. Your contributions are invaluable to making MahaVed a truly comprehensive resource for the global research and development community.