dubverse-ai/MahaVed

MahaVed Dataset Collection

MahaVed is a comprehensive collection of datasets aimed at fostering advancements in speech technology. Below are curated datasets specifically tailored for Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) applications, supporting a wide range of Indian languages.

Text-to-Speech (TTS) Datasets

The following datasets are ideal for developing and testing Text-to-Speech (TTS) systems:

| Name | Status | Owner | Languages | Size | Link |
| --- | --- | --- | --- | --- | --- |
| Kathbath | Ready to Use | AI4Bharat | Bengali, Gujarati, Kannada, Hindi, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu | 1684 hrs | Link |
| Shrutilipi | Ready to Use | AI4Bharat | Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu | 6457 hrs | Link |
| Indian Language Corpora | Ready to Use | IIT Bombay | Marathi | 109 hrs | Link |

Automatic Speech Recognition (ASR) Datasets

Below is a collection of datasets suitable for Automatic Speech Recognition (ASR) system development, encompassing a variety of Indian languages and speech types:

| Dataset | # Languages | Hours | # Speakers | # Domains | Type |
| --- | --- | --- | --- | --- | --- |
| CommonVoice [8] | 8 | - | 373 | 4 | Extempore |
| FLEURS [17] | 13 | - | 163 | - | Read |
| MSR [18] | 3 | 150 | 1286 | 1 | Conversation |
| OpenSLR [19] | 3 | 618 | 1513 | - | Read |
| CMS [20] | 6 | 35 | 243 | 1 | Conversation |
| MUCS [21] | 3 | 351 | 158 | 4 | Extempore |
| Kathbath [22] | 12 | 1684 | 1218 | 3 | Read |
| Shrutilipi [23] | 12 | 6457 | - | - | Read |
| Graamvaani² | 1 | 1000 | 108 | - | Conversation |
| IISc-Mile [24, 25] | 2 | 500 | 1446 | - | Read |
| KDC³ | 1 | 1 | 43 | - | Conversation |
| Vaksañcayah [26] | 1 | 78 | 27 | 8 | Extempore |
| IIITH-ISD [27] | 7 | 11 | 35 | 1 | Read |
| IITB-MSC [28] | 1 | 109 | 36 | 1 | Extempore |
| SMC-MSC⁴ | 1 | 2 | 75 | 4 | - |
| IITM⁵ | 3 | 690 | - | - | Read |
| NPTEL [29] | 8 | 857 | - | 1 | Read |
| IndicTTS [30] | 13 | 225 | 25 | 4 | Extempore |
| Svarah [31] | 1 | 10 | 117 | 37 | Read/Extempore |
| SPRING-INX [32] | 10 | 2005 | 7609 | 16 | Read/Extempore |
| SPIRE-SIES [33] | 1 | 171 | 23 | 1607 | Conversation |
| IndicVoices | 22 | 7348 | 1639 | 16237 | Read/Extempore/Conversation |
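
When shortlisting datasets from the table above, it can help to treat the rows as records and filter them by size and speech type. The following is a minimal sketch only: the `AsrDataset` class, the `CATALOGUE` list, and the `select` helper are hypothetical names introduced for illustration and are not part of the MahaVed repository. The five rows are copied from the table, with a missing hour count ("-") stored as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AsrDataset:
    """One row of the ASR table (hypothetical structure, for illustration)."""
    name: str
    languages: int
    hours: Optional[int]  # None where the table shows "-"
    speech_type: str      # e.g. "Read" or "Read/Extempore"


# A handful of rows copied from the ASR table above.
CATALOGUE = [
    AsrDataset("Kathbath", 12, 1684, "Read"),
    AsrDataset("Shrutilipi", 12, 6457, "Read"),
    AsrDataset("FLEURS", 13, None, "Read"),
    AsrDataset("MSR", 3, 150, "Conversation"),
    AsrDataset("IndicVoices", 22, 7348, "Read/Extempore/Conversation"),
]


def select(catalogue: List[AsrDataset],
           min_hours: int = 0,
           speech_type: Optional[str] = None) -> List[str]:
    """Return dataset names with a known size of at least min_hours, largest first.

    Rows without an hour count are skipped. If speech_type is given, a row
    matches when that type is one of its "/"-separated types.
    """
    matches = [
        d for d in catalogue
        if d.hours is not None
        and d.hours >= min_hours
        and (speech_type is None or speech_type in d.speech_type.split("/"))
    ]
    matches.sort(key=lambda d: d.hours, reverse=True)
    return [d.name for d in matches]


if __name__ == "__main__":
    print(select(CATALOGUE, min_hours=1000))
    print(select(CATALOGUE, speech_type="Conversation"))
```

Rows with an unknown hour count are excluded rather than counted as zero, so a dataset such as FLEURS will not appear in a size-based shortlist even when `min_hours` is 0.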

Contribute to MahaVed

MahaVed is an open-source initiative that aims to advance speech technologies by providing a comprehensive catalogue of datasets. We welcome contributions from researchers, developers, and language enthusiasts worldwide. If you have a dataset that you'd like to share, or if you're interested in collaborating to expand the existing datasets, here's how you can contribute:

How to Contribute

  • Submit New Datasets: If you have access to speech datasets, especially in underrepresented languages, and would like to include them in the MahaVed collection, please reach out to us with details about the dataset.

  • Documentation: Assist in documenting the datasets and contributing to tutorials or guides that help users understand and utilize the datasets effectively.

Getting Started

To get started with contributing to MahaVed, please follow these steps:

  1. Explore Existing Datasets: Familiarize yourself with the datasets currently listed in MahaVed to understand the variety and scope of contributions.

  2. Identify Gaps: Look for gaps in the collection where your contributions can make a significant impact, whether it's adding a new language, enhancing dataset quality, or providing technical tools.

  3. Contact Us: Reach out to us by creating an issue on our GitHub repository or by sending an email to hi@dubverse.ai. Please include details about your proposed contribution and how it aligns with the goals of MahaVed.

  4. Follow Contribution Guidelines: Please adhere to our contribution guidelines, which are detailed in the CONTRIBUTING.md file in our repository, to ensure a smooth collaboration process.

Together, we can build a rich resource that accelerates innovation and inclusivity in speech technologies across a multitude of languages. Your contributions are invaluable to making MahaVed a truly comprehensive resource for the global research and development community.

About

Collection of open source speech datasets
