Add multilingual dataset eval support #26

Merged: 7 commits, Dec 5, 2024

Conversation

arda-argmax
Contributor

After installing whisperkittools, run the following command to evaluate a model on the Common Voice 17 dataset:

whisperkit-evaluate-model --model-version <model_version> --output-dir <output_dir> --evaluation-dataset common_voice_17_0-argmax_subset-400 --pipeline WhisperKit --num-proc 1

Example CLI command:

whisperkit-evaluate-model --model-version openai/whisper-tiny --output-dir out/ --evaluation-dataset common_voice_17_0-argmax_subset-400 --pipeline WhisperKit --num-proc 1

Additional args:

  --force-language      If specified, forces the language in each data sample
                        (if available)
  --language-subset LANGUAGE_SUBSET
                        If specified, filters the dataset for the given
                        language

To evaluate only the en subset of the dataset, add the --language-subset en CLI argument.
To force the transcription task and language, add --language-subset <lang> --force-language.
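
For scripted runs, the flag combinations above can be assembled programmatically. The following is a minimal sketch, not part of whisperkittools: `build_eval_command` is a hypothetical helper name, and it only uses the flags and values quoted in this description. It prints the command rather than running it.

```python
import shlex
from typing import List, Optional

# Dataset name taken verbatim from the command in this PR description.
DATASET = "common_voice_17_0-argmax_subset-400"


def build_eval_command(
    model_version: str,
    output_dir: str,
    language_subset: Optional[str] = None,
    force_language: bool = False,
) -> List[str]:
    """Assemble a whisperkit-evaluate-model argument list (hypothetical helper)."""
    cmd = [
        "whisperkit-evaluate-model",
        "--model-version", model_version,
        "--output-dir", output_dir,
        "--evaluation-dataset", DATASET,
        "--pipeline", "WhisperKit",
        "--num-proc", "1",
    ]
    # Optional: filter the dataset down to a single language.
    if language_subset is not None:
        cmd += ["--language-subset", language_subset]
    # Optional: force the language of each data sample during transcription.
    if force_language:
        cmd.append("--force-language")
    return cmd


if __name__ == "__main__":
    cmd = build_eval_command(
        "openai/whisper-tiny", "out/", language_subset="en", force_language=True
    )
    # Print the shell-quoted command for inspection.
    print(shlex.join(cmd))
    # To launch the evaluation (requires whisperkittools to be installed):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Building the command as a list instead of a single string avoids shell-quoting issues if a model version or output path ever contains spaces.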

@arda-argmax arda-argmax requested a review from atiorh December 4, 2024 00:07
@arda-argmax arda-argmax force-pushed the arda/multilingual_eval branch from 19f13b0 to a4360f2 Compare December 4, 2024 18:34
@atiorh atiorh merged commit 6ce531c into main Dec 5, 2024
1 check passed