Add multilingual dataset eval support #26

arda-argmax · 2024-12-04T00:07:51Z

After installing whisperkittools, run the following command to evaluate a model on the Common Voice 17 dataset:

whisperkit-evaluate-model --model-version <model_version> --output-dir <output_dir> --evaluation-dataset common_voice_17_0-argmax_subset-400 --pipeline WhisperKit --num-proc 1

Example CLI command:

whisperkit-evaluate-model --model-version openai/whisper-tiny --output-dir out/ --evaluation-dataset common_voice_17_0-argmax_subset-400 --pipeline WhisperKit --num-proc 1

Additional args:

  --force-language      If specified, forces the language in each data sample
                        (if available)
  --language-subset LANGUAGE_SUBSET
                        If specified, filters the dataset for the given
                        language

To evaluate only the en subset of the dataset, add the --language-subset en CLI arg.
To force the transcription task and language, add --language-subset <lang> --force-language.
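
For scripted runs over several languages, the command can also be assembled programmatically. The following is a minimal sketch, not part of whisperkittools: build_eval_command is a hypothetical helper, and it only uses the flags shown above. It assumes the whisperkit-evaluate-model entry point is on PATH after installation.

```python
import shlex


def build_eval_command(model_version, output_dir, language_subset=None, force_language=False):
    """Assemble the argv list for whisperkit-evaluate-model.

    Hypothetical helper for illustration; only flags listed in this PR are used.
    """
    cmd = [
        "whisperkit-evaluate-model",
        "--model-version", model_version,
        "--output-dir", output_dir,
        "--evaluation-dataset", "common_voice_17_0-argmax_subset-400",
        "--pipeline", "WhisperKit",
        "--num-proc", "1",
    ]
    # Optionally filter the dataset down to a single language.
    if language_subset is not None:
        cmd += ["--language-subset", language_subset]
    # Optionally force the language of each data sample (if available).
    if force_language:
        cmd.append("--force-language")
    return cmd


cmd = build_eval_command(
    "openai/whisper-tiny", "out/", language_subset="en", force_language=True
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the evaluation; building a list rather than a single string avoids shell quoting issues in model names or paths.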

arda-argmax requested a review from atiorh December 4, 2024 00:07

add multilingual dataset eval support

a4360f2

arda-argmax force-pushed the arda/multilingual_eval branch from 19f13b0 to a4360f2 Compare December 4, 2024 18:34

arda-argmax added 6 commits December 4, 2024 10:54

add evaluate unit test

9dbbe44

skip speed tests

6551927

add folder evaluate unit test

752fff1

lint fix

0653285

lint fix and inference context fix

e56cd0a

fix model download and vad fail

4e3610e

atiorh approved these changes Dec 5, 2024

atiorh merged commit 6ce531c into main Dec 5, 2024
1 check passed
