Skip to content

Better WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

License

Notifications You must be signed in to change notification settings

federicotorrielli/BetterWhisperX

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

BetterWhisperX

This is a fork of WhisperX with some improvements. Unfortunatelly, the original author is not maintaining the project anymore. I'm working on this project to keep it alive and improve it in my free time, but I can't promise anything!

Warning 🚨

The ctranslate2 version was updated to 4.5.0, which is not compatible with cuda and cudnn 8.x. If you are using a GOOGLE COLAB GPU, please install ctranslate2 4.4.0. In any other case, please install the latest cuda and cudnn versions on your environment.

If you still have problems, set the LD_LIBRARY_PATH to the cuda and cudnn paths, like this:

export LD_LIBRARY_PATH="/path/to/.venv/lib/python3.12/site-packages/nvidia/cudnn/lib

or, using fish:

set -gx LD_LIBRARY_PATH "/path/to/.venv/lib/python3.12/site-packages/nvidia/cudnn/lib"

since I'm using mamba, the libraries are located in /path/to/mambaforge/envs/name_of_env/lib

whisperx-arch

This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.

  • ⚑️ Batched inference for 70x realtime transcription using whisper large-v2
  • πŸͺΆ faster-whisper backend, requires <8GB gpu memory for large-v2 with beam_size=5
  • 🎯 Accurate word-level timestamps using wav2vec2 alignment
  • πŸ‘―β€β™‚οΈ Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
  • πŸ—£οΈ VAD preprocessing, reduces hallucination & batching with no WER degradation

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. Whilst it does produces highly accurate transcriptions, the corresponding timestamps are at the utterance-level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.

Phoneme-Based ASR A suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is wav2vec2.0.

Forced Alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone level segmentation.

Voice Activity Detection (VAD) is the detection of the presence or absence of human speech.

Speaker Diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.

Setup βš™οΈ

Tested for PyTorch 2.0, Python 3.10 (use other versions at your own risk!)

1. Create Python3.10 environment

mamba create -n whisperx python=3.10

mamba activate whisperx

2. Install CUDA and cuDNN

mamba install cuda cudnn

3. Install this repo

pip install git+https://github.com/federicotorrielli/BetterWhisperX.git

If already installed, update package to most recent commit

pip install git+https://github.com/federicotorrielli/BetterWhisperX.git --upgrade

If wishing to modify this package, clone and install in editable mode:

git clone https://github.com/federicotorrielli/BetterWhisperX.git
cd BetterWhisperX
pip install -e .

You may also need to install ffmpeg, rust etc. Follow openAI instructions here.

Speaker Diarization

To enable Speaker Diarization, include your Hugging Face access token (read) that you can generate from Here after the --hf_token argument and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1 (if you choose to use Speaker-Diarization 2.x, follow requirements here instead.)

Usage πŸ’¬ (command line)

English

Run whisper on example segment (using default params, whisper small) add --highlight_words True to visualise word timings in the .srt file.

whisperx examples/sample01.wav

Result using WhisperX with forced alignment to wav2vec2.0 large:

sample01.mp4

Compare this to original whisper out the box, where many transcriptions are out of sync:

sample_whisper_og.mov

For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.

whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4

To label the transcript with speaker ID's (set number of speakers if known e.g. --min_speakers 2 --max_speakers 2):

whisperx examples/sample01.wav --model large-v2 --diarize --highlight_words True

To run on CPU instead of GPU (and for running on Mac OS X):

whisperx examples/sample01.wav --compute_type int8

Other languages

The phoneme ASR alignment model is language-specific, for tested languages these models are automatically picked from torchaudio pipelines or huggingface. Just pass in the --language code, and use the whisper --model large.

Currently default models provided for {en, fr, de, es, it, ja, zh, nl, uk, pt}. If the detected language is not in this list, you need to find a phoneme-based ASR model from huggingface model hub and test it on your data.

E.g. German

whisperx --model large-v2 --language de examples/sample_de_01.wav
sample_de_01_vis.mov

See more examples in other languages here.

Python usage 🐍

import whisperx
import gc

device = "cuda"
audio_file = "audio.mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model_a

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

Technical Details πŸ‘·β€β™‚οΈ

For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint paper.

To reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):

  1. reduce batch size, e.g. --batch_size 4
  2. use a smaller ASR model --model base
  3. Use lighter compute type --compute_type int8

Transcription differences from openai's whisper:

  1. Transcription without timestamps. To enable single pass batching, whisper inference is performed --without_timestamps True, this ensures 1 forward pass per sample in the batch. However, this can cause discrepancies the default whisper output.
  2. VAD-based segment transcription, unlike the buffered transcription of openai's. In Wthe WhisperX paper we show this reduces WER, and enables accurate batched inference
  3. --condition_on_prev_text is set to False by default (reduces hallucination)

Limitations ⚠️

  • Transcript words which do not contain characters in the alignment models dictionary e.g. "2014." or "Β£13.60" cannot be aligned and therefore are not given a timing.
  • Overlapping speech is not handled particularly well by whisper nor whisperx
  • Diarization is far from perfect
  • Language specific wav2vec2 model is needed

Contribute πŸ§‘β€πŸ«

If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good send a pull request and some examples showing its success.

Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.

TODO πŸ—“

  • Multilingual init

  • Automatic align model selection based on language detection

  • Python usage

  • Incorporating speaker diarization

  • Model flush, for low gpu mem resources

  • Faster-whisper backend

  • Add max-line etc. see (openai's whisper utils.py)

  • Sentence-level segments (nltk toolbox)

  • Improve alignment logic

  • update examples with diarization and word highlighting

  • Subtitle .ass output <- bring this back (removed in v3)

  • Add benchmarking code (TEDLIUM for spd/WER & word segmentation)

  • Allow silero-vad as alternative VAD option

  • Improve diarization (word level). Harder than first thought...

Citation πŸ“š

If you use this in your research, please cite the paper that started this project:
@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}

About

Better WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%