-
I have some inputs in Norwegian with 50 minute long segments with no speech and only background noise (a lunchbreak). Whisper falls into a loop repeating the last(?) segment spoken and then after the speech resumes it never breaks out of it. Same behavior in translate and transcribe mode (though the output it repeats is different). Truncating off the non-speech segment and restarting gets me a sensible transcription. I can put up a test file if it would be useful, though I assume the behavior is easily reproducible.
-
I get similar behavior with no-speech, background-noise-only periods in English and Spanish audio. Sometimes it produces comically nonsensical transcriptions: one quiet period yielded "I hope you liked the video, don't forget to leave a like and subscribe to the channel, thank you very much for your attention and see you in the next video," despite this never being said. Interesting that humans can also hear things when there is too much silence.
-
I got the same problem. I think the predicted past tokens are used as the prompt (Lines 220 to 222 in 8cf36f3; see also Line 162 in 8cf36f3), so past strange outputs may have caused more strange outputs.
-
This is one of the limitations of the current hacky approach to long-form transcription. The VAD output from the model is not very accurate, and the predicted `no_speech_prob` is often not a reliable predictor of voice activity. I chose the default `no_speech_threshold` of 0.6, which worked okay for a few datasets that I tested with, but different combinations of `--compression_ratio_threshold`, `--logprob_threshold`, and `--no_speech_threshold` values might be needed depending on the audio. There's also a hard-coded constant, as @shirayu mentioned, which determines whether the text from the previous window gets fed as the prompt (whisper/transcribe.py, Lines 220 to 222 in 2d3032d). This is supposed to help the text flow more naturally between window boundaries, but it also makes the decoding more prone to repetition looping. I'll plan to document this limitation more clearly in README.md.
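As a rough illustration of how those two thresholds interact, here is a simplified pure-Python sketch of the skip heuristic (not the actual source; the function name is made up, and the defaults follow the flag values mentioned above):

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Sketch of Whisper's silence-skipping heuristic.

    A window is treated as non-speech when the model's no_speech
    probability is high, unless the decoder is also confident in the
    text it produced (average log-probability above the threshold).
    """
    if no_speech_threshold is None:
        return False
    skip = no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        # confident text overrides the no-speech signal
        skip = False
    return skip

# quiet window: high no_speech_prob, low-confidence text -> skipped
print(should_skip_segment(0.9, -1.5))   # True
# high no_speech_prob, but confident text keeps the segment
print(should_skip_segment(0.9, -0.3))   # False
```

Lowering `no_speech_threshold` makes more windows count as silence; setting `logprob_threshold` to `None` removes the confidence override entirely.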
-
I've managed to mostly solve this by segmenting the input audio with Silero VAD. Here's my implementation:
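The commenter's script isn't inlined above, but the general recipe is: run Silero VAD to get speech timestamps, pad and merge them into chunks, then transcribe each chunk separately so Whisper never sees a long stretch of pure noise. A minimal sketch of the merge step (the timestamp format follows Silero VAD's `get_speech_timestamps(..., return_seconds=True)` output; the padding and gap values are arbitrary):

```python
def merge_speech_segments(timestamps, pad=0.2, max_gap=1.0):
    """Merge Silero-VAD speech timestamps into transcription chunks.

    timestamps: sorted list of {'start': s, 'end': s} dicts in seconds.
    Segments separated by less than max_gap are merged, and each chunk
    is padded by `pad` seconds so word onsets are not clipped.
    """
    chunks = []
    for ts in timestamps:
        start, end = max(0.0, ts['start'] - pad), ts['end'] + pad
        if chunks and start - chunks[-1][1] < max_gap:
            chunks[-1][1] = end          # close enough: extend last chunk
        else:
            chunks.append([start, end])  # otherwise start a new chunk
    return [tuple(c) for c in chunks]

# two nearby utterances merge; the distant one stays separate
print(merge_speech_segments([{'start': 1.0, 'end': 2.0},
                             {'start': 2.5, 'end': 4.0},
                             {'start': 60.0, 'end': 61.0}]))
# [(0.8, 4.2), (59.8, 61.2)]
```

Each `(start, end)` chunk can then be sliced out of the waveform and passed to `model.transcribe` on its own, sidestepping the long silent stretch entirely.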
-
FWIW, I've found that the current code (vs the day 1 release) works much better on my inputs and is now able to transcribe complete recordings that the original code choked on.
-
I have the same issue with multilingual audio files on the CLI version of Whisper: if there is any significant silence, Whisper does not resume transcribing and instead repeats the last thing said forever. For whatever reason, this does not occur if I call Whisper from within a Python program, so to get around it I created a little program that calls the Python version of Whisper from the command line. I know next to nothing about Python, so there is very likely an easier option, but in the meantime this does work. I call it WhisperDO. WhisperDO.py:
It can be called as `python whisperDO.py --args` and accepts all the same inputs as the CLI version (as far as I can tell).
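The WhisperDO source isn't inlined above, but a wrapper like it only needs to forward command-line flags into `whisper.transcribe`. A hypothetical minimal sketch (the function name, file name `meeting.mp3`, and the small subset of flags are made up for illustration; the flag names mirror the real CLI):

```python
import argparse

def parse_cli(argv):
    """Parse a Whisper-style command line into a kwargs dict.

    Only a handful of flags are shown; the real CLI has many more.
    """
    parser = argparse.ArgumentParser(description='call Whisper via its Python API')
    parser.add_argument('audio', help='input audio file')
    parser.add_argument('--model', default='small')
    parser.add_argument('--language', default=None)
    parser.add_argument('--task', default='transcribe',
                        choices=['transcribe', 'translate'])
    parser.add_argument('--condition_on_previous_text', default=True,
                        type=lambda s: s.lower() != 'false')
    return vars(parser.parse_args(argv))

# With the options parsed, the actual work is just two library calls
# (requires the openai-whisper package):
#   model = whisper.load_model(opts.pop('model'))
#   result = model.transcribe(opts.pop('audio'), **opts)
#   print(result['text'])
opts = parse_cli(['meeting.mp3', '--task', 'translate'])
print(opts['task'])  # translate
```

Going through the Python API like this also makes it easy to hard-wire options such as `condition_on_previous_text=False` that help with the silence-looping issue.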
-
I wonder: is this an issue that can be fixed in Whisper itself, or do we have to resort to using something like Silero VAD?
-
You can try this script: whisper-vad.
-
For those who wonder how one can adjust `no_speech_threshold` and `logprob_threshold` from the Python API:

```python
import whisper

model = whisper.load_model('tiny')
options = whisper.DecodingOptions().__dict__.copy()
options['no_speech_threshold'] = 0.275
options['logprob_threshold'] = None
result = model.transcribe('/path/to/file', **options, verbose=False)
```

A full example in action can be seen at: whisper_transcriber.py.
-
I've found that if one adds `--condition_on_previous_text False`, the repetition loop goes away. This works because of this check in `transcribe.py`:

```python
if not condition_on_previous_text or result.temperature > 0.5:
    # do not feed the prompt tokens if a high temperature was used
    prompt_reset_since = len(all_tokens)
```

The updated `prompt_reset_since` then makes the next text generation unaffected by the preceding text.
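A toy illustration of the effect (made-up token IDs, not Whisper source; the slicing mirrors how the prompt for the next window is built from the running token list):

```python
# The prompt fed to the next window is all_tokens[prompt_reset_since:].
all_tokens = [101, 102, 103, 104]   # everything decoded so far
prompt_reset_since = 0

# Normally the next window is conditioned on all previous text:
print(all_tokens[prompt_reset_since:])   # [101, 102, 103, 104]

# After the reset (condition_on_previous_text=False, or temperature > 0.5),
# the slice is empty, so the next window decodes with no text prompt:
prompt_reset_since = len(all_tokens)
print(all_tokens[prompt_reset_since:])   # []
```

An empty prompt means a hallucinated or looping segment cannot contaminate the windows that follow it.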
-
The "silence equals hallucinations" problem is very common and very problematic. And yes, I had `--condition_on_previous_text False` as an option too. Look at the example below!