-
I have some inputs in Norwegian with 50 minute long segments with no speech and only background noise (a lunchbreak). Whisper falls into a loop repeating the last(?) segment spoken and then after the speech resumes it never breaks out of it. Same behavior in translate and transcribe mode (though the output it repeats is different). Truncating off the non-speech segment and restarting gets me a sensible transcription. I can put up a test file if it would be useful, though I assume the behavior is easily reproducible.
-
I get similar behavior with no-speech, background-noise-only periods in English and Spanish audio. Sometimes it produces comically nonsensical transcriptions: one quiet period yielded "I hope you liked the video, don't forget to leave a like and subscribe to the channel, thank you very much for your attention and see you in the next video," despite this never being said. Interesting that humans can also hear things when there is too much silence.
-
I got the same problem. I think the predicted past tokens are used as the prompt (Lines 220 to 222 in 8cf36f3; see also Line 162 in 8cf36f3), so past strange outputs may have caused more strange outputs.
-
This is one of the limitations of the current hacky approach to long-form transcription. The VAD output from the model is not very accurate, and the predicted `no_speech_prob` is often not a reliable predictor of voice activity. I chose the default `no_speech_threshold` of 0.6, which worked okay for a few datasets that I tested with, but different combinations of `--compression_ratio_threshold`, `--logprob_threshold`, and `--no_speech_threshold` values might be needed depending on the audio. There's also a hard-coded constant, as @shirayu mentioned, which determines whether the text from the previous window gets fed as the prompt (whisper/transcribe.py, Lines 220 to 222 in 2d3032d). This is supposed to help the text flow more naturally between window boundaries, but it also makes the decoding more prone to repetition looping. I'll plan to document this limitation more clearly in README.md.
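As a rough illustration of how those two thresholds interact, here is a simplified pure-Python sketch of the skip heuristic (not the actual source; the function name is made up, and the defaults follow the flag values mentioned above):

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Sketch of Whisper's silence-skipping heuristic.

    A window is treated as non-speech when the model's no_speech
    probability is high, unless the decoder is also confident in the
    text it produced (average log-probability above the threshold).
    """
    if no_speech_threshold is None:
        return False
    skip = no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        # confident text overrides the no-speech signal
        skip = False
    return skip

# quiet window: high no_speech_prob, low-confidence text -> skipped
print(should_skip_segment(0.9, -1.5))   # True
# high no_speech_prob, but confident text keeps the segment
print(should_skip_segment(0.9, -0.3))   # False
```

Lowering `no_speech_threshold` makes more windows count as silence; setting `logprob_threshold` to `None` removes the confidence override entirely.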
-
I've managed to mostly solve this by segmenting the input audio with Silero VAD. Here's my implementation:
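The commenter's script isn't inlined above, but the general recipe is: run Silero VAD to get speech timestamps, pad and merge them into chunks, then transcribe each chunk separately so Whisper never sees a long stretch of pure noise. A minimal sketch of the merge step (the timestamp format follows Silero VAD's `get_speech_timestamps(..., return_seconds=True)` output; the padding and gap values are arbitrary):

```python
def merge_speech_segments(timestamps, pad=0.2, max_gap=1.0):
    """Merge Silero-VAD speech timestamps into transcription chunks.

    timestamps: sorted list of {'start': s, 'end': s} dicts in seconds.
    Segments separated by less than max_gap are merged, and each chunk
    is padded by `pad` seconds so word onsets are not clipped.
    """
    chunks = []
    for ts in timestamps:
        start, end = max(0.0, ts['start'] - pad), ts['end'] + pad
        if chunks and start - chunks[-1][1] < max_gap:
            chunks[-1][1] = end          # close enough: extend last chunk
        else:
            chunks.append([start, end])  # otherwise start a new chunk
    return [tuple(c) for c in chunks]

# two nearby utterances merge; the distant one stays separate
print(merge_speech_segments([{'start': 1.0, 'end': 2.0},
                             {'start': 2.5, 'end': 4.0},
                             {'start': 60.0, 'end': 61.0}]))
# [(0.8, 4.2), (59.8, 61.2)]
```

Each `(start, end)` chunk can then be sliced out of the waveform and passed to `model.transcribe` on its own, sidestepping the long silent stretch entirely.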
-
FWIW, I've found that the current code (vs the day 1 release) works much better on my inputs and is now able to transcribe complete recordings that the original code choked on.
-
I have the same issue with multilingual audio files on the CLI version of Whisper: if there is any significant silence, Whisper does not resume transcribing and instead repeats the last thing said forever. For whatever reason, this does not occur if I call Whisper from within a Python program, so to get around it I created a little program that calls the Python version of Whisper from the command line. I know next to nothing about Python, so there is very likely an easier option, but in the meantime this does work. I call it WhisperDO. WhisperDO.py:
It can be called as `python whisperDO.py --args` and accepts all the same inputs as the CLI version (as far as I can tell).
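The WhisperDO source isn't inlined above, but a wrapper like it only needs to forward command-line flags into `whisper.transcribe`. A hypothetical minimal sketch (the function name, file name `meeting.mp3`, and the small subset of flags are made up for illustration; the flag names mirror the real CLI):

```python
import argparse

def parse_cli(argv):
    """Parse a Whisper-style command line into a kwargs dict.

    Only a handful of flags are shown; the real CLI has many more.
    """
    parser = argparse.ArgumentParser(description='call Whisper via its Python API')
    parser.add_argument('audio', help='input audio file')
    parser.add_argument('--model', default='small')
    parser.add_argument('--language', default=None)
    parser.add_argument('--task', default='transcribe',
                        choices=['transcribe', 'translate'])
    parser.add_argument('--condition_on_previous_text', default=True,
                        type=lambda s: s.lower() != 'false')
    return vars(parser.parse_args(argv))

# With the options parsed, the actual work is just two library calls
# (requires the openai-whisper package):
#   model = whisper.load_model(opts.pop('model'))
#   result = model.transcribe(opts.pop('audio'), **opts)
#   print(result['text'])
opts = parse_cli(['meeting.mp3', '--task', 'translate'])
print(opts['task'])  # translate
```

Going through the Python API like this also makes it easy to hard-wire options such as `condition_on_previous_text=False` that help with the silence-looping issue.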
-
I wonder: is this an issue that can be fixed in Whisper itself, or do we have to resort to using something like Silero VAD?
-
You can try this script: whisper-vad.
-
For those who wonder how one can adjust `no_speech_threshold` and `logprob_threshold` from the Python API:

```python
import whisper

model = whisper.load_model('tiny')
options = whisper.DecodingOptions().__dict__.copy()
options['no_speech_threshold'] = 0.275
options['logprob_threshold'] = None
result = model.transcribe('/path/to/file', **options, verbose=False)
```

A full example in action can be seen at: whisper_transcriber.py.
-
I've found that if one adds `--condition_on_previous_text False`, the repetition loop goes away. This works because of this check in `transcribe.py`:

```python
if not condition_on_previous_text or result.temperature > 0.5:
    # do not feed the prompt tokens if a high temperature was used
    prompt_reset_since = len(all_tokens)
```

The updated `prompt_reset_since` then makes the next text generation unaffected by the preceding text.
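A toy illustration of the effect (made-up token IDs, not Whisper source; the slicing mirrors how the prompt for the next window is built from the running token list):

```python
# The prompt fed to the next window is all_tokens[prompt_reset_since:].
all_tokens = [101, 102, 103, 104]   # everything decoded so far
prompt_reset_since = 0

# Normally the next window is conditioned on all previous text:
print(all_tokens[prompt_reset_since:])   # [101, 102, 103, 104]

# After the reset (condition_on_previous_text=False, or temperature > 0.5),
# the slice is empty, so the next window decodes with no text prompt:
prompt_reset_since = len(all_tokens)
print(all_tokens[prompt_reset_since:])   # []
```

An empty prompt means a hallucinated or looping segment cannot contaminate the windows that follow it.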
-
The "silence equals hallucinations" problem is very common and very problematic. And yes, I had `--condition_on_previous_text False` as an option too. Look at the example below!