
Bugfix: Illogical "Avoid computing higher temperatures on no_speech" #1903

Merged · 4 commits · Dec 1, 2024

Conversation

@Purfview (Contributor) commented Dec 17, 2023

Bugfix for #1279

The bug: in #1279, a segment is treated as "silence" when decoding has failed due to `compression_ratio_threshold` [+ `no_speech_threshold`], while further down the code the same segment is no longer treated as "silence".

"Silence" should apply only when decoding has failed due to `logprob_threshold` [+ `no_speech_threshold`].

As described in the CLI help:

```python
parser.add_argument("--no_speech_threshold", type=optional_float, default=0.6, help="if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence")
```

And in the code here:

```python
if no_speech_threshold is not None:
    # no voice activity check
    should_skip = result.no_speech_prob > no_speech_threshold
    if (
        logprob_threshold is not None
        and result.avg_logprob > logprob_threshold
    ):
        # don't skip if the logprob is high enough, despite the no_speech_prob
        should_skip = False
```
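For context, the fallback decision this PR corrects can be sketched as a small self-contained function (a minimal sketch, assuming default thresholds; the `Result` dataclass and `should_stop_fallback` name are illustrative, not whisper's actual API):

```python
from dataclasses import dataclass

# Illustrative stand-in for whisper's DecodingResult; only the fields
# used by the fallback decision are modeled here.
@dataclass
class Result:
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float

def should_stop_fallback(result, compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0, no_speech_threshold=0.6):
    """True if decode_with_fallback may stop trying higher temperatures."""
    needs_fallback = False
    if (compression_ratio_threshold is not None
            and result.compression_ratio > compression_ratio_threshold):
        needs_fallback = True  # transcript too repetitive
    if (logprob_threshold is not None
            and result.avg_logprob < logprob_threshold):
        needs_fallback = True  # average log probability too low
    # The fix: a segment counts as confirmed silence (making fallback
    # pointless) only when no_speech_prob is high AND the logprob check
    # failed; a high compression ratio alone must still trigger fallback.
    if (no_speech_threshold is not None
            and result.no_speech_prob > no_speech_threshold
            and logprob_threshold is not None
            and result.avg_logprob < logprob_threshold):
        needs_fallback = False  # silence
    return not needs_fallback
```

With the pre-fix behavior, a segment with a huge compression ratio but a high no_speech_prob would never be retried at a higher temperature, which is exactly the hallucination-loop case reported further down this thread.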

@Purfview (Contributor, Author)

Related: SYSTRAN/faster-whisper#621

@Purfview (Contributor, Author) commented Dec 17, 2023

I think this bug can trigger hallucination loops: on some hallucinations it would never trigger the prompt reset at high temperature, because higher temperatures are not computed for segments that are not actually "silence".
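The mechanism can be sketched as follows (simplified from whisper's `transcribe.py`; the function names and signatures here are illustrative, not whisper's actual API). If the buggy silence check accepts a hallucinated result at temperature 0.0, the loop never escalates, so the prompt reset that would break the loop never fires:

```python
# Temperatures tried in order by the fallback (whisper's defaults).
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)

def decode_with_fallback(decode_at, acceptable):
    """Try temperatures in order; decode_at(t) produces a result and
    acceptable(result) says whether higher temperatures can be skipped."""
    for t in TEMPERATURES:
        result = decode_at(t)
        if acceptable(result):
            break
    return result, t

def should_reset_prompt(temperature, condition_on_previous_text=True):
    # whisper resets the previous-text prompt when the final decode
    # needed a temperature above 0.5 -- a fallback that never escalates
    # therefore also never resets the prompt.
    return (not condition_on_previous_text) or temperature > 0.5
```

So a check that wrongly declares a hallucinated segment "acceptable" at temperature 0.0 both keeps the bad text and keeps conditioning the next segment on it.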

@Purfview (Contributor, Author)

@TheoBoyer @jongwook, it would be great if you could take a look.

@TheoBoyer (Contributor)

This change is consistent with the rest of the code, so I'm not against it.

The original PR indeed skipped processing based on the logprob_threshold, but it was also contingent on logprob_threshold being set. @jongwook modified this. I assume the intention was to make the process independent of whether a threshold is set, but there may be reasons for this change that I'm unaware of.

However, I'm skeptical about involving logprob_threshold in silence discrimination in the first place.
The approach figure in the original paper clearly shows that there shouldn't be any decoding after no_speech.

[Figure: "Approach" diagram from the Whisper paper]

PR #1279 was created because no_speech does not depend on token decoding; hence, regardless of the tokens decoded, no_speech_prob will remain unchanged.

In the (too) few experiments I conducted, the model seemed capable of hallucinating high-probability tokens during silences. It would be beneficial if someone could further investigate the relevance of incorporating logprob_threshold in silence discrimination. I'm also interested to know if any related experiments already exist.
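The premise of #1279 is easy to illustrate (toy numbers, not from the model): `no_speech_prob` is just the softmax probability of the `<|nospeech|>` token over the logits of the initial decoding step, computed before any token is sampled, so it is identical across fallback attempts at different sampling temperatures.

```python
import math

def no_speech_prob(initial_logits, no_speech_index):
    # Softmax probability of the <|nospeech|> token over the logits of
    # the first decoding step -- computed before any token is sampled,
    # so the sampling temperature never enters this computation.
    exps = [math.exp(x) for x in initial_logits]
    return exps[no_speech_index] / sum(exps)

# Toy logits for three tokens at the first step; index 1 plays the role
# of <|nospeech|>.
p = no_speech_prob([1.0, 3.0, 0.5], 1)
```

Because this value is fixed for a given audio window, retrying at higher temperature can change the decoded tokens (and thus `avg_logprob` and `compression_ratio`) but never `no_speech_prob`.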

@Purfview (Contributor, Author) commented Dec 18, 2023

> However, I'm skeptical about involving logprob_threshold in silence discrimination in the first place.

no_speech_threshold alone is pretty unreliable; the model can produce a no_speech_prob close to 1.0 on perfectly fine speech.

@Purfview (Contributor, Author)

> I think this bug can trigger the hallucinations loop because on some hallucination it wouldn't trigger the prompt reset on high temperature, because higher temperatures are not computed on what is not an actual "silence".

My guess was right; today I encountered one:

```text
DEBUG: Compression ratio threshold is not met with temperature 0.0 (6.677966 > 2.400000)
[04:17.320 --> 04:29.020]  been doing it for a long time. I'm a professional. I'm a professional. I'm a
[04:29.020 --> 04:29.340]  professional. I'm a professional. I'm a professional. I'm a professional. I'm
[04:29.340 --> 04:34.560]  a professional. I'm a professional. I'm a professional. I'm a professional. I'm
[04:34.560 --> 04:38.360]  a professional. I'm a professional. I'm a professional. I'm a professional. I'm
[04:38.360 --> 05:03.750]  a professional. I'm a professional. I'm a professional. I'm a professional. I'm
```

No hallucination loop with this bugfix:

```text
DEBUG: Compression ratio threshold is not met with temperature 0.0 (6.677966 > 2.400000)
DEBUG: Compression ratio threshold is not met with temperature 0.2 (8.533333 > 2.400000)
DEBUG: Compression ratio threshold is not met with temperature 0.4 (8.884615 > 2.400000)
[04:17.320 --> 04:22.640]  got me feeling natural. Finding a natural-seeming way to fail at any given task.
[04:23.700 --> 04:27.140]  In each of the commercials that I'm in, I'm the one who simply can't go on
[04:27.140 --> 04:33.340]  without the product. It's ridiculous that we don't have the product. Show them.
DEBUG: Reset prompt. prompt_reset_on_temperature threshold is met 0.600000 > 0.500000
DEBUG: Log probability threshold is not met with temperature 0.0 (-1.344815 < -1.000000)
DEBUG: Log probability threshold is not met with temperature 0.2 (-1.150256 < -1.000000)
[04:33.340 --> 04:35.340]  No, you shouldn't.
[04:36.020 --> 04:36.300]  Please.
[04:36.560 --> 04:37.520]  You wanna see?
[04:38.020 --> 04:39.080]  Yeah, I wanna see.
[04:43.260 --> 04:44.120]  She's amazing.
[05:03.870 --> 05:05.110]  I just...
[05:05.110 --> 05:05.650]  I...
```

Bugfix for openai#1279

A segment is treated as "silence" when decoding has failed due to `compression_ratio_threshold` too, while further down the code it is no longer treated as "silence".

"Silence" should apply only when decoding has failed due to `logprob_threshold`.

As described here:
https://github.com/openai/whisper/blob/8bc8860694949db53c42ba47ddc23786c2e02a8b/whisper/transcribe.py#L421

And in the code here:
https://github.com/openai/whisper/blob/8bc8860694949db53c42ba47ddc23786c2e02a8b/whisper/transcribe.py#L243-L251
@Purfview (Contributor, Author)

Another example of hallucination fix: #1962

@Purfview (Contributor, Author) commented Nov 29, 2024

@jongwook Why is this bugfix still not merged?

Maybe it's confusing; read the description of #1279:

> In decode_with_fallback, we compute higher temperatures in the case where compression_ratio is too high or avg_logprob is too low. But as the computation of no_speech_prob doesn't depend on sampling, we can avoid computing higher temperatures if we detect in the first one that the no_speech condition is fulfilled

This PR retains the full functionality described in the quote above, and fixes the #1279 bug where higher temperatures are skipped when the no_speech condition is *not* fulfilled; they should be skipped only when it *is* fulfilled.

That bug can cause hallucination loops, and is probably responsible for a big portion of all the hallucinations reported.
As I understand it, the sole reason for the fallback is to recover from hallucinations, and this bug prevents that.

@jongwook jongwook merged commit 90db0de into openai:main Dec 1, 2024
9 checks passed
@Purfview Purfview deleted the patch-1 branch December 1, 2024 07:11
joelvaneenwyk pushed a commit to joelvaneenwyk/whisper that referenced this pull request Dec 31, 2024
Bugfix: Illogical "Avoid computing higher temperatures on no_speech" (openai#1903)

* Bugfix: Illogical "Avoid computing higher temperatures on no_speech"

* Fix if "logprob_threshold=None"

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>