support timestamp for numbers. #986

Merged on Jan 14, 2025 (6 commits)

bfs18 (Contributor) commented Jan 9, 2025

Updated get_trellis and backtrack to support aligning numbers.
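
For readers who have not looked at the diff, here is a minimal sketch of how a trellis and backtrack step can align characters the alignment model has no label for, such as digits or `$`. It assumes such characters are scored as wildcards (the best non-blank label in each frame); that is an assumption about the approach, not a copy of this PR's code. `WILDCARD` and `token_score` are hypothetical names, and `get_trellis` and `backtrack` here are simplified pure-Python stand-ins for the functions named above.

```python
import math

WILDCARD = -1  # hypothetical marker for a character missing from the vocabulary


def token_score(frame, token, blank_id=0):
    """Log-probability of `token` in one frame; a wildcard takes the best non-blank label."""
    if token == WILDCARD:
        return max(score for idx, score in enumerate(frame) if idx != blank_id)
    return frame[token]


def get_trellis(emission, tokens, blank_id=0):
    """trellis[t][j] = best log-prob of having emitted the first j tokens after t frames."""
    num_frames, num_tokens = len(emission), len(tokens)
    trellis = [[-math.inf] * (num_tokens + 1) for _ in range(num_frames + 1)]
    trellis[0][0] = 0.0
    for t in range(num_frames):
        trellis[t + 1][0] = trellis[t][0] + emission[t][blank_id]
        for j in range(num_tokens):
            stay = trellis[t][j + 1] + emission[t][blank_id]
            advance = trellis[t][j] + token_score(emission[t], tokens[j], blank_id)
            trellis[t + 1][j + 1] = max(stay, advance)
    return trellis


def backtrack(trellis, emission, tokens, blank_id=0):
    """Walk back from the final cell and record the frame at which each token is emitted."""
    t, j = len(emission), len(tokens)
    path = []
    while j > 0:
        if t == 0:
            raise ValueError("alignment failed: ran out of frames")
        stay = trellis[t - 1][j] + emission[t - 1][blank_id]
        advance = trellis[t - 1][j - 1] + token_score(emission[t - 1], tokens[j - 1], blank_id)
        if advance >= stay:
            path.append((j - 1, t - 1))  # (token index, frame index)
            j -= 1
        t -= 1
    return path[::-1]


# Toy vocabulary: 0 = blank, 1 = "a", 2 = "b". The text "a4b" becomes [1, WILDCARD, 2].
probs = [
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
emission = [[math.log(p) for p in row] for row in probs]
tokens = [1, WILDCARD, 2]
path = backtrack(get_trellis(emission, tokens), emission, tokens)
print(path)  # [(0, 0), (1, 2), (2, 4)]
```

In this toy run the wildcard lands on frame 2, the only frame between "a" and "b" where a non-blank label dominates, so the unknown character still receives a time position.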

import whisperx
import gc 

device = "cuda" 
audio_file = "--BhThOY2Ug_2.mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model(
    "large-v3", device, compute_type=compute_type,
    asr_options={"word_timestamps": True, "without_timestamps": False})

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

Attached audio: --BhThOY2Ug_2.mp3

The output is:

Detected language: en (1.00) in first 30s of audio...
[{'text': " Thomas' uncle posted an emotional message on Facebook, and the link to a fundraiser to help cover medical costs has raised over $43,000. Matt Uris for CBS 11 News.", 'start': 0.031, 'end': 10.409}]
[{'start': 0.031, 'end': 8.257, 'text': " Thomas' uncle posted an emotional message on Facebook, and the link to a fundraiser to help cover medical costs has raised over $43,000.", 'words': [{'word': "Thomas'", 'start': 0.031, 'end': 0.473, 'score': 0.72}, {'word': 'uncle', 'start': 0.554, 'end': 0.775, 'score': 0.745}, {'word': 'posted', 'start': 0.835, 'end': 1.157, 'score': 0.975}, {'word': 'an', 'start': 1.198, 'end': 1.238, 'score': 0.995}, {'word': 'emotional', 'start': 1.298, 'end': 1.741, 'score': 0.834}, {'word': 'message', 'start': 1.801, 'end': 2.143, 'score': 0.866}, {'word': 'on', 'start': 2.364, 'end': 2.404, 'score': 0.822}, {'word': 'Facebook,', 'start': 2.585, 'end': 3.048, 'score': 0.844}, {'word': 'and', 'start': 3.068, 'end': 3.128, 'score': 0.575}, {'word': 'the', 'start': 3.169, 'end': 3.249, 'score': 0.94}, {'word': 'link', 'start': 3.329, 'end': 3.571, 'score': 0.837}, {'word': 'to', 'start': 3.933, 'end': 4.033, 'score': 0.98}, {'word': 'a', 'start': 4.074, 'end': 4.114, 'score': 0.499}, {'word': 'fundraiser', 'start': 4.194, 'end': 4.717, 'score': 0.746}, {'word': 'to', 'start': 4.757, 'end': 4.798, 'score': 0.998}, {'word': 'help', 'start': 4.838, 'end': 4.979, 'score': 0.88}, {'word': 'cover', 'start': 5.14, 'end': 5.381, 'score': 0.819}, {'word': 'medical', 'start': 5.421, 'end': 5.743, 'score': 0.843}, {'word': 'costs', 'start': 5.823, 'end': 6.286, 'score': 0.973}, {'word': 'has', 'start': 6.708, 'end': 6.849, 'score': 0.998}, {'word': 'raised', 'start': 6.95, 'end': 7.251, 'score': 0.817}, {'word': 'over', 'start': 7.312, 'end': 7.452, 'score': 0.946}, {'word': '$43,000.', 'start': 7.473, 'end': 8.257, 'score': 0.877}]}, {'start': 8.478, 'end': 10.429, 'text': 'Matt Uris for CBS 11 News.', 'words': [{'word': 'Matt', 'start': 8.478, 'end': 9.142, 'score': 0.897}, {'word': 'Uris', 'start': 9.222, 'end': 9.403, 'score': 0.451}, {'word': 'for', 'start': 9.464, 'end': 9.544, 'score': 0.937}, {'word': 'CBS', 'start': 9.564, 'end': 9.866, 
'score': 0.737}, {'word': '11', 'start': 9.906, 'end': 10.047, 'score': 0.584}, {'word': 'News.', 'start': 10.148, 'end': 10.429, 'score': 0.814}]}]

Words with numerical elements are now accompanied by timestamps.

 {'word': '$43,000.', 'start': 7.473, 'end': 8.257, 'score': 0.877}
 {'word': '11', 'start': 9.906, 'end': 10.047, 'score': 0.584}
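
To check this on other files, a small helper can list numeric words that still lack timing. This is a sketch, assuming the segment layout printed above (a list of segments, each with a `words` list of dicts); `words_missing_timestamps` is a hypothetical helper, not part of whisperx, and the `2014.` entry is a made-up sample of a word without `start`/`end` keys.

```python
def words_missing_timestamps(segments):
    """Return words that contain a digit but have no 'start' or no 'end' key."""
    missing = []
    for segment in segments:
        for word in segment.get("words", []):
            has_digit = any(ch.isdigit() for ch in word["word"])
            has_timing = "start" in word and "end" in word
            if has_digit and not has_timing:
                missing.append(word["word"])
    return missing


# Two words copied from the output above.
aligned = [{"words": [
    {"word": "$43,000.", "start": 7.473, "end": 8.257, "score": 0.877},
    {"word": "11", "start": 9.906, "end": 10.047, "score": 0.584},
]}]

# Made-up sample: a numeric word that received no timing.
unaligned = [{"words": [{"word": "2014."}]}]

print(words_missing_timestamps(aligned))    # []
print(words_missing_timestamps(unaligned))  # ['2014.']
```

Passing `result["segments"]` from the script above to the helper should return an empty list if every numeric word was aligned.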

lesca commented Jan 13, 2025

Hello, how do I contribute my changes?

This alignment works well for numbers. However, I found that in some cases two subtitles are displayed at the same time.

I found a simple fix that works well: I added 0.03s to the start of each sentence and subtracted 0.02s from its end.

        for sdx, (sstart, send) in enumerate(segment["sentence_spans"]):
            curr_chars = char_segments_arr.loc[(char_segments_arr.index >= sstart) & (char_segments_arr.index <= send)]
            char_segments_arr.loc[(char_segments_arr.index >= sstart) & (char_segments_arr.index <= send), "sentence-idx"] = sdx
        
            sentence_text = text[sstart:send]
            sentence_start = curr_chars["start"].min() + 0.03 # fix
            end_chars = curr_chars[curr_chars["char"] != ' ']
            sentence_end = end_chars["end"].max() - 0.02 # fix
            sentence_words = []

            for word_idx in curr_chars["word-idx"].unique():
                word_chars = curr_chars.loc[curr_chars["word-idx"] == word_idx]
                word_text = "".join(word_chars["char"].tolist()).strip()
                if len(word_text) == 0:
                    continue

                # don't use the space character for alignment
                word_chars = word_chars[word_chars["char"] != " "]

                word_start = word_chars["start"].min()
                word_end = word_chars["end"].max()
                word_score = round(word_chars["score"].mean(), 3)

                # -1 indicates unalignable 
                word_segment = {"word": word_text}

                if not np.isnan(word_start):
                    word_segment["start"] = word_start + 0.03 if not sentence_words else word_start # fix
                if not np.isnan(word_end):
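                    # trim only the last word of this sentence; word-idx values are not guaranteed to restart at 0 per sentence
                    word_segment["end"] = word_end - 0.02 if word_idx == curr_chars["word-idx"].unique()[-1] else word_end # fix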
                if not np.isnan(word_score):
                    word_segment["score"] = word_score
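
The same idea can be tried outside the alignment loop. The sketch below is a standalone variation, not the patch above: it assumes each word is a dict with optional `start` and `end` keys, as in the printed output, and moves the first timed word later and the last timed word earlier. `shrink_sentence` and its parameters are hypothetical names.

```python
def shrink_sentence(words, start_pad=0.03, end_trim=0.02):
    """Return a copy of `words` with the sentence boundaries pulled inward."""
    shrunk = [dict(word) for word in words]  # leave the caller's dicts untouched
    starts = [i for i, word in enumerate(shrunk) if "start" in word]
    ends = [i for i, word in enumerate(shrunk) if "end" in word]
    if starts:
        first = starts[0]
        shrunk[first]["start"] = round(shrunk[first]["start"] + start_pad, 3)
    if ends:
        last = ends[-1]
        shrunk[last]["end"] = round(shrunk[last]["end"] - end_trim, 3)
    return shrunk


# The last two words of the first sentence in the output above.
words = [
    {"word": "over", "start": 7.312, "end": 7.452},
    {"word": "$43,000.", "start": 7.473, "end": 8.257},
]
shrunk = shrink_sentence(words)
print(shrunk[0]["start"], shrunk[-1]["end"])  # 7.342 8.237
```

When one sentence ends exactly where the next begins, these offsets leave a 0.05s gap between them. The sketch does not guard against a sentence shorter than 0.05s, where the shifted start could pass the shifted end.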

bfs18 (Contributor, Author) commented Jan 13, 2025

Hi @lesca, this PR needs to be merged first. After that, you can contribute your fix (adding 0.03s at the start and subtracting 0.02s at the end) on top of the updated main branch to solve the subtitle overlap issue.

cc @m-bain This pull request is useful as it adds proper timestamp support for numbers.

m-bain (Owner) commented Jan 13, 2025

Great work, let me test this today. Thanks @bfs18

Barabazs (Collaborator) commented

@bfs18 could you change the docstrings and comments to English please?

bfs18 (Contributor, Author) commented Jan 13, 2025

Hi @Barabazs, I've already made the changes.
