Fixing words that have missing timestamps #349

crisprin17 · 2023-06-29T00:31:33Z

I noticed that words in the "words" list sometimes don't have a start or end time if they consist only of digits.
I have been using this function to fix missing "start" or "end" timestamps in segments.
Hope this helps people who are having the same issue!

```python
EPSILON = 1e-3  # small offset in seconds; the original post did not define it, so this value is an assumption


def fix_whisperX_timestamp(segments):
    """
    Fixes missing "start" or "end" timestamps in the segments before assigning speakers.

    Args:
        segments (list): List of segments containing words with missing timestamps.

    Returns:
        list: Updated segments with filled timestamps.
    """
    for seg in segments:
        words = seg["words"]
        seg_length = len(words)

        # Remember which words were incomplete before anything is filled in
        missing_idx = [
            i for i, w in enumerate(words) if "start" not in w or "end" not in w
        ]

        # Step 1: Fill missing "start" timestamps by increasing values
        prev_end = seg["start"]
        for i in range(seg_length):
            if "start" not in words[i]:
                words[i]["start"] = prev_end + EPSILON
            # A word may also lack "end"; keep the last known end in that case
            prev_end = words[i].get("end", prev_end)

        # Step 2: Fill missing "end" timestamps by decreasing values
        next_start = seg["end"]
        for j in range(seg_length - 1, -1, -1):
            if "end" not in words[j]:
                words[j]["end"] = next_start - EPSILON
            else:
                next_start = words[j]["start"]

        # Step 3: Fix consecutive missing values. After steps 1 and 2, every word
        # in a run of consecutive missing entries spans the same gap, so split
        # that gap in proportion to word length. The trailing None flushes the last run.
        run = []
        for i in missing_idx + [None]:
            if run and i is not None and i == run[-1] + 1:
                run.append(i)
                continue
            if len(run) > 1:
                gap_start = words[run[0]]["start"]
                gap_end = max(words[run[-1]]["end"], gap_start)
                total_chars = sum(len(words[k]["word"]) for k in run) or 1
                t = gap_start
                for k in run:
                    share = (gap_end - gap_start) * len(words[k]["word"]) / total_chars
                    words[k]["start"] = t
                    words[k]["end"] = t + share
                    t += share
            run = [i] if i is not None else []

    return segments
```
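
To check whether a transcript is affected before applying a fix, a minimal sketch like the following can list the words that lack timestamps. The sample data and the helper name `find_untimed_words` are made up for illustration; only the `"words"`, `"word"`, `"start"`, and `"end"` keys are taken from the function above.

```python
def find_untimed_words(segments):
    """Return (segment_index, word_index, text) for every word lacking a timestamp."""
    missing = []
    for s_idx, seg in enumerate(segments):
        for w_idx, word in enumerate(seg.get("words", [])):
            if "start" not in word or "end" not in word:
                missing.append((s_idx, w_idx, word.get("word", "")))
    return missing


# Hypothetical sample: the digit-only word "2023" came back without timestamps
sample_segments = [
    {
        "start": 0.0,
        "end": 2.0,
        "words": [
            {"word": "In", "start": 0.1, "end": 0.3},
            {"word": "2023"},
            {"word": "we", "start": 1.2, "end": 1.4},
        ],
    }
]

print(find_untimed_words(sample_segments))
```

An empty result would suggest the transcript needs no post-processing at all.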

mbalandis · 2023-07-06T09:31:29Z

I have encountered the same issue, and the solution can be as simple as this if you just want to post-process the broken JSON file. Feed in the words from the output JSON, and it fills each gap by starting the word where the previous one ended and ending it where the next one starts. It assumes the same speaker as the previous word (if a speaker exists in the JSON) and assigns an arbitrary score such as 0.5.

def estimate_time(words_from_segments):
    for i, word in enumerate(words_from_segments):
        try:
            if "start" not in word or "end" not in word:
                prev_word_end = words_from_segments[i - 1]["end"]
                next_word_start = words_from_segments[i + 1]["start"]
                word["start"] = prev_word_end
                word["end"] = next_word_start
                word["score"] = 0.5
                word["speaker"] = words_from_segments[i - 1]["speaker"] if "speaker" in words_from_segments[i - 1] else None
        except Exception as e:  # Catch-all for any other exceptions
            print(f"Unexpected error while estimating time for word: {word}. Error details: {e}")
    return words_from_segments

It might have edge cases and need adjustments for your use case, but if parsing the JSON later depends on having start and end times, it will at least not crash :)
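
A function like this expects a flat list of word dictionaries. One way to build that list from a saved output file is sketched below. It assumes the JSON has a top-level `"segments"` list whose items each carry a `"words"` list, and the helper name `flatten_words` is hypothetical. Because the flat list holds references to the same dictionaries, filling in a word there also updates it inside its segment.

```python
import json


def flatten_words(result):
    """Collect every word dict from every segment, in order, without copying."""
    return [word for seg in result.get("segments", []) for word in seg.get("words", [])]


# Hypothetical stand-in for the contents of a saved output file
result = json.loads(
    '{"segments": [{"start": 0.0, "end": 2.0, "words": ['
    '{"word": "In", "start": 0.1, "end": 0.3}, {"word": "2023"}, '
    '{"word": "we", "start": 1.2, "end": 1.4}]}]}'
)

words = flatten_words(result)

# Fill the gap by hand: start where the previous word ended, end where the next starts
words[1]["start"] = words[0]["end"]
words[1]["end"] = words[2]["start"]

# The segment sees the change because the dicts are shared
print(result["segments"][0]["words"][1])
```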

CerealNopon · 2023-08-04T18:00:56Z

Updated code to account for consecutive words missing 'start' and 'end':

def estimate_time(words_from_segments):
    for i, word in enumerate(words_from_segments):
        try:
            if "start" not in word or "end" not in word:
                # Without this guard, i - 1 would wrap around to the last word
                if i == 0:
                    raise IndexError("no previous word to take an end time from")
                prev_word_end = words_from_segments[i - 1]["end"]

                # Walk forward to the next word that has a start time. If there is
                # none, the IndexError is reported by the outer handler.
                pointer = 1
                while "start" not in words_from_segments[i + pointer]:
                    pointer += 1
                next_word_start = words_from_segments[i + pointer]["start"]

                # Split the gap evenly across the `pointer` steps up to that word
                word["start"] = prev_word_end
                word["end"] = prev_word_end + (next_word_start - prev_word_end) / pointer
                word["score"] = 0.5
                word["speaker"] = words_from_segments[i - 1]["speaker"] if "speaker" in words_from_segments[i - 1] else None
        except Exception as e:  # Catch-all for any other exceptions
            print(f"Unexpected error while estimating time for word: {word}. Error details: {e}")
    return words_from_segments
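
The even-spacing idea can also be written so that a missing word at the very beginning or end of the list does not fall through to the error handler. The following is only a sketch under assumptions: the function name `fill_missing_times` and the `fallback_start` / `fallback_end` parameters are hypothetical, and the caller is expected to supply sensible bounds (for example the first segment's start and the last segment's end). It does not set `"score"` or `"speaker"`.

```python
def fill_missing_times(words, fallback_start=0.0, fallback_end=None):
    """Evenly spread each run of untimed words over the gap around it (in place)."""
    i = 0
    while i < len(words):
        if "start" in words[i] and "end" in words[i]:
            i += 1
            continue
        # Find the end of this run of words lacking a timestamp
        j = i
        while j < len(words) and ("start" not in words[j] or "end" not in words[j]):
            j += 1
        # Left bound: end of the previous word, or the fallback for a leading run
        left = words[i - 1]["end"] if i > 0 else fallback_start
        # Right bound: start of the next timed word, or the fallback for a trailing run
        if j < len(words):
            right = words[j]["start"]
        else:
            right = fallback_end if fallback_end is not None else left
        right = max(right, left)  # never produce a negative duration
        step = (right - left) / (j - i)
        for k in range(i, j):
            words[k]["start"] = left + step * (k - i)
            words[k]["end"] = left + step * (k - i + 1)
        i = j
    return words


# Hypothetical sample: two consecutive digit-only words without timestamps
demo = [
    {"word": "call", "start": 0.0, "end": 1.0},
    {"word": "555"},
    {"word": "1234"},
    {"word": "now", "start": 2.0, "end": 2.5},
]
fill_missing_times(demo)
print(demo[1], demo[2])
```

With a trailing run and no `fallback_end`, the untimed words collapse to zero duration at the last known end time, which at least keeps downstream parsing from failing on missing keys.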

Barabazs · 2025-01-17T08:54:39Z

Should be fixed with #986

ProducerMatt mentioned this issue Jul 20, 2023

[bug] whisperX word segmentation fails: KeyError: 'start' absadiki/subsai#53

Closed

aleksandr-smechov mentioned this issue Jul 20, 2023

Look into alignment issue Wordcab/wordcab-transcribe#154

Closed

Barabazs closed this as completed Jan 17, 2025
