
Fixing words that have missing timestamps #349

Closed
crisprin17 opened this issue Jun 29, 2023 · 3 comments

@crisprin17

I noticed that words in the "words" list sometimes have no start or end time when they consist only of digits.
This is the function I have been using to fill in missing "start" or "end" timestamps in segments.
I hope it helps people who are running into the same issue!

```python
EPSILON = 1e-3  # small gap in seconds between adjacent words (assumed value; not defined in the original snippet)


def fix_whisperX_timestamp(segments):
    """
    Fixes missing "start" or "end" timestamps in the segments before assigning speakers.

    Args:
        segments (list): List of segments containing words with missing timestamps.

    Returns:
        list: Updated segments with filled timestamps.
    """
    for seg in segments:
        words = seg["words"]
        seg_length = len(words)

        # Step 1: Fill missing "start" timestamps, moving forward from the segment start
        prev_end = seg["start"]
        missing_starts_idx = []
        for i in range(seg_length):
            if "start" not in words[i]:
                words[i]["start"] = prev_end + EPSILON
                missing_starts_idx.append(i)
            else:
                # "end" may be missing as well, so fall back to the last known end
                prev_end = words[i].get("end", prev_end)

        # Step 2: Fill missing "end" timestamps, moving backward from the segment end
        next_start = seg["end"]
        for j in range(seg_length - 1, -1, -1):
            if "end" not in words[j]:
                words[j]["end"] = next_start - EPSILON
            else:
                next_start = words[j]["start"]

        # Step 3: Split the gap across each run of consecutive missing words,
        # proportionally to word length (generalizes the pairwise interpolation)
        missing = set(missing_starts_idx)
        i = 0
        while i < seg_length:
            if i not in missing:
                i += 1
                continue
            run_end = i
            while run_end + 1 in missing:
                run_end += 1
            if run_end > i:
                # Fall back to the segment bounds when the run touches either edge
                prev_end = words[i - 1]["end"] if i > 0 else seg["start"]
                next_start = words[run_end + 1]["start"] if run_end + 1 < seg_length else seg["end"]
                total_chars = sum(len(words[k]["word"]) for k in range(i, run_end + 1)) or 1
                cursor = prev_end
                for k in range(i, run_end + 1):
                    share = (next_start - prev_end) * len(words[k]["word"]) / total_chars
                    words[k]["start"] = cursor + EPSILON
                    cursor += share
                    words[k]["end"] = cursor
                words[run_end]["end"] = next_start - EPSILON
            i = run_end + 1

    return segments
```
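To see which words are affected, before or after applying a fix like the one above, a small checker can help. This is only a sketch: `find_untimed_words` and the sample data are hypothetical, and it assumes segments shaped like the ones in this issue (a "words" list of dicts with optional "start" and "end" keys).

```python
def find_untimed_words(segments):
    """Return (segment_index, word_index, text) for every word lacking "start" or "end"."""
    problems = []
    for s_idx, seg in enumerate(segments):
        for w_idx, word in enumerate(seg.get("words", [])):
            if "start" not in word or "end" not in word:
                problems.append((s_idx, w_idx, word.get("word", "")))
    return problems


# Hypothetical sample shaped like the segments discussed in this issue
sample = [
    {
        "start": 0.0,
        "end": 3.0,
        "words": [
            {"word": "call", "start": 0.0, "end": 0.4},
            {"word": "911"},  # numeric token without timestamps
            {"word": "now", "start": 2.5, "end": 3.0},
        ],
    }
]

print(find_untimed_words(sample))  # [(0, 1, '911')]
```

An empty list means every word has both timestamps.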

@mbalandis

mbalandis commented Jul 6, 2023

I have encountered the same issue, and the solution can be as simple as this if you just want to post-process the broken JSON file. Feed in the words from the output JSON and it fills each gap by starting where the previous word ended and ending where the next word starts, assuming the same speaker as the previous word (if one exists in the JSON) and assigning an arbitrary score such as 0.5.

def estimate_time(words_from_segments):
    for i, word in enumerate(words_from_segments):
        try:
            if "start" not in word or "end" not in word:
                prev_word_end = words_from_segments[i - 1]["end"]
                next_word_start = words_from_segments[i + 1]["start"]
                word["start"] = prev_word_end
                word["end"] = next_word_start
                word["score"] = 0.5
                word["speaker"] = words_from_segments[i - 1]["speaker"] if "speaker" in words_from_segments[i - 1] else None
        except Exception as e:  # Catch-all for any other exceptions
            print(f"Unexpected error while estimating time for word: {word}. Error details: {e}")
    return words_from_segments

It might have edge cases and need adjustments for your use case, but if parsing the JSON later depends on having start and end times, it will at least not crash :)
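One such edge case is an untimed word at either end of the list. When it is the first word, `words_from_segments[i - 1]` reads the last word, because negative indices wrap around in Python. When it is the last word, `i + 1` raises an IndexError, which the except clause catches, so the word stays unfilled. A minimal sketch of a boundary-aware lookup follows; `neighbor_bounds` is a hypothetical helper, not part of whisperX, and the caller still has to decide what to use when a side is `None` (for example, the segment's own start or end).

```python
def neighbor_bounds(words, i):
    """Return (prev_end, next_start) for words[i]; a side is None when no timed neighbour exists."""
    prev_end = words[i - 1].get("end") if i > 0 else None
    next_start = words[i + 1].get("start") if i + 1 < len(words) else None
    return prev_end, next_start


# Hypothetical word list with untimed words at both edges
words = [
    {"word": "7"},                               # first word, untimed
    {"word": "days", "start": 0.5, "end": 0.9},
    {"word": "30"},                              # last word, untimed
]

print(neighbor_bounds(words, 0))  # (None, 0.5)
print(neighbor_bounds(words, 2))  # (0.9, None)
```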

@CerealNopon


Updated code to account for consecutive words that are missing 'start' and 'end':

def estimate_time(words_from_segments):
    for i, word in enumerate(words_from_segments):
        try:
            if "start" not in word or "end" not in word:
                # End of the previous word; earlier gaps were already filled by this loop.
                # Note: for i == 0 this reads the last word, because index -1 wraps around.
                prev_word_end = words_from_segments[i - 1]["end"]

                # Look ahead for the next word that has a "start";
                # pointer counts the steps needed to reach it.
                pointer = 1
                while "start" not in words_from_segments[i + pointer]:
                    pointer += 1
                next_known_start = words_from_segments[i + pointer]["start"]

                # Give this word an equal share of the gap up to the next known start
                next_word_start = prev_word_end + (next_known_start - prev_word_end) / pointer

                word["start"] = prev_word_end    # start where the previous word ended
                word["end"] = next_word_start    # end at the estimated start of the next word
                word["score"] = 0.5
                word["speaker"] = words_from_segments[i - 1]["speaker"] if "speaker" in words_from_segments[i - 1] else None
        except Exception as e:  # Catch-all, including running off the end of the list
            print(f"Unexpected error while estimating time for word: {word}. Error details: {e}")
    return words_from_segments
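The arithmetic in the updated code can be checked in isolation. Each untimed word ends at `prev_word_end + (next_start - prev_word_end) / pointer`, where `pointer` is the number of steps to the next timed word, and the following word then starts from that new end. Applied across a whole run, this divides the gap evenly. The sketch below only illustrates that step; `even_split` is a hypothetical name and the numbers are made up.

```python
def even_split(prev_end, next_start, run_length):
    """Apply end = prev + (next_start - prev) / remaining to each word in a run."""
    spans = []
    for remaining in range(run_length, 0, -1):
        end = prev_end + (next_start - prev_end) / remaining
        spans.append((prev_end, end))
        prev_end = end  # the next untimed word starts where this one ended
    return spans


# Three untimed words between a word ending at 1.0 s and a word starting at 4.0 s
print(even_split(1.0, 4.0, 3))  # [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
```

Unlike the length-weighted interpolation in the first comment, this gives every word in the run the same duration regardless of how many characters it has.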

@Barabazs
Collaborator

Should be fixed with #986
