Regarding the issue of sentence length #916

Open
heartInsert opened this issue Nov 3, 2024 · 2 comments

@heartInsert

Some transcribed sentences are too long for subtitle files. Is there a way to limit the length of transcribed sentences, or to split long sentences in code? Thanks.

@jonathanfox5

jonathanfox5 commented Nov 22, 2024

I've been playing about with this today. The SubtitlesProcessor module included with whisperx is really good!

from whisperx.SubtitlesProcessor import SubtitlesProcessor

# Do all of your whisper transcribing / alignment here
# Output of the alignment stage should be an object called `result`

# All variable names below apart from `result` are settings that can be exposed to the user.
subtitles_processor = SubtitlesProcessor(
    result["segments"],
    language_code, # str, two-letter code to identify the language
    max_line_length=max_line_length, # int, around 100 has been working for me
    min_char_length_splitter=sub_split_threshold, # int, around 70 has been working for me
    is_vtt=is_vtt, # bool, True for VTT, False for SRT format
)
subtitles_processor.save(output_path, advanced_splitting=True) # output_path is a str with your desired filename

There's an alternative example in the pull request here
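
If you would rather split the segments yourself before writing subtitles, here is a minimal sketch of the idea. It is not how SubtitlesProcessor works internally. It assumes each aligned segment carries a `words` list whose entries have a `word` key and usually `start` and `end` times, which is what I would expect after alignment; some words may be missing timestamps, so there is a fallback. `split_segment` and `max_chars` are names made up for this example, and joining words with spaces will not suit languages written without spaces.

```python
from typing import Any


def split_segment(segment: dict[str, Any], max_chars: int = 42) -> list[dict[str, Any]]:
    """Split one aligned segment into pieces of at most `max_chars` characters.

    Hypothetical helper: assumes `segment["words"]` is a list of dicts with a
    "word" key and (usually) "start"/"end" times in seconds.
    """
    words = segment.get("words") or []
    if not words:
        return [segment]  # no word-level data, so there is nothing to split on

    # Greedily pack words into chunks without exceeding max_chars.
    chunks: list[list[dict[str, Any]]] = [[]]
    length = 0
    for w in words:
        token = w["word"].strip()
        extra = len(token) + (1 if chunks[-1] else 0)  # +1 for the joining space
        if chunks[-1] and length + extra > max_chars:
            chunks.append([])
            length = 0
            extra = len(token)
        chunks[-1].append(w)
        length += extra

    # Take timestamps from the first/last timed word in each chunk.
    pieces = []
    prev_end = segment["start"]
    for chunk in chunks:
        starts = [w["start"] for w in chunk if "start" in w]
        ends = [w["end"] for w in chunk if "end" in w]
        start = starts[0] if starts else prev_end  # fallback: previous piece's end
        end = ends[-1] if ends else start
        text = " ".join(w["word"].strip() for w in chunk)
        pieces.append({"start": start, "end": end, "text": text})
        prev_end = end
    return pieces
```

You would call it on each item of `result["segments"]` and flatten the returned lists. A single word longer than `max_chars` is kept whole on its own line rather than cut in the middle.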

@heartInsert

I really love you, bro, you are my hero.
