🐛(back) manage subtitles content starting with a BOM #2604

lunika · 2024-07-31T14:03:19Z

Purpose

An uploaded subtitle file sometimes starts with a byte order mark (BOM). When it does, the srt reader fails to detect the content as srt.
We have to remove the BOM before calling detect_format from the pycaption library.
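A minimal sketch of the problem, using only the standard library. It assumes the srt detection expects the first line to be a bare cue number, which is a guess about pycaption's behavior and is not confirmed in this PR. The helper name strip_leading_bom and the sample payload are hypothetical.

```python
import codecs


def strip_leading_bom(text: str) -> str:
    """Return text without a leading U+FEFF character, if there is one."""
    return text[1:] if text.startswith("\ufeff") else text


# Hypothetical srt payload whose bytes start with the UTF-8 BOM (EF BB BF).
raw = codecs.BOM_UTF8 + b"1\n00:00:01,000 --> 00:00:02,000\nHello\n"

# Decoding as plain UTF-8 keeps the BOM as a leading U+FEFF character,
# so the first line is no longer a bare number.
text = raw.decode("utf-8")
first_line_with_bom = text.splitlines()[0]

# Once the BOM is stripped, the first line is the cue number again.
first_line_clean = strip_leading_bom(text).splitlines()[0]

print(first_line_with_bom.isdigit(), first_line_clean.isdigit())

# Alternative when the raw bytes are available: the "utf-8-sig" codec
# drops a leading BOM while decoding.
decoded = raw.decode("utf-8-sig")
```

Unlike the replace call used in this PR, the helper only touches a U+FEFF at the very start of the content and leaves any later occurrence in place.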

Proposal

Manage subtitle content starting with a BOM

src/backend/marsha/core/tests/tasks/test_timed_text_track.py

jbpenrath · 2024-07-31T14:11:25Z

src/backend/marsha/core/tasks/timed_text_track.py

+            timed_text = timed_text_file.read().replace("\ufeff", "")
            reader = detect_format(timed_text)


Looks good as a workaround, but I feel this should be a contribution to pycaption, no?

I don't know... pycaption accepts a string as input, regardless of how you retrieve it. I will open an issue to find out whether they are interested in this contribution.

Issue created: pbs/pycaption#341

sentry-io · 2024-08-02T14:50:36Z

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte marsha.core.tasks.timed_text_track.convert_time...
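This error is what Python raises when bytes that are not valid UTF-8 are decoded as UTF-8. The trace does not say which encoding the uploaded file actually used, so the sketch below is only one possible way to handle it: try UTF-8 first, then fall back to legacy single-byte encodings. The function name decode_subtitle_bytes and the fallback list are assumptions, not code from this repository.

```python
def decode_subtitle_bytes(raw: bytes, fallbacks=("cp1252", "latin-1")) -> str:
    """Decode subtitle bytes, trying UTF-8 first and then legacy encodings."""
    try:
        # "utf-8-sig" also drops a leading UTF-8 BOM if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached when every fallback fails (latin-1 itself never does):
    # keep what can be decoded and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Hypothetical payload: byte 0xb5 is invalid as a UTF-8 start byte,
# but is the micro sign in cp1252 and latin-1.
sample = b"1\n00:00:01,000 --> 00:00:02,000\n5 \xb5m\n"
decoded_sample = decode_subtitle_bytes(sample)
```

A fallback like this guesses the encoding, so text in other legacy encodings would still come out garbled. Rejecting the upload with a clear error is an equally reasonable choice.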

lunika added the bug label Jul 31, 2024

lunika requested review from jbpenrath and wilbrdt July 31, 2024 14:03

lunika self-assigned this Jul 31, 2024

lunika force-pushed the pycatption_srt branch from 8703cda to e0738a9 Compare July 31, 2024 14:04

jbpenrath approved these changes Jul 31, 2024

wilbrdt approved these changes Jul 31, 2024

lunika mentioned this pull request Jul 31, 2024

Manage srt detection with content starting with a BOM pbs/pycaption#341

Open

lunika force-pushed the pycatption_srt branch from e0738a9 to 69021c0 Compare July 31, 2024 14:25

lunika enabled auto-merge (rebase) July 31, 2024 14:30

lunika merged commit 13b1686 into master Jul 31, 2024
32 of 33 checks passed

lunika deleted the pycatption_srt branch July 31, 2024 14:37
