--preserved_segments_outside_live_window can fail to delete old segments. #533

Closed
BobCu opened this issue Dec 14, 2018 · 9 comments
Labels: status: archived, type: enhancement
Milestone: v2.4

BobCu commented Dec 14, 2018

--preserved_segments_outside_live_window appears to make a single attempt to remove old segment files, never attempting a retry.

For example, this happens when a 1 Gbps link is throttled down to 750 Kbps just as the player starts fetching the next high-resolution segment: the player can keep that segment file open for quite a while before it gives up, so the packager's single deletion attempt fails.

Workaround: At the end of my packager startup script, I added this code:

```sh
# Fallback cleanup loop: every 10 seconds, delete any .m4s segment not
# modified in the last minute that the packager failed to remove itself.
# Errors (e.g. a file still held open) are suppressed; the next pass retries.
while true; do
  sleep 10
  find "${WWW_ROOT}/www" -mmin +1 -name "*.m4s" -delete >/dev/null 2>&1
done
```

Possible resolution strategy: keep a list of all segments generated, and each time a new segment is generated try to delete every segment outside the window, removing a list entry only when its deletion succeeds. This ensures multiple attempts for every segment outside the window. Perhaps raise an error after a significant number of deletion attempts fail (~10 or so), so the operator can address downstream apps that may be keeping the file open or locked.
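The retry-list strategy could be sketched as follows. This is a hypothetical illustration (the `SegmentCleaner` name and its methods are invented, and the actual packager is C++, not Python), not the real implementation:

```python
import os
from collections import OrderedDict

class SegmentCleaner:
    """Track segments that have left the live window and retry deletion."""

    def __init__(self, max_attempts=10):
        self.max_attempts = max_attempts
        self.pending = OrderedDict()  # path -> number of failed attempts

    def schedule(self, path):
        """Mark a segment file as outside the window, eligible for deletion."""
        self.pending.setdefault(path, 0)

    def on_new_segment(self):
        """Retry every pending deletion each time a new segment is generated."""
        for path in list(self.pending):
            try:
                os.remove(path)
                del self.pending[path]
            except FileNotFoundError:
                del self.pending[path]  # already gone; count as success
            except OSError:
                self.pending[path] += 1
                if self.pending[path] >= self.max_attempts:
                    # Surface an error so the operator can look for a
                    # downstream app holding the file open/locked.
                    print(f"warning: still cannot delete {path}")
```

A failed deletion stays in `pending` and is retried on every subsequent segment, so a transiently locked file is eventually removed instead of leaking.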

Originally posted by @BobCu in #509 (comment)


BobCu commented Dec 14, 2018

Rather than change the packager, adding the above workaround to the docs may be enough to close this issue.


kqyang commented Dec 14, 2018

Yes, we need to improve the packager to retry file deletion when it fails. I'll add it to the v2.4 milestone for now.

But it is not a good sign if every file deletion needs to be retried.

As mentioned in #509 (comment), if we know that the server needs to hold the file for some more time, it is better to give it more time, e.g. by increasing --preserved_segments_outside_live_window (a value of 6 corresponds to 1 minute when the segment duration is 10 seconds).
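To make the flag arithmetic explicit (assuming the flag simply counts whole segments preserved outside the live window):

```python
# Hypothetical sanity check of the flag semantics described above.
segment_duration_s = 10
preserved_segments_outside_live_window = 6

# Time a segment survives after leaving the live window before deletion.
preserved_time_s = segment_duration_s * preserved_segments_outside_live_window
assert preserved_time_s == 60  # 6 segments * 10 s = 1 minute
```

So if a downstream consumer may hold a segment open for up to N seconds, the flag should be at least N / segment_duration, rounded up.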

@kqyang kqyang added type: enhancement New feature or request and removed needs triage labels Dec 14, 2018
@kqyang kqyang added this to the v2.4 milestone Dec 14, 2018

BobCu commented Dec 15, 2018

Not "every", just the few that fail. But even a low percentage of failures will eventually consume all storage space. That's not a good sign!


kqyang commented Dec 15, 2018

Not "every", just the few that fail.

OK, that is much better. In any case, agreed that we'll need the retry logic so the files do not accumulate!

@kqyang kqyang self-assigned this Jan 8, 2019

BobCu commented Jan 8, 2019

I won't get to test the fix for a little while, but the code looks great.

Thanks!

@ohMoshko

Hello,
I am testing a case of a fading antenna signal.
The fix in 6d6db76 does retry deleting old segments after a failure; however, when shaka-packager is fed corrupted data (due to low signal power, for example) for a long enough period, old segments accumulate and the application eventually crashes. If, at any point before the crash, "good" data comes in, the function that removes old segments gets called and the old segments are removed.
It seems that the functions that delete old segments, RemoveOldSegment for HLS and RemoveSegments for DASH, don't get called while the data is corrupted, even though new segments are still being added. Is there a reason why those functions don't get called all the time? For example, in packager/hls/base/media_playlist.cc the call to RemoveOldSegment() is at the end of the SlideWindow() function. Why not call it separately, before calling SlideWindow(), to ensure the cleanup of old files?

I am using an attenuator to reproduce the issue. The attenuator receives ATSC data and sends it to an ATSC RF receiver. The data is then turned into IP streams, fed into shaka-packager, and eventually streamed out. I raise the level of the attenuator until the data is visibly affected (pixelation) and let it run.


kqyang commented Feb 26, 2019

@ohMoshko Interesting. I admit that we did not take data corruption into consideration when designing this solution.

Do you mind describing a little bit more on what is happening? For example, what does packager see for "bad" data? Valid TS streams with incorrect timestamps?

the application eventually crashes

What is causing the crash? Out of storage space? Out of memory?

Is there a reason why those functions don't get called all the time?

I do not know. One possibility is that the timestamp is corrupted, causing SlideWindow() to exit early: https://github.com/google/shaka-packager/blob/master/packager/hls/base/media_playlist.cc#L581; it is also possible that MediaPlaylist::SlideWindow() or MediaPlaylist::AddSegmentInfoEntry() is not invoked at all.

Why not call it separately before calling SlideWindow() to ensure the cleanup of old files?

The files to be deleted are stored in segments_to_be_removed_, which is only updated when SlideWindow() is called, so segments_to_be_removed_ does not grow if SlideWindow() is not called.
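A minimal model of the control flow being described (hypothetical Python that mirrors, but does not reproduce, media_playlist.cc) shows why calling the cleanup separately would not help: nothing enters segments_to_be_removed_ unless SlideWindow() gets past its early exit.

```python
class PlaylistModel:
    """Toy model: cleanup is gated on SlideWindow() passing its early exit."""

    def __init__(self, window_duration):
        self.window_duration = window_duration
        self.segments = []               # list of (start_time, duration)
        self.segments_to_be_removed = []
        self.removed = []                # stand-in for deleted files

    def add_segment(self, start, duration):
        self.segments.append((start, duration))
        self.slide_window()

    def slide_window(self):
        total = sum(d for _, d in self.segments)
        if total <= self.window_duration:
            # Early exit. With corrupted timestamps the computed play time
            # stays small, so the code below never runs and no segment is
            # ever queued for removal.
            return
        while total > self.window_duration and len(self.segments) > 1:
            start, dur = self.segments.pop(0)
            total -= dur
            self.segments_to_be_removed.append(start)
        self.remove_old_segments()

    def remove_old_segments(self):
        # The only place "files" get deleted; it drains a list that is
        # only populated inside slide_window().
        self.removed.extend(self.segments_to_be_removed)
        self.segments_to_be_removed.clear()
```

With good 10-second segments the oldest segment slides out and is removed; once durations collapse to ~0 (corrupted timestamps), every call takes the early exit and segments accumulate without bound, matching the behavior reported above.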


ohMoshko commented Feb 28, 2019

I am feeding the packager with different ATSC services. The services affected by the low signal power start accumulating .mp4 files (audio and video). A service's media directory grows to around 900 MB, and then the Linux OOM killer is invoked and kills my app that is running shaka-packager:
oom_killer_shaka_log.txt

I can confirm that when the signal is fading, MediaPlaylist::SlideWindow() and MediaPlaylist::AddSegmentInfoEntry() are invoked but the timestamp is probably corrupted causing SlideWindow() to exit early:
https://github.com/google/shaka-packager/blob/master/packager/hls/base/media_playlist.cc#L581
If the signal is "good" again, SlideWindow() does not exit at the above if statement and the RemoveOldSegments() function is called and the old segments are removed correctly.
So I think shaka sees the bad data as a valid TS stream with incorrect timestamps.
When the function exits early, current_play_time is not 0; it is usually between 1 and 14.
I analyzed the MPEG TS using TSDuck and observed large discontinuity counts for some of the PIDs (as expected).
If SlideWindow() continuously exits early, the media folder grows and grows, and eventually the OOM killer is invoked.

I can record the TS and send it to you if that would help.


kqyang commented Feb 28, 2019

@ohMoshko This is probably a different issue than the original issue in #533. Let's discuss in #563.

@shaka-bot shaka-bot added the status: archived Archived and locked; will not be updated label Mar 9, 2019
@shaka-project shaka-project locked and limited conversation to collaborators Mar 9, 2019