--preserved_segments_outside_live_window can fail to delete old segments. #533

Closed
BobCu opened this issue Dec 14, 2018 · 9 comments
Labels: status: archived, type: enhancement
Milestone: v2.4

BobCu commented Dec 14, 2018

--preserved_segments_outside_live_window appears to make a single attempt to remove old segment files, never attempting a retry.

For example, this happens when a 1 Gbps link is throttled down to 750 Kbps just as the player starts fetching the next high-resolution segment: the player can keep that segment file open for quite a while before it gives up, so the packager's single deletion attempt fails.

Workaround: At the end of my packager startup script, I added this code:

```sh
# Fallback cleanup loop: every 10 seconds, delete any .m4s segment not
# modified in the last minute that the packager failed to remove itself.
# Errors (e.g. a file still held open) are suppressed; the next pass retries.
while true; do
  sleep 10
  find "${WWW_ROOT}/www" -mmin +1 -name "*.m4s" -delete >/dev/null 2>&1
done
```

Possible resolution strategy: keep a list of all segments generated, and each time a new segment is generated try to delete every segment outside the window, removing a list entry only when its deletion succeeds. This ensures multiple attempts for every segment outside the window. Perhaps raise an error after a significant number of deletion attempts fail (~10 or so), so the operator can address downstream apps that may be keeping the file open or locked.
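The retry-list strategy could be sketched as follows. This is a hypothetical illustration (the `SegmentCleaner` name and its methods are invented, and the actual packager is C++, not Python), not the real implementation:

```python
import os
from collections import OrderedDict

class SegmentCleaner:
    """Track segments that have left the live window and retry deletion."""

    def __init__(self, max_attempts=10):
        self.max_attempts = max_attempts
        self.pending = OrderedDict()  # path -> number of failed attempts

    def schedule(self, path):
        """Mark a segment file as outside the window, eligible for deletion."""
        self.pending.setdefault(path, 0)

    def on_new_segment(self):
        """Retry every pending deletion each time a new segment is generated."""
        for path in list(self.pending):
            try:
                os.remove(path)
                del self.pending[path]
            except FileNotFoundError:
                del self.pending[path]  # already gone; count as success
            except OSError:
                self.pending[path] += 1
                if self.pending[path] >= self.max_attempts:
                    # Surface an error so the operator can look for a
                    # downstream app holding the file open/locked.
                    print(f"warning: still cannot delete {path}")
```

A failed deletion stays in `pending` and is retried on every subsequent segment, so a transiently locked file is eventually removed instead of leaking.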

Originally posted by @BobCu in #509 (comment)


BobCu commented Dec 14, 2018

Rather than change the packager, adding the above workaround to the docs may be enough to close this issue.


kqyang commented Dec 14, 2018

Yes, we need to improve the packager to retry file deletion when it fails. I'll add it to the v2.4 milestone for now.

But it is not a good sign if every file deletion needs to be retried.

As mentioned in #509 (comment), if we know that the server needs to hold the file for some more time, it is better to give it more time, e.g. by increasing --preserved_segments_outside_live_window (a value of 6 corresponds to 1 minute when the segment duration is 10 seconds).
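To make the flag arithmetic explicit (assuming the flag simply counts whole segments preserved outside the live window):

```python
# Hypothetical sanity check of the flag semantics described above.
segment_duration_s = 10
preserved_segments_outside_live_window = 6

# Time a segment survives after leaving the live window before deletion.
preserved_time_s = segment_duration_s * preserved_segments_outside_live_window
assert preserved_time_s == 60  # 6 segments * 10 s = 1 minute
```

So if a downstream consumer may hold a segment open for up to N seconds, the flag should be at least N / segment_duration, rounded up.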

@kqyang kqyang added type: enhancement New feature or request and removed needs triage labels Dec 14, 2018
@kqyang kqyang added this to the v2.4 milestone Dec 14, 2018

BobCu commented Dec 15, 2018

Not "every", just the few that fail. But even a low percentage of failures will eventually consume all storage space. That's not a good sign!


kqyang commented Dec 15, 2018

Not "every", just the few that fail.

OK, that is much better. In any case, agreed that we'll need the retry logic so the files do not accumulate!

@kqyang kqyang self-assigned this Jan 8, 2019

BobCu commented Jan 8, 2019

I won't get to test the fix for a little while, but the code looks great.

Thanks!

@ohMoshko

Hello,
I am testing a case of a fading antenna signal.
The fix in 6d6db76 does retry deleting old segments after a failure; however, when shaka-packager is fed corrupted data (due to low signal power, for example) for a long enough period, old segments accumulate and the application eventually crashes. If, at any point before the crash, "good" data comes in, the function that removes old segments gets called and the old segments are removed.
It seems that the functions that delete old segments, RemoveOldSegment for HLS and RemoveSegments for DASH, don't get called while the data is corrupted, even though new segments are still being added. Is there a reason why those functions don't get called all the time? For example, in packager/hls/base/media_playlist.cc the call to RemoveOldSegment() is at the end of the SlideWindow() function. Why not call it separately, before calling SlideWindow(), to ensure the cleanup of old files?

I am using an attenuator to reproduce the issue. The attenuator receives ATSC data and sends it to an ATSC RF receiver. The data is then turned into IP streams, fed into shaka-packager, and eventually streamed out. I raise the level of the attenuator until the data is visibly affected (pixelation) and let it run.


kqyang commented Feb 26, 2019

@ohMoshko Interesting. I admit that we did not take data corruption into consideration when designing this solution.

Do you mind describing a little bit more on what is happening? For example, what does packager see for "bad" data? Valid TS streams with incorrect timestamps?

the application eventually crashes

What is causing the crash? Out of storage space? Out of memory?

Is there a reason why those functions don't get called all the time?

I do not know. One possibility is that the timestamp is corrupted, causing SlideWindow() to exit early: https://github.com/google/shaka-packager/blob/master/packager/hls/base/media_playlist.cc#L581; it is also possible that MediaPlaylist::SlideWindow() or MediaPlaylist::AddSegmentInfoEntry() is not invoked at all.

Why not call it separately before calling SlideWindow() to ensure the cleanup of old files?

The files to be deleted are stored in segments_to_be_removed_, which is only updated when SlideWindow() is called, so segments_to_be_removed_ does not grow if SlideWindow() is not called.
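A minimal model of the control flow being described (hypothetical Python that mirrors, but does not reproduce, media_playlist.cc) shows why calling the cleanup separately would not help: nothing enters segments_to_be_removed_ unless SlideWindow() gets past its early exit.

```python
class PlaylistModel:
    """Toy model: cleanup is gated on SlideWindow() passing its early exit."""

    def __init__(self, window_duration):
        self.window_duration = window_duration
        self.segments = []               # list of (start_time, duration)
        self.segments_to_be_removed = []
        self.removed = []                # stand-in for deleted files

    def add_segment(self, start, duration):
        self.segments.append((start, duration))
        self.slide_window()

    def slide_window(self):
        total = sum(d for _, d in self.segments)
        if total <= self.window_duration:
            # Early exit. With corrupted timestamps the computed play time
            # stays small, so the code below never runs and no segment is
            # ever queued for removal.
            return
        while total > self.window_duration and len(self.segments) > 1:
            start, dur = self.segments.pop(0)
            total -= dur
            self.segments_to_be_removed.append(start)
        self.remove_old_segments()

    def remove_old_segments(self):
        # The only place "files" get deleted; it drains a list that is
        # only populated inside slide_window().
        self.removed.extend(self.segments_to_be_removed)
        self.segments_to_be_removed.clear()
```

With good 10-second segments the oldest segment slides out and is removed; once durations collapse to ~0 (corrupted timestamps), every call takes the early exit and segments accumulate without bound, matching the behavior reported above.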


ohMoshko commented Feb 28, 2019

I am feeding the packager with different ATSC services. The services affected by the low signal power start accumulating .mp4 files (audio and video). A service's media directory grows to around 900 MB, and then the Linux OOM killer is invoked and kills my app that is running shaka-packager:
oom_killer_shaka_log.txt

I can confirm that when the signal is fading, MediaPlaylist::SlideWindow() and MediaPlaylist::AddSegmentInfoEntry() are invoked but the timestamp is probably corrupted causing SlideWindow() to exit early:
https://github.com/google/shaka-packager/blob/master/packager/hls/base/media_playlist.cc#L581
If the signal is "good" again, SlideWindow() does not exit at the above if statement and the RemoveOldSegments() function is called and the old segments are removed correctly.
So I think shaka sees the bad data as a valid TS stream with incorrect timestamps.
When the function exits early, current_play_time is not 0; it is usually between 1 and 14.
I analyzed the MPEG TS using TSDuck and observed large discontinuity counts for some of the PIDs (as expected).
If SlideWindow() continuously exits early, the media folder grows and grows, and eventually the OOM killer is invoked.

I can record the TS and send it to you if that would help.


kqyang commented Feb 28, 2019

@ohMoshko This is probably a different issue than the original issue in #533. Let's discuss in #563.

@shaka-bot shaka-bot added the status: archived Archived and locked; will not be updated label Mar 9, 2019
@shaka-project shaka-project locked and limited conversation to collaborators Mar 9, 2019