
Add LRU eviction mechanisms for streamed file chunks #11304

Merged
merged 9 commits into from
Oct 13, 2023

Conversation

rtibbles
Member

@rtibbles rtibbles commented Sep 26, 2023

Summary

  • Adds a utility to get stats for all ChunkedFile objects in the file system
  • Adds ability to evict a certain number of bytes of the least recently used files (but will always delete the entire ChunkedFile rather than trying to pick and choose chunks)
  • Fixes a small bug in the StorageCalculation object that was double counting incomplete_downloads
  • Adds additional logic to content request processing to attempt to evict files from the streamed file cache when we run out of space
  • Adds capability to the new utility class to limit the total size of files to a set maximum
  • Does a minor refactor to put all core scheduled tasks into a single plugin that always gets invoked (to prevent the Android app continually having to keep this up to date: https://github.com/learningequality/kolibri-installer-android/blob/develop/src/kolibri_tasks.py)
  • Adds a new option to limit the size of the streamed files cache, defaults to 500MB
  • Adds a scheduled task that runs daily to limit the files cache to this maximum
  • Adds a slight optimization to prevent os.walk from descending into the chunked file directories at first
  • Updates the size calculations to include the files from the diskcache folder
  • Updates the scheduled task to run hourly, rather than daily

References

Fixes #9389

Reviewer guidance

Any concerns about the new option? Should the default value be lower?

I haven't benchmarked the os.walk performance on a large installed system, so I'm not sure how slow it would get; there may be room for optimization there.

This also ignores the small diskcache that is used inside each ChunkedFile when counting how much space is used - should I include that as well?


Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@rtibbles
Member Author

Looks like my tests are assuming certain things about the file system (and I forgot to update the tests for the small tweak to server.py).

@rtibbles
Member Author

OK, hopefully I've fixed the test issues now - but I wouldn't bet against the Github Actions environment still doing something a little weird, possibly!

Member

@nucleogenesis nucleogenesis left a comment


This also ignores the small diskcache that is used inside each ChunkedFile for the purposes of counting how much is used - should I include this instead?

Does this mean we'll always be undercounting by some handful of bytes because the diskcache isn't accounted for? Looks like diskcache has a mechanism for eviction that defaults to "least recently stored", but maybe we can just set it to LRU and not have to worry about it beyond that?
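For illustration, the distinction drawn here - evicting by insertion order ("least recently stored") versus by access order (LRU) - can be sketched with a stdlib-only cache. TinyCache is a hypothetical illustration, not a diskcache or Kolibri API:

```python
from collections import OrderedDict


class TinyCache:
    """Size-capped cache supporting two eviction policies.

    policy="stored": evict the oldest insertion (least recently stored).
    policy="used": evict the least recently accessed entry (LRU).
    """

    def __init__(self, maxsize, policy="used"):
        self.maxsize = maxsize
        self.policy = policy
        self._data = OrderedDict()

    def get(self, key):
        value = self._data[key]
        if self.policy == "used":
            # Reads refresh recency only under LRU.
            self._data.move_to_end(key)
        return value

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict from the cold end
```

Under "stored", a read does not protect an entry from eviction; under "used", it does - which is why the policy choice matters for a streaming cache.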

@rtibbles
Member Author

Does this mean we'll always be undercounting by some handful of bytes because the diskcache isn't accounted for?

Yes - all we are using diskcache for is to cache a few things about the file: the total file size, and other relevant information we may have previously received in the HEAD request. We are also using it to coordinate download locks for chunks, so that multiple threads/processes can gain an exclusive lock to download a specific chunk of a file (and hence prevent duplication). Importantly, these diskcache instances are created on a per-ChunkedFile basis, so the eviction mechanisms aren't terribly useful to us for these purposes.
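A single-process sketch of that per-chunk locking idea (the real implementation uses per-ChunkedFile diskcache instances so locks also coordinate across processes; ChunkLocks and its methods are illustrative, not Kolibri APIs):

```python
import threading


class ChunkLocks:
    """One lock per chunk index, so concurrent downloaders of the same
    file never fetch the same chunk twice (illustrative sketch only)."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the lock registry
        self._locks = {}
        self.downloaded = set()

    def download_chunk(self, index, fetch):
        # Atomically get-or-create the lock for this chunk.
        with self._guard:
            lock = self._locks.setdefault(index, threading.Lock())
        with lock:
            if index in self.downloaded:
                return False  # another thread already fetched this chunk
            fetch(index)
            self.downloaded.add(index)
            return True
```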

I think I probably should just go and count the diskcache file size into the total size, as it will give us a more accurate count of how much space we are freeing up by evicting the file.

Include diskcache directory in file size counts.
@rtibbles
Member Author

Have updated this to include the diskcache in the overall file size.

Comment on lines +132 to +134
for dirpath, _, filenames in os.walk(chunked_file_dir):
    for file in filenames:
        file_path = os.path.join(dirpath, file)
Member


Curious how this is accounting for the diskcache differently?

Member Author


It's now doing an os.walk of the entire ChunkedFile dir (including the directory that diskcache is using) so it's enumerating all the files.

Previously, it was only listing the files in the top directory, which was all the chunk files, plus the directory used for diskcache (which it was ignoring because it was a directory).

Member

@jredrejo jredrejo left a comment


I've checked the code and tested it, and everything works correctly.
However, I think this implementation could be problematic on servers with a small disk (thinking for example of Raspberry Pi servers running from an SD card). The scheduled task to clean the cache might not be enough to reclaim all the space cached by users remotely browsing Studio.

I think we should either provide an option to disable this caching, or at least run the cleaning task more frequently. Given that the cache has no limit on how much it can grow on disk between cleanups, a 24-hour interval could be too long in some cases.
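The cleanup the scheduled task performs (cap the cache at a maximum, evicting least recently used entries first) can be sketched as follows; limit_cache_size is a hypothetical helper, not the Kolibri task itself:

```python
import shutil


def limit_cache_size(stats, max_bytes):
    """Evict least-recently-used entries until the total fits in max_bytes.

    ``stats`` is a list of (path, size, last_access) tuples, e.g. from a
    directory scan; returns the paths that were evicted.
    """
    total = sum(size for _, size, _ in stats)
    evicted = []
    # Oldest access time first: these are the first candidates to go.
    for path, size, _ in sorted(stats, key=lambda s: s[2]):
        if total <= max_bytes:
            break
        shutil.rmtree(path, ignore_errors=True)
        total -= size
        evicted.append(path)
    return evicted
```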

@rtibbles
Member Author

Have updated now to run the task hourly.

@rtibbles rtibbles merged commit ef9a11b into learningequality:release-v0.16.x Oct 13, 2023
@rtibbles rtibbles deleted the stale_chunks branch October 13, 2023 14:34
Labels
DEV: backend (Python, databases, networking, filesystem...) · SIZE: medium · TODO: needs review (waiting for review)
Development

Successfully merging this pull request may close these issues.

Implement an LRU cache mechanism for cached remotely streamed content
3 participants