Use file indexer directly when scanning with file source #3333

Merged
merged 2 commits into from
Nov 22, 2024

adammcclenaghan
Contributor

Description

This PR alters file_source.go to use a new file indexer, rather than the existing directory indexer.

Currently, when scanning a non-archive file, file_source.go applies a filter function to the directory indexer such that all files other than the file being scanned and its parent directory are ignored by the directory indexer. See here.

This approach becomes problematic when the scanned file is inside a directory with a large number of files, for two reasons:

  1. Go’s filepath.Walk guarantees lexical ordering by reading the names of every entry in a directory into memory and sorting them. For sufficiently large containing directories, scanning a single file takes many GBs of memory.
  2. The total time to scan the single file also grows with the number of files in the containing directory, due to the directory walk and the time spent on memory allocation.

This pprof heap profile was captured while scanning a file within a directory containing a large number of files. I’m including it here as evidence for my root cause analysis:
[Screenshot: pprof heap profile, 2024-10-15]

Walking all of the files in the containing directory is redundant when using a file source, since as mentioned above the filter function will ignore everything other than the scanned file and its parent dir.

In this change, I have added a new file indexer which should match the existing behaviour of the directory indexer for a single file source. However, instead of walking the file system, it simply makes an attempt to index the containing directory and the file target.

I have also added file.go to satisfy the resolver interface when using the file indexer. Much of the functionality matches that of directory.go, and I would welcome any suggestions for improvement here, as I appreciate there's a bit of duplicated code.

The existing directory.go has many unit tests to verify behaviour in the event that the directory being walked contains symlinks etc. I have attempted to simplify the unit tests for file.go as it does not have to handle all of the complexity that directory.go does, but I would really appreciate extra review attention in this area as I may not be aware of all the ways a target for file_source may be defined.

I haven’t got a pprof diagram for the new approach, but memstat profiling shows the expected O(1) heap use wrt the number of files in the containing directory when using a file source.

Creating a resolver via file_source now also takes O(1) time wrt the number of files in the containing directory.

Type of change

  • Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

Prevents filesystem walks when scanning a single file, to
optimise memory & scan times in case the scanned file
lives in a directory containing many files.

Signed-off-by: adammcclenaghan <adam@mcclenaghan.co.uk>

Shared behaviour for resolving indexed filetrees.

Signed-off-by: adammcclenaghan <adam@mcclenaghan.co.uk>
@adammcclenaghan
Contributor Author

Hey @wagoodman 👋

I wanted to check in to see if there's anything else you think we need for this change, or if you're otherwise happy with the latest set of changes I've made wrt your last review

@wagoodman added the performance and bug (Something isn't working) labels on Nov 22, 2024
@wagoodman changed the title from "Performance: Use file indexer when scanning with file source" to "Use file indexer directly when scanning with file source" on Nov 22, 2024
Contributor

@wagoodman left a comment

awesome work -- thanks for taking the time profiling and contributing the fix 🙌

@wagoodman merged commit 21df387 into anchore:main on Nov 22, 2024
12 checks passed

2 participants