Use file indexer directly when scanning with file source #3333

adammcclenaghan · 2024-10-15T11:03:26Z

Description

This PR alters file_source.go to use a new file indexer, rather than the existing directory indexer.

Currently, when scanning a non-archive file, file_source.go applies a filter function to the directory indexer such that all files other than the file being scanned and its parent directory are ignored by the directory indexer. See here.

This approach becomes problematic when the scanned file is inside a directory with a large number of files, for two reasons:

Go’s filepath.Walk provides lexical ordering guarantees by reading the entire directory into memory. For sufficiently large containing directories, scanning a single file takes many GB of memory.
The total time to scan the single file also increases wrt the number of files in the containing directory, due to the directory walk itself and the time taken to perform the associated memory allocation.

This pprof profile shows heap allocation when scanning a file within a directory containing a large number of files. I’m including it here as evidence for my root cause analysis.

Walking all of the files in the containing directory is redundant when using a file source, since as mentioned above the filter function will ignore everything other than the scanned file and its parent dir.

In this change, I have added a new file indexer which should match the existing behaviour of the directory indexer for a single file source. However, instead of walking the file system, it simply makes an attempt to index the containing directory and the file target.

I have also added file.go to satisfy the resolver interface when using the file indexer. Much of the functionality matches that of directory.go, so there's a bit of duplicated code. I would appreciate any suggestions for improvement here.

The existing directory.go has many unit tests to verify behaviour in the event that the directory being walked contains symlinks etc. I have attempted to simplify the unit tests for file.go as it does not have to handle all of the complexity that directory.go does, but I would really appreciate extra review attention in this area as I may not be aware of all the ways a target for file_source may be defined.

I haven’t got a pprof diagram for the new approach, but memstat profiling has shown O(1) heap use wrt the number of files in the containing directory when using file source, as expected.
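The exact measurement used is not shown here. One way such a check could be done, assuming "memstat profiling" refers to Go's `runtime.MemStats`, is to compare `TotalAlloc` before and after the operation. The helper name `allocatedDuring` is hypothetical.

```go
package main

import "runtime"

// allocatedDuring reports the cumulative number of heap bytes allocated
// while fn runs, based on runtime.MemStats.TotalAlloc. Allocations made by
// other goroutines during fn are included, so treat the result as an
// upper-bound estimate rather than an exact figure.
func allocatedDuring(fn func()) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	fn()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}
```

Running the same scan against containing directories of increasing size and comparing the reported values would show whether heap use stays flat.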

Creating a resolver via a file_source now also takes O(1) time wrt the number of files in the containing directory.

Type of change

Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist:

I have added unit tests that cover changed behavior
I have tested my code in common scenarios and confirmed there are no regressions
I have added comments to my code, particularly in hard-to-understand sections

adammcclenaghan · 2024-11-08T12:16:41Z

Hey @wagoodman 👋

I wanted to check in to see if there's anything else you think we need for this change, or if you're otherwise happy with the latest set of changes I've made wrt your last review.

wagoodman

awesome work -- thanks for taking the time profiling and contributing the fix 🙌

Use file indexer when scanning with file source

987578e

Prevents filesystem walks when scanning a single file, to optimise memory & scan times in case the scanned file lives in a directory containing many files. Signed-off-by: adammcclenaghan <adam@mcclenaghan.co.uk>

wagoodman reviewed Oct 15, 2024

syft/internal/fileresolver/file.go (outdated; thread resolved)

Create filetree resolver

038ecae

Shared behaviour for resolving indexed filetrees. Signed-off-by: adammcclenaghan <adam@mcclenaghan.co.uk>

wagoodman added the performance and bug labels Nov 22, 2024

wagoodman changed the title ~~Performance: Use file indexer when scanning with file source~~ Use file indexer directly when scanning with file source Nov 22, 2024

wagoodman approved these changes Nov 22, 2024

wagoodman merged commit 21df387 into anchore:main Nov 22, 2024
12 checks passed

BrewTestBot mentioned this pull request Dec 9, 2024

syft 1.18.0 Homebrew/homebrew-core#200610

Merged

crazy-max mentioned this pull request Dec 10, 2024

vendor: update github.com/anchore/syft to v1.18.0 docker/buildkit-syft-scanner#116

Closed
