Description
Today the dotnet PE cataloger reads the entire binary into memory to make it available to the saferwall/pe library for parsing. This is problematic since it can result in a lot of unnecessary allocations.
Take, for example, scanning the dotnet SDK with dependencies downloaded (~14,000 DLLs); here's the total allocated memory (`alloc_space` in pprof) by today's cataloger:

This PR adds a PE parser that is only concerned with gathering the version resources used to determine the artifact's name and version. Since this ultimately reads far fewer bytes through the underlying file resolver, we allocate much less memory overall:
In this case, this avoids about 48 GB of memory allocations overall 🎉.
In terms of wall-clock time, the original scan took ~80 seconds; with the adjustments on this branch it takes ~35 seconds 🎉.
This has no impact on unreleased memory (`inuse_space` in pprof), since in both cases any analyzed material is still released, and both approaches result in roughly the same number of packages.

One caveat is that I'm still seeing a difference of ~20 packages out of the ~14,000 tested, so there is still something to work out / debug on this branch.
An alternative considered would be to plumb mmap capabilities from stereoscope through to syft, and to enhance the directory resolver similarly (mmap files read via that resolver). This is promising but more impactful and will take more planning. PE is a well-known format, and for our purposes we only need to parse a small portion of it to gather the raw version information.
Type of change
Checklist:
TODO: