More performant dotnet PE parser #3563

wagoodman · 2025-01-03T19:36:55Z

Description

Today the dotnet PE cataloger reads the entire binary into memory to make it available to the saferwall/pe lib for parsing. This is problematic because it can result in a large number of unnecessary allocations.

Take, for example, scanning the dotnet SDK with its dependencies downloaded (~14,000 DLLs); here's the total memory allocated (alloc_space in pprof) by today's cataloger:

This PR adds a PE parser that is only concerned with gathering the version resources used to determine the artifact's name and version. Since this ultimately reads far fewer bytes through the underlying file resolver, we allocate much less memory overall:

In this case, about 48 GB less memory is allocated overall 🎉.

In terms of wall clock time, the original scan took ~80 seconds; with the adjustments on this branch it takes ~35 seconds 🎉.

This has no impact on unreleased memory (inuse_space in pprof), since in both cases any analyzed material is still released. Both approaches result in roughly the same number of packages.

One caveat is that I'm still seeing a difference of ~20 packages out of the ~14,000 tested, so there is still something to work out / debug on this branch.

An alternative considered would be to plumb mmap capabilities from stereoscope through to syft, and to enhance the directory resolver similarly (mmap files read via that resolver). This is promising, but it is more impactful and will take more planning. PE is a well-known format, and for our purposes we only need to parse a small portion of it to gather the raw version information.

Partially addresses "Syft scan in offline mode is slow" (#3455)

Type of change

Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist:

I have added unit tests that cover changed behavior
I have tested my code in common scenarios and confirmed there are no regressions
I have added comments to my code, particularly in hard-to-understand sections

TODO:

Add tests
Find out why scanning dotnet sdk misses 20 packages

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

[wip] prototype PE binary parser

65ade9d

