
Support to stack multiple Syft SBOM files into a single one #617

Open
hectorj2f opened this issue Nov 5, 2021 · 18 comments
Labels
blocked (Progress is being stopped by something), enhancement (New feature or request), multiple-sources (Issues that are dependent on supporting multiple sources)

Comments

@hectorj2f
Contributor

What would you like to be added:
CycloneDX supports merging multiple SBOM files into a single one. However Syft SBOM does not support merging multiple Syft SBOM files. One way could be done by manual merging the artifacts however that is limited to SBOM files whose distro is the same for all the different files.

We would like Syft SBOMs to be mergeable into a single document without losing accuracy when identifying vulnerabilities.

Why is this needed:

We generate a bundle and store it in a registry, but the artifacts linked to that bundle are composed of multiple SBOMs, one from each component that forms the bundle. When using CycloneDX, those SBOM files can be merged into one without losing accuracy in vulnerability identification, because the purl has the arch (or distro) injected into its value. We would like a solution that satisfies the same scenario with the Syft SBOM format.

Additional context:

@hectorj2f hectorj2f added the enhancement New feature or request label Nov 5, 2021
@hectorj2f hectorj2f changed the title Support for stacks of Syft SBOM files merged into a single one Support to stack multiple Syft SBOM files into a single one Nov 6, 2021
@luhring
Contributor

luhring commented Nov 7, 2021

Would JSON Lines work for this situation? The idea would be to keep the change pretty simple, and support processing concatenated instances of the existing Syft JSON output.
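To illustrate the idea, here is a minimal sketch (in Python, purely to show the consuming side; not an implementation proposal) of parsing concatenated JSON documents, which handles both newline-delimited JSON Lines and back-to-back concatenation:

```python
import json

def parse_concatenated_sboms(text: str):
    """Parse zero or more concatenated JSON documents (JSONL-compatible).

    Returns a list of decoded SBOM dicts, whether the documents are
    newline-delimited or simply concatenated back to back.
    """
    decoder = json.JSONDecoder()
    docs, idx = [], 0
    text = text.lstrip()
    while idx < len(text):
        doc, end = decoder.raw_decode(text, idx)
        docs.append(doc)
        # skip any whitespace between documents
        while end < len(text) and text[end].isspace():
            end += 1
        idx = end
    return docs

# Two syft-style documents concatenated on separate lines (field names illustrative)
stream = '{"artifacts": [{"name": "a"}]}\n{"artifacts": [{"name": "b"}]}\n'
sboms = parse_concatenated_sboms(stream)
```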

One workaround is to merge the artifacts manually, but that only works when all of the SBOM files share the same distro.

❤️ I like the idea of applying this constraint, especially to the first iteration.

Would we have to support differing schema versions among the received JSON objects? Perhaps a constraint similar to the above could be applied here (i.e. require homogeneous JSON objects).

Would Syft's existing explicit sbom: scheme be how users also specify this new merged SBOM file format? Or should a new scheme be introduced? In a somewhat analogous situation, the jq tool requires the user's command to be adapted to handle JSON Lines by adding the --slurp flag. I'm not proposing that Syft add a CLI flag for this, but we should think about how broadly sbom: can be used to set Syft's expectations about input format.

@wagoodman
Contributor

wagoodman commented Nov 8, 2021

@hectorj2f I 100% agree with the observations on the distro section and how we could leverage reading of pURLs here to help (there could be a lot of overlap with anchore/grype#481 (comment) here). This helps grype match with pURLs generically on each package, and further supports input from other SBOM tools in the future.

I think source is the last field that is a problem (both on input SBOMs and output grype JSON documents).

  1. Do we require a single input document to be a "merge" of multiple documents? Or can it be a concatenation of multiple documents? (This is essentially the same question @luhring raised above about JSONL being a possible direction.)
  2. Do we want to preserve multiple source input fields? Or is it OK for this to be lossy?
  3. Do we want the output to be able to reflect the matches that belong to specific sources? Or more generically, if we have an input that represents multiple inputs, do we want a "merged" output? Or multiple separated outputs?

My take on possible answers here:

  1. Could we support both? One observation for the case of merging documents into one: the input the caller crafts already takes on the burden of being lossy (where there may be no source section, or only one of those values is kept).
  2. I would prefer that grype not be the decision maker for which input fields are lossy when it comes to the source section. This implies possibly changing the grype output JSON shape to preserve those fields (or grouping together multiple grype output JSON documents).
  3. I don't have enough information to answer here beyond a "what might be easier for a consumer" perspective. It is always easier to deal with a single document than with multiple output documents. However, in the case of JSONL as input, I can also see it being reasonable to output as many documents as were accepted as input and have that be "expected".

@spiffcs
Contributor

spiffcs commented Nov 8, 2021

If we're getting into the business of merging .syft.json SBOMs, I have a question regarding our new ID generation process.

Example:

Here is the stereoscope artifact as generated in the grype SBOM:

  {
   "id": "17857980119146230574",
   "name": "github.com/anchore/stereoscope",
   "version": "v0.0.0-20211024152658-003132a67c10",
   "type": "go-module",
   "foundBy": "",
   "locations": [
    {
     "path": "/grype"
    }
   ],
   "licenses": [],
   "language": "go",
   "cpes": [
    "cpe:2.3:a:anchore:stereoscope:v0.0.0-20211024152658-003132a67c10:*:*:*:*:*:*:*"
   ],
   "purl": "pkg:golang/github.com/anchore/stereoscope@v0.0.0-20211024152658-003132a67c10",
   "metadataType": "GolangBinMetadata",
   "metadata": {
    "GoCompiledVersion": "go1.16.9",
    "H1Digest": "h1:BmK/CgNlu+X9foWK2ZAIehxzYws760AZSGVNamQZpiw="
   }
  },

Here is the same artifact as generated in the syft SBOM:

  {
   "id": "9509475418339205315",
   "name": "github.com/anchore/stereoscope",
   "version": "v0.0.0-20211024152658-003132a67c10",
   "type": "go-module",
   "foundBy": "",
   "locations": [
    {
     "path": "/syft"
    }
   ],
   "licenses": [],
   "language": "go",
   "cpes": [
    "cpe:2.3:a:anchore:stereoscope:v0.0.0-20211024152658-003132a67c10:*:*:*:*:*:*:*"
   ],
   "purl": "pkg:golang/github.com/anchore/stereoscope@v0.0.0-20211024152658-003132a67c10",
   "metadataType": "GolangBinMetadata",
   "metadata": {
    "GoCompiledVersion": "go1.16.9",
    "H1Digest": "h1:BmK/CgNlu+X9foWK2ZAIehxzYws760AZSGVNamQZpiw="
   }
  },

The IDs are different in this case because the locations field contains different paths for where the artifact was discovered.

In the merged SBOM, do we want to try to account for these being basically identical and merge the locations field under a single artifact, or do we want to keep things totally separate?

This also adds more weight to the mergeByLocations functionality we considered adding in a previous PR, where an Artifact can be discovered at multiple locations, making the merged document much more efficient to consume.

#595

Or... more generally. How could we merge artifacts that have the same ID?
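One possible answer, sketched below: key on the identity-bearing fields (name, version, purl) instead of the location-sensitive id, and union the locations lists. The field names follow the syft JSON shown above, but the choice of identity key is an assumption, not settled behavior:

```python
def merge_artifacts(artifacts):
    """Merge artifacts that differ only in where they were found.

    Artifacts with the same (name, version, purl) are collapsed into one
    entry whose locations list is the union of all observed locations.
    """
    merged = {}
    for a in artifacts:
        key = (a["name"], a["version"], a["purl"])
        if key not in merged:
            merged[key] = {**a, "locations": list(a["locations"])}
        else:
            existing = merged[key]["locations"]
            for loc in a["locations"]:
                if loc not in existing:
                    existing.append(loc)
    return list(merged.values())

# The two stereoscope entries from the comment above, abbreviated
grype_pkg = {"id": "17857980119146230574", "name": "github.com/anchore/stereoscope",
             "version": "v0.0.0-20211024152658-003132a67c10",
             "purl": "pkg:golang/github.com/anchore/stereoscope@v0.0.0-20211024152658-003132a67c10",
             "locations": [{"path": "/grype"}]}
syft_pkg = {**grype_pkg, "id": "9509475418339205315", "locations": [{"path": "/syft"}]}
result = merge_artifacts([grype_pkg, syft_pkg])
```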

@spiffcs
Contributor

spiffcs commented Nov 8, 2021

Also just validated: removing the meta top-level fields from *.syft.json still produces a usable JSON structure.

grype only needs:

{
  "artifacts": []
}

Valid input to grype: grype test.syft.json

// test.syft.json
{
 "artifacts": [
  {
   "id": "11941904915510831158",
   "name": "github.com/docker/cli",
   "version": "v0.0.0-20191017083524-a8ff7f821017",
   "type": "go-module",
   "foundBy": "",
   "locations": [
    {
     "path": "/syft",
     "layerID": "sha256:f935f03ffddb44eccea2457b11b6ffc04a336a76a57648dc78518b79b1c523b4"
    }
   ],
   "licenses": [],
   "language": "go",
   "cpes": [
    "cpe:2.3:a:docker:cli:v0.0.0-20191017083524-a8ff7f821017:*:*:*:*:*:*:*"
   ],
   "purl": "pkg:golang/github.com/docker/cli@v0.0.0-20191017083524-a8ff7f821017",
   "metadataType": "GolangBinMetadata",
   "metadata": {
    "GoCompiledVersion": "go1.16.9",
    "H1Digest": "h1:2HQmlpI3yI9deH18Q6xiSOIjXD4sLI55Y/gfpa8/558="
   }
  },
  {
   "id": "3512196908125928147",
   "name": "github.com/docker/distribution",
   "version": "v2.7.1+incompatible",
   "type": "go-module",
   "foundBy": "",
   "locations": [
    {
     "path": "/syft",
     "layerID": "sha256:f935f03ffddb44eccea2457b11b6ffc04a336a76a57648dc78518b79b1c523b4"
    }
   ],
   "licenses": [],
   "language": "go",
   "cpes": [
    "cpe:2.3:a:docker:distribution:v2.7.1+incompatible:*:*:*:*:*:*:*"
   ],
   "purl": "pkg:golang/github.com/docker/distribution@v2.7.1+incompatible",
   "metadataType": "GolangBinMetadata",
   "metadata": {
    "GoCompiledVersion": "go1.16.9",
    "H1Digest": "h1:a5mlkVzth6W5A4fOsS3D2EO5BUmsJpcB+cRlLU7cSug="
   }
  },
  {
   "id": "4268275803743278876",
   "name": "github.com/docker/docker",
   "version": "v17.12.0-ce-rc1.0.20200309214505-aa6a9891b09c+incompatible",
   "type": "go-module",
   "foundBy": "",
   "locations": [
    {
     "path": "/syft",
     "layerID": "sha256:f935f03ffddb44eccea2457b11b6ffc04a336a76a57648dc78518b79b1c523b4"
    }
   ],
   "licenses": [],
   "language": "go",
   "cpes": [
    "cpe:2.3:a:docker:docker:v17.12.0-ce-rc1.0.20200309214505-aa6a9891b09c+incompatible:*:*:*:*:*:*:*"
   ],
   "purl": "pkg:golang/github.com/docker/docker@v17.12.0-ce-rc1.0.20200309214505-aa6a9891b09c+incompatible",
   "metadataType": "GolangBinMetadata",
   "metadata": {
    "GoCompiledVersion": "go1.16.9",
    "H1Digest": "h1:G2hY8RD7jB9QaSmcb8mYEIg8QbEvVAB7se8+lXHZHfg="
   }
  }
 ]
}

I'll follow up here with a PR so the JSON presenter can parse this as well. Currently it only works for table output.
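The stripping step described above takes only a couple of lines; a sketch (the wrapper field names mirror the syft JSON shown here, and the input dict is illustrative):

```python
def strip_to_artifacts(sbom: dict) -> dict:
    """Reduce a full syft JSON document to the minimal shape grype reads."""
    return {"artifacts": sbom.get("artifacts", [])}

# An abbreviated, illustrative syft document with the meta top-level fields present
full = {
    "schema": {"version": "1.1.0"},
    "source": {"type": "image"},
    "artifacts": [{"name": "github.com/docker/cli"}],
}
minimal = strip_to_artifacts(full)
```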

@luhring
Contributor

luhring commented Nov 8, 2021

In the merged SBOM, do we want to try to account for these being basically identical and merge the locations field under a single artifact, or do we want to keep things totally separate?

Are we asking this from the perspective of Grype? Does Grype have a need to see a given package only a single time? My two cents would be not to introduce any deduplication — particularly in a first iteration — unless we need to.

@luhring
Contributor

luhring commented Nov 8, 2021

Open question: Should this issue move to anchore/grype?

I know the implementation might involve Syft's decoders, but it seems like a Grype feature that we're talking about from the user's perspective. Is that a correct read on this?

@spiffcs
Contributor

spiffcs commented Nov 9, 2021

Another question I found: is schema a required field for merge, or can/should we update the validator code so that, if schema is not detected, we still select off the existing artifacts field?

https://github.com/anchore/syft/blob/main/internal/formats/syftjson/validator.go

This would make the JSON API a little simpler, since both syft and grype would then need just artifacts: []

We're at 1.1.0, so until we decide to move to 2.0.0, all changes should be backward compatible.
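The relaxed check being suggested might look roughly like this (a sketch in Python for brevity; the actual validator linked above is Go, and its real rules may differ):

```python
def looks_like_syft_json(doc: dict) -> bool:
    """Accept a document when it carries a schema marker OR an artifacts list.

    This relaxes a schema-required check so that stripped-down
    {"artifacts": [...]} documents still validate.
    """
    schema = doc.get("schema")
    if isinstance(schema, dict) and "version" in schema:
        return True
    return isinstance(doc.get("artifacts"), list)
```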

@hectorj2f
Contributor Author

I know the implementation might involve Syft's decoders, but it seems like a Grype feature that we're talking about from the user's perspective. Is that a correct read on this?

In my opinion, it wouldn't be bad to have a CLI command to merge or append multiple SBOM files into one.

@Dentrax

Dentrax commented Mar 28, 2022

We (@developer-guy) have been trying to figure out how we can merge multiple SBOM results into a single one, and finally ended up here.

We're generating two SBOMs by issuing the following commands:

$ syft packages dir:"$SCAN_DIRECTORY" -o cyclonedx-json > sbom-results-build-time.json
$ syft packages docker:"$DOCKER_IMAGE" -o cyclonedx-json > sbom-results-container-image.json

We found the following workaround:

      jq -s 'def deepmerge(a;b):
      reduce b[] as $item (a;
        reduce ($item | keys_unsorted[]) as $key (.;
          $item[$key] as $val | ($val | type) as $type | .[$key] = if ($type == "object") then
            deepmerge({}; [if .[$key] == null then {} else .[$key] end, $val])
          elif ($type == "array") then
            (.[$key] + $val | unique)
          else
            $val
          end)
        );
      deepmerge({}; .)' sbom-results-build-time.json sbom-results-container-image.json > sbom-results-merged.json

Ref: https://e.printstacktrace.blog/merging-json-files-recursively-in-the-command-line

It just works!

It would be nice to have a merge subcommand, for example:

$ syft merge sbom-results-build-time.json sbom-results-container-image.json -o sbom-results-merged.json

We assumed here that both SBOMs are the same type. In case one of them is different, we should probably convert it first: #563

cc @luhring

@spiffcs
Contributor

spiffcs commented Mar 30, 2022

@Dentrax thanks so much for the follow-up here with the CLI version of merging SBOMs.

I like the suggestion of syft taking in two files and doing the merge itself.

Is there another interaction we want to explore here, where syft generates the SBOMs from the artifacts and merges them in the same command?

syft --merge -o json build-time container-image > sbom-results-merged.json

@spiffcs spiffcs added this to OSS May 31, 2022
@spiffcs spiffcs moved this to Triage (Comments or Progress Made) in OSS May 31, 2022
@Dentrax

Dentrax commented Jun 16, 2022

How should we proceed here? There is another way to handle this: accept multiple args in the packages subcommand instead of introducing a new merge command:

$ syft packages dir:"$SCAN_DIRECTORY" docker:"$DOCKER_IMAGE" ... -o cyclonedx-json

A merge subcommand would also be nice to have:

$ syft merge foo.json bar.json baz.json ... -o json > merged.json

@Dentrax

Dentrax commented Jun 22, 2022

Kind ping 🤞 @luhring @spiffcs

@wagoodman
Contributor

wagoodman commented Jun 22, 2022

...pong!

[is] there another way to handle this by accepting multiple args in the packages subcommand instead of introducing a new merge command?

This is something we're interested in doing (see #562).

The issue isn't really how to scan multiple targets at a time, or how to specify that on the CLI; this is the easy part. The real problem is: how should an SBOM describe multiple sources? Today we have a singular source block that describes what was cataloged, such that all other parts of the SBOM can be assumed to have been found within what's described in the source block.

If we scan multiple targets then the SBOM will need to additionally convey:
a. all sources scanned (maybe change source to a list?)
b. which packages/files belong to which sources

Item b can be done in a few ways:

  1. keep all artifacts (packages, files, etc.) from all sources in a single container, and add individual package-to-source relationships to track which artifact came from which source

Pros:

  • artifacts and sources are highly decoupled

Cons:

  • this will add a considerable number of relations to even "simple" SBOMs
  • consumers wanting to ask questions about packages from source X would need to sift through all of the relationship data to answer even simple questions, which seems really complicated (and worse, that complication is pushed onto the consumer... ideally syft would handle this if possible)
  2. partition artifacts based on which source they are from.

Pros:

  • scales better than relationships

Cons:

  • cannot easily ask questions about artifacts across source boundaries. I feel this is a common use case for merged SBOMs... or to put it another way, if we are separating SBOMs out logically by which document they reside in, then what's the difference between this option and tarring up multiple documents?
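To make the relationship-based option (1) concrete, a merged document might carry a sources list plus package-to-source relationships; a hypothetical sketch of the shape and a lookup helper (every field name and value here is invented for illustration, not a proposed schema):

```python
# Hypothetical merged-SBOM shape: source becomes a list, and relationships
# record which source each artifact was found in.
merged = {
    "sources": [
        {"id": "src-1", "type": "directory", "target": "/build"},
        {"id": "src-2", "type": "image", "target": "app:latest"},
    ],
    "artifacts": [
        {"id": "pkg-a", "name": "libfoo"},
        {"id": "pkg-b", "name": "libbar"},
    ],
    "relationships": [
        {"parent": "src-1", "child": "pkg-a", "type": "contains"},
        {"parent": "src-2", "child": "pkg-a", "type": "contains"},
        {"parent": "src-2", "child": "pkg-b", "type": "contains"},
    ],
}

def packages_for_source(doc: dict, source_id: str):
    """Answer 'which packages came from source X' by walking relationships,
    illustrating the per-query sifting cost described in the cons above."""
    child_ids = {r["child"] for r in doc["relationships"]
                 if r["parent"] == source_id and r["type"] == "contains"}
    return [a for a in doc["artifacts"] if a["id"] in child_ids]

pkgs = packages_for_source(merged, "src-2")
```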

Open questions:

  • how do we support SPDX/CycloneDX in these cases? Potentially output multiple files (which I feel would defeat the point)?

Previous questions that have been answered by this point:

  • at one point, when addressing this for OS packages that depend on distro information, there wasn't a path to scan multiple targets and still retain the correct distro info for the set of packages it would affect during vulnerability matching in grype (e.g. the distro version, and which namespace to look up for RHEL packages). We've since solved this by including that extra information in the pURL, and grype extracts it from pURLs while doing vulnerability matching.
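For reference, that distro/arch information lives in pURL qualifiers (the query component of the package URL); a rough sketch of extracting it, with an illustrative apk pURL:

```python
from urllib.parse import parse_qs, urlparse

def purl_qualifiers(purl: str) -> dict:
    """Extract qualifier key/value pairs from a package URL.

    A purl has the shape pkg:type/namespace/name@version?qualifiers#subpath,
    so the qualifiers sit in the query component.
    """
    return {k: v[0] for k, v in parse_qs(urlparse(purl).query).items()}

# Illustrative apk purl carrying arch/distro qualifiers
purl = "pkg:apk/alpine/musl@1.2.2-r3?arch=x86_64&distro=alpine-3.14.2"
quals = purl_qualifiers(purl)
```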

I think we need to answer how the format will address these problems / use cases before diving into how the command will look / feel.

@kzantow
Contributor

kzantow commented Aug 30, 2022

FYI -- we are actively discussing this and for anyone interested it would be a great topic for the next community meeting, which will be Sep. 1 at 12 noon ET: https://calendar.google.com/calendar/u/0/r?cid=Y182OTM4dGt0MjRtajI0NnNzOThiaGtnM29qNEBncm91cC5jYWxlbmRhci5nb29nbGUuY29t

@Dentrax

Dentrax commented Sep 30, 2022

Kind ping 🤞

@kzantow
Contributor

kzantow commented Sep 30, 2022

@Dentrax we are actively working on this; it's a complicated change and will need support implemented in Grype, too, before we get everything merged 👍

@kzantow kzantow assigned kzantow and unassigned spiffcs Nov 8, 2022
@kzantow kzantow moved this from Parking Lot (Comments or Progress) to In Progress (Actively Resolving) in OSS Nov 8, 2022
@kzantow
Contributor

kzantow commented Dec 6, 2022

An update here: we are currently putting this on hold, as a number of discussions have highlighted what makes this particularly challenging to implement properly. We plan to revisit this in the future when some of the roadblocks are lifted.

@kzantow kzantow moved this from In Progress (Actively Resolving) to Backlog (Pulled Forward for Priority) in OSS Dec 6, 2022
@kzantow kzantow removed their assignment Dec 6, 2022
@kzantow kzantow added the multiple-sources Issues that are dependent on supporting multiple sources label Dec 6, 2022
@spiffcs spiffcs moved this from Backlog (Pulled Forward for Priority) to Parking Lot (Comments or Progress) in OSS Dec 13, 2022
@tgerla tgerla removed the status in OSS Jan 31, 2023
@tgerla tgerla moved this to Backlog in OSS Feb 2, 2023
@tgerla tgerla removed the status in OSS Feb 2, 2023
@spiffcs spiffcs added the blocked Progress is being stopped by something label Feb 9, 2023
@spiffcs
Contributor

spiffcs commented Feb 9, 2023

Just a small update on this issue: we've marked this as blocked, since there is not a clear way forward on the representation of multiple SBOMs across all the formats syft supports.

There is currently some work being done on representing SBOM connections via relationships that are surfaced when running the SBOM cataloger. There is a clear way forward to represent this internally with the syft SBOM format, but larger questions remain about how to do this in the different standard formats, SPDX and CycloneDX.

If the SBOMs all have the same source node, there is a straightforward path for all three formats, given that we can associate all packages with the same source node via the correct relationships.

The path forward via the SBOM cataloger currently combines all packages with NO source relationships, which we believe is not the desired end state of how the data should be represented, nor what is being asked for in this thread.

I'm dropping this into the backlog of our team's board while we work through the details surrounding the graph representation of multiple sources across multiple SBOMs.
