Support describing license properties and SPDX expression assertions #1577

Closed
spiffcs opened this issue Feb 16, 2023 · 3 comments · Fixed by #1743
Labels: enhancement (New feature or request), license (relating to software licensing)

Comments

@spiffcs
Contributor

spiffcs commented Feb 16, 2023

Syft License Revamp

Syft currently represents licenses as different data types depending on where they appear in the schema:

AlpmMetadata: string (required)
ApkMetadata: string (required)
GemMetadata: []string (optional)
NpmPackageJSONMetadata: []string (required)
Package: []string (required)
PhpComposerJSONMetadata: []string (optional)
PythonPackageMetadata: string (required)
RpmMetadata: string (required)

Specifically, the Package []string construct has proven limited in how license data can be represented to a user interested in license compliance. Many packages now use SPDX license identifiers to communicate FOSS license information. These identifiers are currently incompatible with how we represent licenses, given the complex nature of some of the expressions. Example:

// SPDX-License-Identifier: Apache-2.0 AND (MIT OR GPL-2.0-only)

NOTE FROM COMMUNITY MEET:

  • String will not be deprecated, but possibly the AST (abstract syntax tree) will be the preferred representation

The above shows a case where the consumer of the software must comply with Apache-2.0 and one of the following: MIT or GPL-2.0-only.

The file is subject to both the Apache-2.0 license, and at the licensee’s choice either the MIT license or version 2.0 only of the GPL.
The licensee may choose between MIT and GPL-2.0.
Whichever they choose, they must comply with both that license and Apache-2.0.
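
To make these semantics concrete, here is a minimal Go sketch that enumerates the license combinations satisfying the expression above. The `expr` type and the `alternatives` function are hypothetical names used only for illustration; they are not part of syft or of any SPDX library.

```go
package main

import "fmt"

// expr is a hypothetical expression node: either a single license ID (a leaf)
// or an AND/OR combination of child expressions.
type expr struct {
	op       string // "AND", "OR", or "" for a leaf
	license  string // set only on leaves
	children []expr
}

// alternatives returns every set of licenses a licensee could pick to comply.
// OR concatenates the children's choices; AND takes their cross product.
func alternatives(e expr) [][]string {
	switch e.op {
	case "":
		return [][]string{{e.license}}
	case "OR":
		var out [][]string
		for _, c := range e.children {
			out = append(out, alternatives(c)...)
		}
		return out
	default: // "AND"
		out := [][]string{{}}
		for _, c := range e.children {
			var next [][]string
			for _, left := range out {
				for _, right := range alternatives(c) {
					combined := append(append([]string{}, left...), right...)
					next = append(next, combined)
				}
			}
			out = next
		}
		return out
	}
}

func main() {
	// Apache-2.0 AND (MIT OR GPL-2.0-only)
	e := expr{op: "AND", children: []expr{
		{license: "Apache-2.0"},
		{op: "OR", children: []expr{{license: "MIT"}, {license: "GPL-2.0-only"}}},
	}}
	for _, set := range alternatives(e) {
		fmt.Println(set)
	}
}
```

The two resulting sets, {Apache-2.0, MIT} and {Apache-2.0, GPL-2.0-only}, are exactly what a flat []string cannot distinguish from "all three licenses apply".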

Furthermore, syft's current license format is limited in representing the distinction between DECLARED and CONCLUDED licenses.

The SPDX format gives implementers the choice in determining if a license should be in the concluded license field or the declared license field:

Concluded

TODO: Update this description based on feedback from community meeting
Contain the license the SPDX document creator has concluded as governing the package or alternative values, if the governing license cannot be determined.

If the Concluded License is not the same as the Declared License (7.15), a written explanation should be provided in the Comments on License field (7.16). With respect to NOASSERTION, a written explanation in the Comments on License field (7.16) is preferred. If the Concluded License field is not present in a package, it implies an equivalent meaning to NOASSERTION.

Declared

List the licenses that have been declared by the authors of the package. Any license information that does not originate from the package authors, e.g. license information from a third-party repository, should not be included in this field.

Syft's approach

Syft should enhance the license representation from []string to []License in order to convey the above information more clearly. The following structs will be added in place of string to give downstream tooling more options for accurately reading how the license was determined during syft's run:

type Licenses struct {
    SPDXExpression string    // expression used to derive the below licenses
    Licenses       []License // licenses and their given metadata
}

type License struct {
    Name string
    Location Location
    Concluded bool // if false then we can assume declared? NOTE: Update this from meeting notes about when we should declare concluded
    Confidence int
    Offset int
    Extent int
}

type Location struct {
    Path string
    LayerID string
}

Here is a sample of the json representation of the above:

{
  "spdxLicenseExpression": "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)",
  "licenses": [
    {
      "name": "LGPL",
      "location": {
        "path": "/lib/apk/db/installed",
        "layerID": "sha256:ded7a220bb058e28ee3254fbba04ca90b679070424424761a53a043b93b612bf"
      },
      "concluded": true,
      "confidence": 0.92,
      "offset": 0,
      "extent": 23829
    }
  ]
}

When a license is successfully concluded, the above uses Google's license classifier to assess the license packaged with the software. It provides the confidence level (how closely the location's contents matched an entry in some source DB), the offset (how far into the file the match was found), and the extent (how long the match was).
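
As a rough illustration of how a downstream tool might read this shape, the following Go sketch decodes the sample with `encoding/json`. The JSON tags and the `decode` helper are assumptions made for this example, and `Confidence` is typed as `float64` here only because the sample value is 0.92; the proposal above declares it as `int`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Location struct {
	Path    string `json:"path"`
	LayerID string `json:"layerID"`
}

type License struct {
	Name       string   `json:"name"`
	Location   Location `json:"location"`
	Concluded  bool     `json:"concluded"`
	Confidence float64  `json:"confidence"` // assumption: float, since the sample uses 0.92
	Offset     int      `json:"offset"`
	Extent     int      `json:"extent"`
}

type Licenses struct {
	SPDXExpression string    `json:"spdxLicenseExpression"`
	Licenses       []License `json:"licenses"`
}

// decode is a hypothetical helper that parses a document shaped like the sample.
func decode(raw string) (Licenses, error) {
	var l Licenses
	err := json.Unmarshal([]byte(raw), &l)
	return l, err
}

const sample = `{
  "spdxLicenseExpression": "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)",
  "licenses": [
    {
      "name": "LGPL",
      "location": {
        "path": "/lib/apk/db/installed",
        "layerID": "sha256:ded7a220bb058e28ee3254fbba04ca90b679070424424761a53a043b93b612bf"
      },
      "concluded": true,
      "confidence": 0.92,
      "offset": 0,
      "extent": 23829
    }
  ]
}`

func main() {
	l, err := decode(sample)
	if err != nil {
		panic(err)
	}
	for _, lic := range l.Licenses {
		// The match covers bytes [Offset, Offset+Extent) of the file at Location.Path.
		fmt.Printf("%s in %s: confidence %.2f, bytes %d-%d\n",
			lic.Name, lic.Location.Path, lic.Confidence, lic.Offset, lic.Offset+lic.Extent)
	}
}
```

Go's JSON decoder matches keys case-insensitively, so the same structs would also accept keys written as "Name", "Offset", and "Extent".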

Why is this needed:
This enhancement is needed so syft can better represent the intent of SPDX license expressions, show more data on where a license (if concluded) was found, and give downstream tools that use SBOMs for license compliance more accuracy when assessing license contents against the policies they create.

@spiffcs spiffcs added the enhancement New feature or request label Feb 16, 2023
@spiffcs spiffcs changed the title Syft License revamp Syft License Revamp Feb 16, 2023
@wagoodman wagoodman added the license relating to software licensing label Feb 16, 2023
@wagoodman
Contributor

wagoodman commented Mar 1, 2023

Adjustments from an offline conversation:

TODO: what does this look like if there are external sources considered?
type License struct {
    Nodes     []SpdxLicense // licenses and their given metadata nodes
    Relations []Edge        // relation of license nodes for complex expressions
    *LicenseEvidence
}

type LicenseType string

const (
    LicenseDeclaredType  LicenseType = "declared"
    LicenseConcludedType LicenseType = "concluded"
)

type SpdxLicense struct {
    Name string
    Type LicenseType
}

type LicenseEvidence struct {
    Raw string // expression used to derive the below licenses
    Location Location
    URL string
    Confidence int
    Offset int
    Extent int
}

// use from source pkg...
type Location struct {
    Path string
    LayerID string
}

This tweaks the structs a bit:

  • Evidence is optional and compartmentalized away from the License itself
  • The "concluded" field is replaced with something extensible and mutually exclusive: the "type" field (with an enum)
  • The top-level "value" is not necessarily a license expression, but could be, so the variable was simply renamed
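
A minimal Go sketch of the first two points, under some assumptions: `IsValid` and `describe` are hypothetical helpers that do not appear in the proposal, and the `Relations`, `Location`, and `URL` fields are left out to keep the example short.

```go
package main

import "fmt"

type LicenseType string

const (
	LicenseDeclaredType  LicenseType = "declared"
	LicenseConcludedType LicenseType = "concluded"
)

// IsValid is a hypothetical helper: a license has exactly one type, and new
// types can be added later without changing the struct layout.
func (t LicenseType) IsValid() bool {
	switch t {
	case LicenseDeclaredType, LicenseConcludedType:
		return true
	}
	return false
}

type SpdxLicense struct {
	Name string
	Type LicenseType
}

type LicenseEvidence struct {
	Raw        string // expression used to derive the licenses
	Confidence int
	Offset     int
	Extent     int
}

type License struct {
	Nodes            []SpdxLicense
	*LicenseEvidence // nil when no evidence was collected
}

// describe is a hypothetical helper showing that evidence is optional.
func describe(l License) string {
	if l.LicenseEvidence == nil {
		return fmt.Sprintf("%d license(s), no evidence", len(l.Nodes))
	}
	return fmt.Sprintf("%d license(s), derived from %q", len(l.Nodes), l.Raw)
}

func main() {
	declared := License{Nodes: []SpdxLicense{{Name: "MIT", Type: LicenseDeclaredType}}}
	withEvidence := License{
		Nodes:           []SpdxLicense{{Name: "MIT", Type: LicenseConcludedType}},
		LicenseEvidence: &LicenseEvidence{Raw: "MIT"},
	}
	fmt.Println(describe(declared))
	fmt.Println(describe(withEvidence))
	fmt.Println(LicenseType("guessed").IsValid())
}
```

Because the evidence is an embedded pointer, callers need a nil check before reading promoted fields such as `Raw`.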

@spiffcs
Contributor Author

spiffcs commented Mar 1, 2023

More adjustments from an offline conversation:

type Package struct {
    Licenses []PackageLicense
}

type File struct {
    Licenses []FileLicense
}

type PackageLicense struct {
    ParsedExpression ParsedExpression
    URL string // external sources
    Location Location // on disk declaration
}

type FileLicense struct {
    ParsedExpression ParsedExpression
    Location Location
    *LicenseEvidence // evidence from license classifier
}

type ParsedExpression struct {
    RawValue string
    ValidSPDXExpression string
}

type SpdxLicense struct {
    Name string
    Type LicenseType // always declared for now -- see community notes
}

type LicenseType string

const (
    LicenseDeclaredType  LicenseType = "declared"
    LicenseConcludedType LicenseType = "concluded"
)

type LicenseEvidence struct {
    Confidence int
    Offset int
    Extent int
}

// use from source pkg...
type Location struct {
    Path string
    LayerID string
}

This tweaks the structs a bit more to distinguish between licenses discovered for a package and licenses discovered for a file:

  • Evidence is optional, compartmentalized away from the License itself, and only used for on-disk analysis
  • Location and URL are used when constructing a package license
  • Location and Evidence are used when constructing a file license
  • A License can be for a package or for a file
    - {Location, URL} and {Location, Evidence} can be represented in JSON as a shared "From" field with a type (local vs URL)
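
One possible reading of the shared "From" field, sketched in Go. The `From` struct, its JSON keys, the `fromURL`/`fromLocation`/`encode` helpers, and the example.com URL are all assumptions made for illustration; the thread does not settle on a concrete shape.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Location struct {
	Path    string `json:"path"`
	LayerID string `json:"layerID,omitempty"`
}

// From is a hypothetical JSON shape: Type says which of the other fields is set.
type From struct {
	Type     string    `json:"type"` // "local" or "url"
	URL      string    `json:"url,omitempty"`
	Location *Location `json:"location,omitempty"`
}

// fromURL describes a license that came from an external source.
func fromURL(url string) From {
	return From{Type: "url", URL: url}
}

// fromLocation describes a license that was found on disk.
func fromLocation(loc Location) From {
	return From{Type: "local", Location: &loc}
}

// encode marshals a From value to its JSON text.
func encode(f From) string {
	b, err := json.Marshal(f)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(fromURL("https://example.com/LICENSE")))
	fmt.Println(encode(fromLocation(Location{Path: "/lib/apk/db/installed"})))
}
```

The `omitempty` tags keep each encoded value down to the fields that apply to its type, so consumers can switch on "type" without inspecting empty fields.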

@deitch
Contributor

deitch commented Mar 11, 2023

I am at a bit of a loss as to how []License (or any of its variants) is all that much better than []string. The fundamental problem with both is that they can only handle one kind of relationship among the elements of the slice: either all are joined by AND or all are joined by OR. Neither can handle a compound expression such as apache and (mit or bsd).

What I tried to do with #1554 is replace string or []string with a tree that can handle all of those compound expressions.

It is possible that by this:

type Licenses struct {
    SPDXExpression string    // expression used to derive the below licenses
    Licenses       []License // licenses and their given metadata
}

you meant that the authoritative license structure is the SPDXExpression string, and that the []License is just the list of licenses, i.e. the relationships between them are determined only by the SPDXExpression string, but that was not as clear to me.

It fits a bit with:

type License struct {
    Nodes     []SpdxLicense // licenses and their given metadata nodes
    Relations []Edge        // relation of license nodes for complex expressions
    *LicenseEvidence
}

but that separate listing of licenses and edges makes it hard to track them and likely to create errors.

It might kind of fit with this later one:

type PackageLicense struct {
    ParsedExpression ParsedExpression
    URL string // external sources
    Location Location // on disk declaration
}
type ParsedExpression struct {
    Value string
    Nodes []SpdxLicense
}

But it has the same problem.

I do see the structure of the last option here (not sure I agree on the distinction between PackageLicense and FileLicense, but I am missing a lot of context, so passing on that).

Either way, why would we not have something like this:

type Joiner string

const (
    AND Joiner = "AND"
    OR  Joiner = "OR"
)

type PackageLicense struct {
    ParsedExpression ParsedExpression
    Raw string   // original string that was parsed to give this parsed expression
    URL string // external sources
    Location Location // on disk declaration
}
type ParsedExpression struct {
   Compound []ParsedExpression
   Simple   []string
   Joiner
}

// Licenses lists all of the individual licenses, independent of their relationships
func (p PackageLicense) Licenses() []string {
    // TODO: walk p.ParsedExpression and collect every Simple entry
    return nil
}
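
For illustration, here is one way the tree above could be walked, as a sketch only. The `Licenses` and `String` methods on `ParsedExpression` are my assumptions about how the structure might be used, not code from #1554. The layout also has a limitation: within a node, simple entries always render before compound ones, so the original operand order is not preserved.

```go
package main

import (
	"fmt"
	"strings"
)

type Joiner string

const (
	AND Joiner = "AND"
	OR  Joiner = "OR"
)

type ParsedExpression struct {
	Compound []ParsedExpression
	Simple   []string
	Joiner
}

// Licenses collects every license ID in the tree, depth-first, ignoring joiners.
func (p ParsedExpression) Licenses() []string {
	out := append([]string{}, p.Simple...)
	for _, c := range p.Compound {
		out = append(out, c.Licenses()...)
	}
	return out
}

// String renders the tree back into an SPDX-style expression, wrapping each
// compound child in parentheses.
func (p ParsedExpression) String() string {
	parts := append([]string{}, p.Simple...)
	for _, c := range p.Compound {
		parts = append(parts, "("+c.String()+")")
	}
	return strings.Join(parts, " "+string(p.Joiner)+" ")
}

func main() {
	// Apache-2.0 AND (MIT OR GPL-2.0-only)
	e := ParsedExpression{
		Joiner: AND,
		Simple: []string{"Apache-2.0"},
		Compound: []ParsedExpression{
			{Joiner: OR, Simple: []string{"MIT", "GPL-2.0-only"}},
		},
	}
	fmt.Println(e.Licenses())
	fmt.Println(e.String())
}
```

A `PackageLicense.Licenses()` method could then simply delegate to `p.ParsedExpression.Licenses()`.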

@spiffcs spiffcs self-assigned this Mar 16, 2023
@spiffcs spiffcs linked a pull request Apr 27, 2023 that will close this issue
@wagoodman wagoodman changed the title Syft License Revamp Support describing license properties and SPDX expression assertions May 22, 2023

3 participants