-
Notifications
You must be signed in to change notification settings - Fork 587
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Handle compound license expressions #1554
Conversation
👋 Thanks for the PR @deitch We've got some work in progress that also is touching the license side of syft and I'd like to incorporate this PR since it does a good portion of the compound logic. The work in progress right now moves syft to consider license as a struct rather than string:
Here is a sample of the data structure we're trying to move licenses to:
The confidence score comes from using this license string classifier library to match the contents of an unknown text (contents of the license file) to a set of known texts (license db). Curious on your thoughts on how we fit compound expression handling into this kind of update - I thought it would be easier to move syft towards a more complex license descriptor before tackling the compound expressions, but doing both at the same time seems achievable given the work you put into this PR! |
I think both are achievable, although I would split it into two stages, and the order doesn't matter (except for my bias towards getting this one in 😁 ). The type LogicalStrings struct {
Compound []LogicalStrings
Simple []string
Joiner
} We could convert that into a type LogicalLicenses struct {
Compound [] LogicalLicenses
Simple []License
Joiner
} where each All of this assumes that you have a I would think that we would start with this one (besides my obvious bias), and then we can change the underlying structs when ready, but the flow and logic already will be in place. We still will just be passing around things. I will get the 3 CI failures fixed, although I don't get what the static analysis is complaining about. Some help would be appreciated. |
^ This is correct! Let me look at the static analysis. In the mean time - how much overlap are we seeing with libraries like: Is this PR meant to target SPDX expressions specifically, or are we trying to parse compound statements from other ecosystems that don't adheer to that license composite specification? |
The static analysis had to do with a schema bump. Ideally we don't do multiple changes of this license field back to back (another change to the proposed struct), but I've updated it for now so the checks can go green. License will eventually be
Where x is what we land on as far as how to handle the complex expressions. There is possibly another path where we keep license as |
As far as I know, that validates against the list. I was much more looking for parsing all licenses it finds, handling approved ones (
I just expanded what already was there, but it pretty much was a subset of spdx anyways. The |
Ah, that explains it. Thanks for handling that. I will resolve the other CI issues.
I looked at it, I didn't see any tools there to parse licenses into some of those complex structures. 🤷♂️ |
@deitch you're right -- I missed the fact that basically none of the license parsing in the GitHub library is exported, only a couple functions like Satisfies. An alternate library is: https://github.com/kyoh86/go-spdx but it doesn't look especially well maintained. |
If we're going to roll our own parser, we should probably at least handle:
|
I think I see a test case where they are handling the OR expression here: And then here is one testing against the paren case (): Something like:
Which complex structure are not covered - the node structure they provide seems pretty comprehensive at first glance, but you have a lot more context loaded for this one so I defer to what you found! |
which we do, see here. It has potential robustness improvements, but it does all of that. |
|
How interesting, they did I as a binary tree, rather than lists of If their thing was published, I would use it. But it wasn't, so I built it from scratch. Mostly. I had a need for something like this years ago in my JavaScript days, so the logic was somewhere back in my head. I think we miss the |
Do you understand why some of the encoding tests are failing? The outputs - json and xml etc. - are nearly identical, but the package IDs are different, e.g.: < <component bom-ref="pkg:deb/debian/package-2@2.0.1?package-id=342b18b33e30ed0" type="library">
---
> <component bom-ref="pkg:deb/debian/package-2@2.0.1?package-id=db4abfe497c180d3" type="library"> I see a few such cases that repeat a lot. |
It appears to have something to do with these |
@deitch I'm seeing a number of issues unrelated to the |
109023a
to
892c067
Compare
Yup, working on those.
So just |
@deitch I typically run individual tests in IntelliJ after updating the code to have a |
It didn't solve them, and the diffs in GHA output are a little confusing. 😁 I guess I can change it manually? Where would I do so? There also are template version issues here 6.2.0 vs 7.0.0 |
@deitch I would suggest hitting pause on this PR for just a bit -- we're having some internal discussions about how we would like to see changes in the SBOM structure and other things and I don't want you to spend a bunch of time working on something that might need to change |
892c067
to
1f3e5c0
Compare
OK, think I got those. Probably one or two small things left. |
@kzantow just saw this. Ah, well, and I was pretty proud of what we have done. Part of the reason I am trying so hard to get it in is that I had a chicken and egg. syft missed non-standard licenses, so I got in #1540 , but now it struggled with compound licenses. To be fair, those were broken before too, but syft merrily treated them as unintelligible and returned them as Either way, syft now either ignores any non-standard licenses (pre-1540), or breaks on some of them (post-1540). I wanted to try and get it to handle them sanely. Let me know what I can do to help move it along. FWIW, |
f969d1d
to
01ba9a6
Compare
@deitch you've done some really good work! we just need to have a bit of a conversation how all the license changes work together before jumping in. Thank you very much for your contributions! |
You are most welcome. I will admit some self-interest, in that I am working with client projects that have a need, but the best OSS has different people coming together to pull together. I will keep a close eye on this. |
If you are at all interested in the last failing unit test, it is the following. We have some places in the tests where we have an implied It is, quite literally, impossible to reconcile both of those. Either whitespace is a separator and, if there is no joiner, it is an implied |
3ad35af
to
8eb9f95
Compare
Hey @deitch I didn't want to leave this PR hanging too long -- we all are interested in getting a solution here. Would you be able to join the community meeting on Thursday this week? We are keen to discuss it along with the related changes that @spiffcs mentioned! If not, maybe we could have some more synchronous discussion on the community slack, if you had some time. |
Unfortunately, I will be unavailable at that time. I will message you on Slack. |
0407f38
to
3cd400f
Compare
Signed-off-by: Avi Deitcher <avi@deitcher.net>
3cd400f
to
fafe494
Compare
I rebased, but almost certainly broke a bunch of things. It looks like the schema version already was updated to 7.0.0, so I need to update it (7.0.1?). I unfortunately have no idea how to do that. |
Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com> Signed-off-by: Avi Deitcher <avi@deitcher.net>
fafe494
to
595701c
Compare
Oho! I figured that part out! |
@deitch I'll make more time for this today and get this into a state was we discussed offline and then ping you for feedback. Thank you again for making this PR and taking the time to talk with me about the things you need Syft to do to meet your needs when using it. |
Yup, sounds good. As long as it is out of the current broken state, and into some state where the output is parsable - either a structure representing all of the licenses and their relationships, or a simple string that is standard but parsable (spdx format most likely) - it will be fine. |
* main: (47 commits) Deprecate config.yaml as valid config source; Add unit regression for correct config paths (anchore#1640) chore: Update syft bootstrap tools to latest versions. (anchore#1682) Update documentation: (anchore#1680) chore: Update Stereoscope to 7928713c391e20abaede6a029f4ce37b628a4c8b (anchore#1681) fix: reduce logging for bad dpkg lines (anchore#1675) fix ruby classifier (anchore#1678) feat: add shared dir for easier cleanup (anchore#1676) chore(deps): bump github.com/google/go-containerregistry (anchore#1672) chore(deps): bump actions/setup-go from 3 to 4 (anchore#1671) fix: move defer after error to protect panic case (anchore#1670) feat: add argocd, helm, kustomize and kubectl binary classifiers (anchore#1663) defer closing file (anchore#1668) fix: remove author contributing to javascript CPEs (anchore#1669) fix: more python matching support (anchore#1667) Update syft bootstrap tools to latest versions. (anchore#1666) feat: add ruby classifier (anchore#1665) Update syft bootstrap tools to latest versions. (anchore#1658) fix: improved Python binary detection (anchore#1648) fix: suppress some known incorrect vendor candidates for npm CPEs (anchore#1659) fix: sanitize SPDX LicenseRefs (anchore#1657) ... Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Attempt at a merge commit to main is done: cleaning up the tests and adding my changes on top to try and keep the diff as clean as possible |
@deitch long awaited, but now complete: ^ The above PR has been merged. We have not done a syft release yet with it included, but if you build from main you're now guaranteed to have valid SPDX expressions (singular id or complex) under the Licenses are no longer a Thanks a ton for bringing the syft license issue to our attention and we hope this is the first step in making it easier for consumers of syft to have more guarantees around what fields contain valid SPDX licenses. We've also made some changes to how our internal model translate licenses to the If you have any questions, comments, or suggestions for further changes you need feel free to reach out to me or the team. If you feel like this PR has been closed in error and the current solution does not expose the data you need also feel free to comment back and we can talk further! |
@spiffcs excellent! This really is pretty impressive. |
As discussed with @kzantow here, syft currently does not handle compound licenses, or anything joined with
OR
(as opposed toAND
).This changes that.
LogicalStrings
struct that has a joiner, simple elements and compound child elementsLogicalStrings
and back to stringLicenses []string
property in various places with a more complex compound expression structSome key points and questions:
internal
, but I am not convinced it shouldn't be in a different package; open to suggestionsLogicalStrings
struct. I think that is the right first step; it can be updated later.Ball back in the hands of the maintainers.