Fix ScanCodeSummarizer #999

qtomlinson · 2023-07-17T15:09:36Z

Fix a few cases where Scancode results provide license information.

In addition to content.summary.packages[0].declared_license, check content.summary.packages[0].license_expression
Concatenate the license only when its score >= 80, to be consistent with summarizing license info for files.
Add META/INF directory as license location for maven packages.

See commit messages for detailed explanation and test cases

qtomlinson · 2023-07-17T15:35:16Z

@mpcen ready for review

In scancode, packages[0].license_expression contains license information. This later becomes 'declared_license_expression'. Details see aboutcode-org/scancode-toolkit@ab677c6#diff-47cc909d82dee95ebbb1a3d3a8ed519ae75684072c8f4867b90056d66863f964 Based on documentation, 'declared_license_expression' is the 'primary license expression as determined from the declaration(s) of the authors of the package'. See see https://www.nexb.com/scancode-license-clarity-scoring/ When the existing logic fails to normalize, try to derive license information from packages[0].license_expression.

In some cases, CD (scancode summarizer) goes through the root files to derive license information. file.licenses is an array of licenses with varying matching scores. Currently, after the license file is found, we concat all the license expressions as declared license (even though some of them have low matching scores). However, when the licenses are tallied from the files, those licenses with score lower than 80 are filtered out. This inconsistency causes the declared license does not show up in the discovered license (>80 score) in some cases. Refactor the logic, so that only license score above 80 is considered for declared license. Test case: https://clearlydefined.io/definitions/composer/packagist/colinmollenhour/cache-backend-redis/1.14.4

In one of the workflow, scancode summarizer goes through the root files to look for license. The filtering for root files is based on case sensitive file path matching. In maven packaging, META/INF directory is used often. Add this to the license location to search. Test case: https://clearlydefined.io/definitions/maven/mavencentral/org.flywaydb/flyway-core/9.20.0 https://dev.clearlydefined.io/definitions/maven/mavencentral/org.flywaydb/flyway-core/7.7.2 Task: clearlydefined#846

qtomlinson · 2023-10-10T22:00:59Z

@mpcen @jeffwilcox rebased and ready for review. Any feedback would be very much appreciated!

jeffwilcox · 2023-10-10T22:07:00Z

providers/summary/scancode.js

@@ -178,6 +189,10 @@ class ScanCodeSummarizer {
      .filter(e => e)
  }

+  _isMatchedLicense(license) {
+    return license.score >= 80


where is this threshold documented/defined for the community understanding - is that a scancode native value that means "definitely the match"?

[I read the more detailed commit messages]

Are there any situations where this >= 80 threshold might produce an inaccurate resulting license that we could point to a risk on this?

Thanks for the feedback!
The exiting logic in scancode summarizer includes licenses with lower scores in the declared license, but lists only licenses scoring above 80 for the files. For example, the declared licenses for https://clearlydefined.io/definitions/composer/packagist/colinmollenhour/cache-backend-redis/1.14.4 are BSD-3-Clause AND GPL-2.0-only. However discovered license is only BSD-3-Clause (GPL-2.0-only is filtered out due to lower score). This inconsistency causes confusion. My attempt here is to make declared license and discovered license consistent and remove licenses scoring less than 80 from the declared license.

As to what is the appropriate score threshold to avoid risk of incorrect license:

in scancode data with versions prior to 3.0.0, licenses with score >= 90 are considered as declared license (_getLicenseByFileName)

Since version 3.0.0 of scancode, is_license_text flag from scancode is used. The “is_license_text” flag is set to true for files that contain mostly license texts and notices. Based on documentation, "mostly means over 90% of the content of the file."
According to this documentation (prior to v32 update, 30.1.0 is currently used in the crawler), the license objects for each detected license in the files has "is_license_text" attribute as well. I will adapt my solution to use the is_license_text flag on the detected license instead.

Upon further research, there are four flags determined at each license match level: "is_license_text", "is_license_notice", "is_license_tag", "is_license_reference". This is also the order of precedence. Details see https://scancode-results-analyzer-fork-ayan.readthedocs.io/en/latest/how-analysis-is-performed/cases-incorrect-scans.html. is_license_text alone is not enough to determine whether it is the correct license match. Also see more discussion at aboutcode-org/scancode-toolkit#3082

I will refactor and reuse the logic for scancode prior to 3.0.0 to determine declared license by scoring >= 90. This way is more consistent within CD's implementation. @jeffwilcox What do you think?

Are there any situations where this >= 80 threshold might produce an inaccurate resulting license that we could point to a risk on this?

@jeffwilcox Without any fixes, there is currently a risk of inaccurate resulting license. Details see issue. Scancode file licenses (as discovered) are always filtered based on score (>= 80). For declared licenses, there is a filtering based on score (>=90) in place for ScanCode data with versions prior to 3.0.0. This filtering is missing for later versioned ScanCode data, resulting license matched with lower scores showing up in "declared", but filtered out in "discovered". Including the score filtering (>=90) will ensure more accurate license information, and also consistency in handling different versions of ScanCode data.

Task: clearlydefined/crawler#519

qtomlinson · 2023-11-02T23:24:46Z

@elrayle for review

pombredanne · 2023-11-06T19:22:13Z

Feel free to pull me and @AyanSinhaMahapatra in. We maintain ScanCode.
Note that filtering based on license score is likely to miss several detected licenses.

AyanSinhaMahapatra · 2023-11-06T19:39:32Z

where is this threshold documented/defined for the community understanding - is that a scancode native value that means "definitely the match"?

See https://scancode-toolkit.readthedocs.io/en/stable/reference/license-detection-reference.html for reference, we now have LicenseDetection objects (with tons of upgrades in license detection heuristics) which have a detection_log optionally which has info on license match accuracy. See https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/detection.py#L98 for possible values, there is some docs/docstrings but scattered, I'll improve on the docs here too. Note that this is similar to https://scancode-results-analyzer-fork-ayan.readthedocs.io/en/latest/how-analysis-is-performed/selecting-incorrect-unique.html#file-regions-with-incorrect-scans but has many improvements, as this was a prototype, which was improved a lot before being merged in scancode proper.

Also note that we do much better package and license summarization on a codebase-level https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/scan-options-post.html#summary-option based on legalese/key files which would be useful also.

qtomlinson · 2023-11-07T23:44:08Z

@pombredanne @AyanSinhaMahapatra It is great to have feedback from the experts! Thanks for the insight and pointers on the new ScanCode data format. The references are extremely helpful for our ScanCode upgrade. I have included the related comments in the upgrade issue.

The current issue at hand deals with ScanCode output data from version 30.1.0. How to make declared license consistent with discovered licenses (issue)?

Options:

rescan the package when ScanCode is upgraded, or
see whether the harvested data has enough information where a license can be correctly derived without triggering a re-harvest.
Option2 is the cheaper one, and is being explored here.

With the introduction of is_license_text on the file level, file.is_license_text is used in CD to detect a license file. When there are multiple license matches for a license file, historically, licenses[i].score has been used as a filter (>=80). For example, for colinmollenhour-Cm_Cache_Backend_Redis-6661e3f/LICENSE, there are two license matches:
- BSD-3-Clause (score 95.71) and
- GPL 2.0 (score 11.72).
After filtering based on score, only BSD-3-Clause is displayed as the license for the file, and later in discovered licenses, in clearlydefined UI,

For the declared license, the filtering based on score is missing for recent versions of ScanCode data, resulting both BSD-3-Clause and GPL 2.0 are declared but only BSD-3-Clause is discovered. This is confusing to end users.

Without resorting to filtering based on a score threshold,

using licenses[i].matched_rule[j].is_license_text was also explored. It is found that in some cases, when licenseFile.is_license_text === true, licenses[i].matched_rule[j].is_license_text can be false for all matched licenses (is_license_notice matched, see test/fixtures/scancode/3.0.2/commons-clause.json).
using licenses[i].matched_rule[j].is_license_notice is not good either. For colinmollenhour-Cm_Cache_Backend_Redis-6661e3f/LICENSE, GPL 2.0 is a match by is_license_notice === true with score of 11.72.

It seems that score is a consistent measure of license match in data versioned <= 30.1.0. That is why filtering based on score >=90 (existing filter value for older versioned ScanCode data) for declared license is proposed, with filtering score >=80 as discovered license already in place.

Any alternatives?

qtomlinson · 2023-11-08T16:06:35Z

Note that filtering based on license score is likely to miss several detected licenses.

Yes. Filtering (for consistency) does increase the likelihood of NOASSERTION here. Those cases need curation or re-harvested using the updated ScanCode. Adjusting filtering threshold can be one way to reduce those cases. Filter for "discovered" license is set to be 80. What is a sensible threshold for "declared"? 90 is used for older versioned ScanCode data (prior to 3.0.0).

qtomlinson · 2023-11-08T17:30:09Z

providers/summary/scancode.js

-    if (!declaredLicense || declaredLicense === 'NOASSERTION') {
-      declaredLicense = this._getDeclaredLicense(scancodeVersion, harvested, coordinates)
+    if (!isDeclaredLicense(declaredLicense)) {
+      declaredLicense = this._readLicenseExpression(harvested) || declaredLicense


In the latest ScanCode format, "the codebase level packages has been also updated: license_expression -> declared_license_expression, also with it’s SPDX version, declared_license -> extracted_license_statement" from doc. Here in CD with 30.1.0 ScanCode data, declared_license is used to parse package level license information. In addition, package level license_expression (->declared_license_expression in the new format) can be utilized to provide license information. @pombredanne @AyanSinhaMahapatra Let me know if you have any feedback.

qtomlinson marked this pull request as ready for review September 19, 2023 21:39

qtomlinson added 3 commits October 10, 2023 14:48

qtomlinson force-pushed the qt/scancode-license branch from e1e549f to 5a68d1f Compare October 10, 2023 21:54

jeffwilcox reviewed Oct 10, 2023

View reviewed changes

qtomlinson added 2 commits October 12, 2023 15:41

Ignore license matches scoring below 90 in declared licenses

875e274

Utilize declared_license.license in ScanCode result

482992a

Task: clearlydefined/crawler#519

qtomlinson mentioned this pull request Nov 7, 2023

update processing of scancode output to work with the latest version #1006

Open

qtomlinson changed the title ~~Fix NoAssertion license cases~~ Fix ScanCodeSummarizer Nov 8, 2023

qtomlinson commented Nov 8, 2023

View reviewed changes

qtomlinson marked this pull request as draft November 15, 2023 23:56

This was referenced Nov 16, 2023

Fix NoAssertion cases in maven and sourcearchive typed packages #1017

Merged

Fix NoAssertion cases by using information in scancode results #1018

Merged

qtomlinson mentioned this pull request Apr 5, 2024

Add new summarizer for recent ScanCode versions #1056

Merged

qtomlinson closed this Oct 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix ScanCodeSummarizer #999

Fix ScanCodeSummarizer #999

qtomlinson commented Jul 17, 2023 •

edited

Loading

qtomlinson commented Jul 17, 2023

qtomlinson commented Oct 10, 2023

jeffwilcox Oct 10, 2023

jeffwilcox Oct 10, 2023

qtomlinson Oct 11, 2023 •

edited

Loading

qtomlinson Oct 12, 2023

qtomlinson Nov 2, 2023 •

edited

Loading

qtomlinson commented Nov 2, 2023

pombredanne commented Nov 6, 2023

AyanSinhaMahapatra commented Nov 6, 2023

qtomlinson commented Nov 7, 2023 •

edited

Loading

qtomlinson commented Nov 8, 2023 •

edited

Loading

qtomlinson Nov 8, 2023 •

edited

Loading

Fix ScanCodeSummarizer #999

Fix ScanCodeSummarizer #999

Conversation

qtomlinson commented Jul 17, 2023 • edited Loading

qtomlinson commented Jul 17, 2023

qtomlinson commented Oct 10, 2023

jeffwilcox Oct 10, 2023

Choose a reason for hiding this comment

jeffwilcox Oct 10, 2023

Choose a reason for hiding this comment

qtomlinson Oct 11, 2023 • edited Loading

Choose a reason for hiding this comment

qtomlinson Oct 12, 2023

Choose a reason for hiding this comment

qtomlinson Nov 2, 2023 • edited Loading

Choose a reason for hiding this comment

qtomlinson commented Nov 2, 2023

pombredanne commented Nov 6, 2023

AyanSinhaMahapatra commented Nov 6, 2023

qtomlinson commented Nov 7, 2023 • edited Loading

qtomlinson commented Nov 8, 2023 • edited Loading

qtomlinson Nov 8, 2023 • edited Loading

Choose a reason for hiding this comment

qtomlinson commented Jul 17, 2023 •

edited

Loading

qtomlinson Oct 11, 2023 •

edited

Loading

qtomlinson Nov 2, 2023 •

edited

Loading

qtomlinson commented Nov 7, 2023 •

edited

Loading

qtomlinson commented Nov 8, 2023 •

edited

Loading

qtomlinson Nov 8, 2023 •

edited

Loading