Scancode confuses files with similar names #3648

bennati · 2024-01-30T09:56:55Z

Scanning a folder containing only two files COPYING and COPYING.LGPLv2.1 (see
files.zip )

Scancode detects license LGPL in file COPYING at line 502, but COPYING has only 66 lines while COPYING.LGPLv2.1 has 502.

Scancode output is in the attachment.

This issue can be reproduced with Scancode 32.0.8 but not with 31.2.6

The text was updated successfully, but these errors were encountered:

pombredanne · 2024-01-31T09:26:44Z

This is a really surprising bug! Thanks for the report

AyanSinhaMahapatra · 2024-01-31T10:07:28Z

@bennati @pombredanne
We have the following lines in COPYING:

    The following license texts are included in the following files:
      - COPYING.LGPLv2.1: GNU Lesser General Public License version 2.1
      - COPYING.GPLv2: GNU General Public License version 2
      - COPYING.GPLv3: GNU General Public License version 3

which is matched to https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/lgpl-2.1_and_gpl-2.0_and_gpl-3.0_3.RULE and references the other COPYING files.

And it brings in the license detected at COPYING.LGPLv2.1 present here.

See scan results below (scanned with latest SCTK develop)
copying-files.json

See the match in question:

            {
              "license_expression": "lgpl-2.1",
              "spdx_license_expression": "LGPL-2.1-only",
              "from_file": "copying-files/COPYING.LGPLv2.1",
              "start_line": 1,
              "end_line": 502,
              "matcher": "1-hash",
              "score": 100.0,
              "matched_length": 4288,
              "match_coverage": 100.0,
              "rule_relevance": 100,
              "rule_identifier": "lgpl-2.1.LICENSE",
              "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/lgpl-2.1.LICENSE",
            }

it mentions "from_file": "copying-files/COPYING.LGPLv2.1", which tells us this match has been taken from the respective file as that was referenced.

The from-file attribute was added recently at #3620 to help make referenced matches more verbose. See also related issue: #3547

pombredanne · 2024-02-01T06:55:46Z

@AyanSinhaMahapatra so if I may recap, you are saying that the detection is correct, and that the latest version helps better understand the detection details?

AyanSinhaMahapatra · 2024-02-01T08:42:45Z

Yes @pombredanne :) That's correct

bennati · 2024-02-01T09:15:53Z

Thank you guys, I opened an issue in the ORT project to support this new feature oss-review-toolkit/ort#8190

sschuberth · 2024-02-01T13:32:39Z

it mentions "from_file": "copying-files/COPYING.LGPLv2.1", which tells us this match has been taken from the respective file as that was referenced.

I have to say that I find the implementation to be very confusing from a user perspective. There are plenty of tools out there that parse ScanCode result JSON files, and so far all of them could have relied on lines from matches to belong to the file / path they are nested under. But now, suddenly, additional info is added that easily goes unnoticed and basically says "eh, wait, this match actually comes from a different file despite being listed here"!

Also, could you explain the general idea behind such references a bit more? Where's the benefit to refer from COPYING to COPYING.LGPLv2.1 if the latter also gets scanned on its own? Are we then actually getting duplicate findings for COPYING.LGPLv2.1, once as a reference as part of scanning COPYING, and then again from scanning COPYING.LGPLv2.1 directly?

sschuberth · 2024-02-01T13:51:58Z

Oh, and on top of that, it seems there currently (with ScanCode 32.0.8) is no way to distinguish "real findings" from "reference findings" as there is no from_file field yet, correct?

AyanSinhaMahapatra · 2024-02-01T14:33:27Z

@sschuberth could you kindly explain more what is confusing here?

License references to other files are pretty common and encountered widely across codebases. And there are generally a few cases and see #3547 for more details on this.

In a lot of cases there is no information about the license at the location where another file is referenced. This is an * unknown * license reference, and if we can successfully find the referenced file and detect some license there (let's say anything apache-2.0) we carry over the license matches and also make the license expression to be apache-2.0 and not unknown-license-reference.
Also in some cases we have information about the license both in the location where a file is referenced and the referenced file. Now there could be a mismatch between these two information, the file could say, * licensed under gpl, see LICENSE * but at LICENSE we get a different/more licenses.

In both these cases, it is important to go to the referenced file and check for license information there, and adjust the license expression accordingly where it was referenced. Now when we are changing a license expression based on results in some other file (which is extremely important as this can be unknown/wrong, and a major source of false positives we get reports for), we need to have some clue that something was modified based on results in another file. Otherwise it would be more confusing where we detected the license expression at.

So this is an improvement over previous versions where we were not following references, and we are also providing clues on how and from where we got the license matches from.

If you're not interested in these matches, or only want to show matches which are from the same file, you can filter based on from_file values and the path for a file.

This feature of following references to resolve cases like these has been there for more than two years now, even before the recent license detection upgrade and output format change in v32. See #2616 where the initial implementation was added. Now we just provide more helpful info on which file this originated from, so that downstream tools can use this to show origin properly. It would have been nice to include this right at the start when we implemented following references, but we didn't get any feedback on this then.

AyanSinhaMahapatra · 2024-02-01T14:35:47Z

There is also some discussion with lot more details on #3547 where we are considering when to carry over referenced matches and when not to. Please chime in there too if you want to give feedback.

AyanSinhaMahapatra · 2024-02-01T14:38:09Z

And this from_file attribute was added based on feedback at aboutcode-org/scancode.io#902 to enable downstream tools like SCIO to show license match origin correctly.

sschuberth · 2024-02-01T15:01:53Z

In a lot of cases there is no information about the license at the location where another file is referenced.

Agreed. Hence I'm doubting the usefulness of tracking the place where a reference is being made if the target of the reference is included in the scan.

Also in some cases we have information about the license both in the location where a file is referenced and the referenced file.

I was wondering why you say only "in some cases", but I guess you have cases in mind where the reference points to a file that is not included in the scan like "see license at http://my.server/LICENSE.txt". In such a case I see how it could be useful track this is a reference findings to an unknown license. While ScanCode may not look at the contents of http://my.server/LICENSE.txt, a human inspecting the scan results might do so.

If you're not interested in these matches, or only want to show matches which are from the same file, you can filter based on from_file values and the path for a file.

I was preparing to do that, but that does not work with results generated by ScanCode 32.0.8 as it does not provide the from_file field yet, right?

This feature of following references to resolve cases like these has been there for more than two years now

Interesting. So we probably have quite a few wrongly associated license findings in our databases, and it just didn't occur to us until now, where we ran into a case where the end line was exceeded.

And this from_file attribute was added based on feedback at nexB/scancode.io#902 to enable downstream tools like SCIO to show license match origin correctly.

I agree that the addition of from_file is an improvement, as it allows us to distinguish real matches from reference matches, which does not seem to have been possible before.

sschuberth · 2024-02-21T17:49:44Z

I agree that the addition of from_file is an improvement, as it allows us to distinguish real matches from reference matches, which does not seem to have been possible before.

@AyanSinhaMahapatra can you confirm my statement here, that before the addition of from_file, there was no way for parsers of the ScanCode JSON result file to understand that a finding does not actually come from the file it is associated with?

AyanSinhaMahapatra · 2024-02-21T18:41:04Z

@sschuberth sorry about the late reply:

Agreed. Hence I'm doubting the usefulness of tracking the place where a reference is being made if the target of the reference is included in the scan.

The usefulness is that we are resolving unknown detections for license references, to their intended license declaration, and tracking how we came to the conclusion, in a variety of cases. So, lots of reduced false positives. We cannot get rid of these kinds of unknowns any other way, short of not reporting anything at all, and since these could be in key files, package manifests etc., which determine the primary license for a package, it would be misleading to not report the declared license.

but I guess you have cases in mind where the reference points to a file that is not included in the scan like "see license at http://my.server/LICENSE.txt"

Not exactly.

These are license references which are not unknown. An example is the issue in question, where we have license information properly detected both at COPYING (plus a license reference is there) and COPYING.LGPLv2.1 (the file referred to) but we still chose to follow and check whether this is consistent. This is essentially what is discussed in Refine referenced_filenames in license rules #3547 and something which is being discussed more before being implemented.
We are planning to support license references to files which are not present in the codebase scaned, rather somewhere in the internet, like: http://github.com/nexB/license-expression/LICENSE.txt. We already support some of these which are popular by making the URL a license rule directly, but not everything can be added like this. Another case is nuget, where the license field for the packages sometimes has a reference to a file/webpage somewhere (we already detect the SPDX expressions there allright: Detect new NuGet license URLs with expressions #3037). But this would be as a step in scancode.io, or as an optional step in SCTK as this would require making network calls.

I was preparing to do that, but that does not work with results generated by ScanCode 32.0.8 as it does not provide the from_file field yet, right?

Yeah, this is merged in develop and we are working towards a release: v32.1 sooner than later.

I agree that the addition of from_file is an improvement, as it allows us to distinguish real matches from reference matches, which does not seem to have been possible before.

Yeah that's correct, we have not had any feedback to point in this direction in the couple years since it's release. When we had feedback, we implemented it. And this was an improvement anyway, which we are building upon to reduce false positives.

can you confirm my statement here, that before the addition of from_file, there was no way for parsers of the ScanCode JSON result file to understand that a finding does not actually come from the file it is associated with?

I mean above that there is no straightforward/easy way to do this, but it could be done potentially with rescanning the same scan target, if it's possible to do that. One would have to check the associated rule for each match, and if there are any referenced_filenames for that rule. And for a scan if there referenced filenames in any of the detected matches, and that file is present in the codebase, do a rescan for the same and use the new results.

sschuberth · 2024-02-21T19:21:59Z

it could be done potentially with rescanning the same scan target, if it's possible to do that.

Nah, that would be too involving, I believe.

Thanks for your answers and insights!

sschuberth · 2024-03-28T10:57:37Z

Let's close this now that the ScanCode 32.1.0 release as the from_file available which allows us to disambiguate the findings that actually stem from other files.

See [1] for context and the test data used. [1]: aboutcode-org/scancode-toolkit#3648 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>

fviernau · 2024-04-02T11:07:49Z

@AyanSinhaMahapatra, @sschuberth would it make sense to you to add a CLI option to toggle the feature for resolving the references?

sschuberth · 2024-04-02T11:12:26Z

@AyanSinhaMahapatra, @sschuberth would it make sense to you to add a CLI option to toggle the feature for resolving the references?

I'm not sure, as changing that CLI option would invalidate the scan storage on the ORT side. So I'd probably rather continue filtering the results like we do.

See [1] for context and the test data used. [1]: aboutcode-org/scancode-toolkit#3648 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>

fviernau · 2024-04-02T14:45:22Z

I'm not sure, as changing that CLI option would invalidate the scan storage on the ORT side. So I'd probably rather continue filtering the results like we do.

I believe that these two approaches lead to different scan results. If the feature flag was added we would keep the "license-ref" findings, while with the filtering neither the license ref findings nor the inlined findings are present. I could imagine that some users might want to have findings for such references - and then a toggle probably could indeed make sense.

sschuberth · 2024-04-02T14:49:29Z

I believe that these two approaches lead to different scan results.

You lost me now. Anyway, I feel this is a discussion we should have in the ORT community, not at the ScanCode side.

See [1] for context and the test data used. [1]: aboutcode-org/scancode-toolkit#3648 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>

sschuberth · 2024-04-03T06:31:20Z

@AyanSinhaMahapatra as part of a discussion in ORT it came up that we now seem to have at least two kinds of reference findings for licenses in ScanCode:

Those that are identified by the new from_file field containing something different than path (modulo this issue).
References that contain "reference" as part of the LicenseRef name (AFAIK the only one is LicenseRef-scancode-unknown-license-reference).

How do these relate to each other? In particular, do 2. also contain from_file different from path, or are these exclusively for references external to the code base? Or phrased differently, if we take only findings where from_file == path, is there a risk to miss LicenseRef-scancode-unknown-license-reference findings?

Thanks for any insights.

bennati added the bug label Jan 30, 2024

bennati mentioned this issue Jan 30, 2024

Handle ScanCode findings that are actually references to other files oss-review-toolkit/ort#8190

Closed

sschuberth mentioned this issue Mar 24, 2024

Upgrade to ScanCode 32.1.0 oss-review-toolkit/ort#8457

Closed

sschuberth closed this as completed Mar 28, 2024

sschuberth mentioned this issue Mar 28, 2024

The new from_file field contains the name of the input directory even with --strip-root #3712

Open

sschuberth mentioned this issue Mar 28, 2024

ScanCode conflates gpl-2.0 with gpl-2.0-plus in license_detections #3713

Open

fviernau mentioned this issue Apr 2, 2024

Upgrade ScanCode to version 32.1.0 oss-review-toolkit/ort#8470

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scancode confuses files with similar names #3648

Scancode confuses files with similar names #3648

bennati commented Jan 30, 2024

pombredanne commented Jan 31, 2024

AyanSinhaMahapatra commented Jan 31, 2024

pombredanne commented Feb 1, 2024

AyanSinhaMahapatra commented Feb 1, 2024

bennati commented Feb 1, 2024

sschuberth commented Feb 1, 2024

sschuberth commented Feb 1, 2024

AyanSinhaMahapatra commented Feb 1, 2024 •

edited

Loading

AyanSinhaMahapatra commented Feb 1, 2024

AyanSinhaMahapatra commented Feb 1, 2024

sschuberth commented Feb 1, 2024

sschuberth commented Feb 21, 2024

AyanSinhaMahapatra commented Feb 21, 2024 •

edited

Loading

sschuberth commented Feb 21, 2024

sschuberth commented Mar 28, 2024

fviernau commented Apr 2, 2024

sschuberth commented Apr 2, 2024

fviernau commented Apr 2, 2024

sschuberth commented Apr 2, 2024

sschuberth commented Apr 3, 2024

Scancode confuses files with similar names #3648

Scancode confuses files with similar names #3648

Comments

bennati commented Jan 30, 2024

pombredanne commented Jan 31, 2024

AyanSinhaMahapatra commented Jan 31, 2024

pombredanne commented Feb 1, 2024

AyanSinhaMahapatra commented Feb 1, 2024

bennati commented Feb 1, 2024

sschuberth commented Feb 1, 2024

sschuberth commented Feb 1, 2024

AyanSinhaMahapatra commented Feb 1, 2024 • edited Loading

AyanSinhaMahapatra commented Feb 1, 2024

AyanSinhaMahapatra commented Feb 1, 2024

sschuberth commented Feb 1, 2024

sschuberth commented Feb 21, 2024

AyanSinhaMahapatra commented Feb 21, 2024 • edited Loading

sschuberth commented Feb 21, 2024

sschuberth commented Mar 28, 2024

fviernau commented Apr 2, 2024

sschuberth commented Apr 2, 2024

fviernau commented Apr 2, 2024

sschuberth commented Apr 2, 2024

sschuberth commented Apr 3, 2024

AyanSinhaMahapatra commented Feb 1, 2024 •

edited

Loading

AyanSinhaMahapatra commented Feb 21, 2024 •

edited

Loading