-
-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Scancode confuses files with similar names #3648
Comments
This is a really surprising bug! Thanks for the report |
@bennati @pombredanne
which is matched to https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/lgpl-2.1_and_gpl-2.0_and_gpl-3.0_3.RULE and references the other COPYING files. And it brings in the license detected at See scan results below (scanned with latest SCTK develop) See the match in question:
it mentions The |
@AyanSinhaMahapatra so if I may recap, you are saying that the detection is correct, and that the latest version helps better understand the detection details? |
Yes @pombredanne :) That's correct |
Thank you guys, I opened an issue in the ORT project to support this new feature oss-review-toolkit/ort#8190 |
I have to say that I find the implementation to be very confusing from a user perspective. There are plenty of tools out there that parse ScanCode result JSON files, and so far all of them could have relied on lines from matches to belong to the file / path they are nested under. But now, suddenly, additional info is added that easily goes unnoticed and basically says "eh, wait, this match actually comes from a different file despite being listed here"! Also, could you explain the general idea behind such references a bit more? Where's the benefit to refer from |
Oh, and on top of that, it seems there currently (with ScanCode 32.0.8) is no way to distinguish "real findings" from "reference findings" as there is no |
@sschuberth could you kindly explain more what is confusing here? License references to other files are pretty common and encountered widely across codebases. And there are generally a few cases and see #3547 for more details on this.
In both these cases, it is important to go to the referenced file and check for license information there, and adjust the license expression accordingly where it was referenced. Now when we are changing a license expression based on results in some other file (which is extremely important as this can be unknown/wrong, and a major source of false positives we get reports for), we need to have some clue that something was modified based on results in another file. Otherwise it would be more confusing where we detected the license expression at. So this is an improvement over previous versions where we were not following references, and we are also providing clues on how and from where we got the license matches from. If you're not interested in these matches, or only want to show matches which are from the same file, you can filter based on This feature of following references to resolve cases like these has been there for more than two years now, even before the recent license detection upgrade and output format change in v32. See #2616 where the initial implementation was added. Now we just provide more helpful info on which file this originated from, so that downstream tools can use this to show origin properly. It would have been nice to include this right at the start when we implemented following references, but we didn't get any feedback on this then. |
There is also some discussion with lot more details on #3547 where we are considering when to carry over referenced matches and when not to. Please chime in there too if you want to give feedback. |
And this |
Agreed. Hence I'm doubting the usefulness of tracking the place where a reference is being made if the target of the reference is included in the scan.
I was wondering why you say only "in some cases", but I guess you have cases in mind where the reference points to a file that is not included in the scan like "see license at http://my.server/LICENSE.txt". In such a case I see how it could be useful track this is a reference findings to an unknown license. While ScanCode may not look at the contents of http://my.server/LICENSE.txt, a human inspecting the scan results might do so.
I was preparing to do that, but that does not work with results generated by ScanCode 32.0.8 as it does not provide the
Interesting. So we probably have quite a few wrongly associated license findings in our databases, and it just didn't occur to us until now, where we ran into a case where the end line was exceeded.
I agree that the addition of |
@AyanSinhaMahapatra can you confirm my statement here, that before the addition of |
@sschuberth sorry about the late reply:
The usefulness is that we are resolving
Not exactly.
Yeah, this is merged in develop and we are working towards a release: v32.1 sooner than later.
Yeah that's correct, we have not had any feedback to point in this direction in the couple years since it's release. When we had feedback, we implemented it. And this was an improvement anyway, which we are building upon to reduce false positives.
I mean above that there is no straightforward/easy way to do this, but it could be done potentially with rescanning the same scan target, if it's possible to do that. One would have to check the associated rule for each match, and if there are any referenced_filenames for that rule. And for a scan if there referenced filenames in any of the detected matches, and that file is present in the codebase, do a rescan for the same and use the new results. |
Nah, that would be too involving, I believe. Thanks for your answers and insights! |
Let's close this now that the ScanCode 32.1.0 release as the |
See [1] for context and the test data used. [1]: aboutcode-org/scancode-toolkit#3648 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
See [1] for context and the test data used. [1]: aboutcode-org/scancode-toolkit#3648 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
See [1] for context and the test data used. [1]: aboutcode-org/scancode-toolkit#3648 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
@AyanSinhaMahapatra, @sschuberth would it make sense to you to add a CLI option to toggle the feature for resolving the references? |
I'm not sure, as changing that CLI option would invalidate the scan storage on the ORT side. So I'd probably rather continue filtering the results like we do. |
See [1] for context and the test data used. [1]: aboutcode-org/scancode-toolkit#3648 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
I believe that these two approaches lead to different scan results. If the feature flag was added we would keep the "license-ref" findings, while with the filtering neither the license ref findings nor the inlined findings are present. I could imagine that some users might want to have findings for such references - and then a toggle probably could indeed make sense. |
You lost me now. Anyway, I feel this is a discussion we should have in the ORT community, not at the ScanCode side. |
See [1] for context and the test data used. [1]: aboutcode-org/scancode-toolkit#3648 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
@AyanSinhaMahapatra as part of a discussion in ORT it came up that we now seem to have at least two kinds of reference findings for licenses in ScanCode:
How do these relate to each other? In particular, do 2. also contain Thanks for any insights. |
Scanning a folder containing only two files
COPYING
andCOPYING.LGPLv2.1
(seefiles.zip )
Scancode detects license LGPL in file
COPYING
at line 502, butCOPYING
has only 66 lines whileCOPYING.LGPLv2.1
has 502.Scancode output is in the attachment.
This issue can be reproduced with Scancode 32.0.8 but not with 31.2.6
The text was updated successfully, but these errors were encountered: