Proposal: high level file classification #426

pombredanne · 2017-01-04T11:35:01Z

To support #377 and other scan-based deduction and related refinements, an important step is to "classify" the files in the codebase being scanned. This would mean defining a few high level buckets and heuristics to classify a file in a bucket.

With such classification, smarter results could be provided: for instance the license of documentation files or build scripts does not have the same impact as the license of the main code (and may often not be part of a build or redistributed software as used in a system or app).

I am opening this up for discussion to define the classifications. I think there should be as few classifications as possible. They could be part of a hierarchy, but flat is probably better and simpler.

Here is a first shot at what these classes could be:

main code: would be all the code proper that is effectively built and used when a piece of code is used.
build scripts: such as Makefiles, poms, CMake lists, etc
test code: any code that is used for testing the main code, whether unit, integration or otherwise. In many cases this is stored in a tree separate from the main code, and it is often not part of the build meant to be used but is instead invoked during a build step (make check or similar).
doc: any code and documents that are documenting the code and often may not be part of the built code. This often includes generated documents for API docs.
assets and media: such as images, video, sounds, fonts, etc. often used in GUI and web apps. They often have different licenses and origins from the main code.
dev tools: these are scripts, binaries, packages, etc present in the codebase but meant to be used for development and not production. Frequently, their provenance and license may have little impact on the resulting licensing of the main built code.
metadata/metafiles: such as package manifests, LICENSE or COPYING files, etc that are describing top-level information for a codebase or a subset of it.
generated code: such as the output of a parser generator like Bison/lex, an ORM like Hibernate, WSDL tooling or similar. This may represent a large volume of code at times and may have no directly identified provenance, which then needs to be traced back to the "descriptor" used to generate the code. It may or may not contain injected code under various licenses (such as a Bison skeleton).
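
To make the discussion concrete, here is a minimal sketch of how such buckets and path-based heuristics could look. Everything in it is an assumption for illustration: the `FileClass` enum, the `classify` function and the name/extension tables are hypothetical and are not an existing ScanCode API. It returns a single class per file (the flat, one-bucket model) and does not attempt to detect dev tools or generated code, which would need more than path heuristics.

```python
from enum import Enum
from pathlib import PurePosixPath


class FileClass(Enum):
    """The proposed buckets; names are illustrative only."""
    MAIN = "main code"
    BUILD = "build scripts"
    TEST = "test code"
    DOC = "doc"
    ASSET = "assets and media"
    DEV = "dev tools"
    META = "metadata/metafiles"
    GENERATED = "generated code"


# Hypothetical lookup tables: a real classifier would need far more entries.
META_NAMES = {"license", "copying", "package.json"}
BUILD_NAMES = {"makefile", "pom.xml", "cmakelists.txt"}
TEST_DIRS = {"test", "tests"}
DOC_DIRS = {"doc", "docs"}
ASSET_EXTENSIONS = {".png", ".jpg", ".svg", ".mp4", ".wav", ".ttf"}


def classify(path: str) -> FileClass:
    """Classify a POSIX-style path into one bucket, first matching rule wins."""
    p = PurePosixPath(path.lower())
    parent_dirs = set(p.parts[:-1])
    if p.name in META_NAMES:
        return FileClass.META
    if p.name in BUILD_NAMES:
        return FileClass.BUILD
    if parent_dirs & TEST_DIRS:
        return FileClass.TEST
    if parent_dirs & DOC_DIRS:
        return FileClass.DOC
    if p.suffix in ASSET_EXTENSIONS:
        return FileClass.ASSET
    # Default bucket when no other heuristic applies.
    return FileClass.MAIN
```

For example, `classify("tests/test_api.py")` falls in the test bucket and `classify("src/main.c")` falls through to main code. The rule order is itself a design choice: here a LICENSE file inside a `docs/` directory is treated as metadata rather than doc.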

Note that a file may end up in more than one class... not sure this would be a good thing.

Besides this classification, determining whether a file is deployed or not as part of a production build, and built vs. not built, is another topic altogether which would not be covered explicitly here.


steven-esser · 2017-01-10T23:39:28Z

I think pointing out or emphasizing metadata files like LICENSE or COPYING in scan results would be a great addition. When I am doing analysis of 3rd party stuff, these are the first things I look at and if a project takes the time to include these, they are almost always correct.

If these files appeared somewhere near the top of scan results wherever they are being viewed (html app or AboutCode manager), that would really be helpful during analysis.

steven-esser · 2017-04-11T22:45:30Z

@pombredanne Would this make more sense as another fileinfo scan field, or as an additional thing added on after the fact, like the scan_errors field is for each file?

pombredanne · 2017-04-19T17:55:26Z

@MaJuRG sorry for the late reply! a fileinfo field makes the most sense

pombredanne · 2018-02-09T08:58:42Z

from @mjherzog #873 which is moved here instead

We currently have several "file type" fields returned from a scan:

Type: either File or Directory
MIME Type
File Type
Binary
Text File
Archive File
Media File
Source File
Script File
Package Type

For this topic, I will ignore Type since this just covers File vs Directory and focus on files only. We need some simpler way to identify the file type in one field to facilitate filtering in AboutCode Manager and other tools. MIME Type and File Type each have pros and cons.

In many cases MIME Type seems more useful because it summarizes the type a bit more - e.g. "text/x-shellscript" is probably more useful than corresponding File Types like "Bourne-Again shell script, ASCII text executable" and "POSIX shell script, ASCII text executable" because I primarily want to find all of the script files (which often do not have an extension).

On the other hand, MIME Type seems to use "application/octet-stream" as a catch-all (the "octet-stream" subtype is used to indicate that a body contains arbitrary binary data), which is not very helpful.
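
As a rough illustration of this trade-off, the sketch below groups MIME types into coarse buckets and flags the catch-all as carrying no signal. The `mime_group` function, the group labels and the `SCRIPT_MIME_TYPES` set are assumptions made up for this example; only "text/x-shellscript" and "application/octet-stream" come from the discussion above.

```python
# Illustrative set: only "text/x-shellscript" is taken from the comment above.
SCRIPT_MIME_TYPES = {"text/x-shellscript", "text/x-perl"}


def mime_group(mime_type: str) -> str:
    """Return a coarse group for a MIME type string."""
    mime_type = mime_type.strip().lower()
    if mime_type == "application/octet-stream":
        # Catch-all for arbitrary binary data: nothing useful to group on.
        return "unknown"
    if mime_type in SCRIPT_MIME_TYPES:
        return "script"
    top_level = mime_type.split("/", 1)[0]
    if top_level in ("image", "audio", "video", "font"):
        return "media"
    if top_level == "text":
        return "text"
    return "other"
```

Files that come back as "unknown" would need another signal, such as the File Type string, to be classified further.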

It may be the case that we could get the best result with a new Summary File Type field where the possible values are: Binary, Archive, Text, Media, Source or Script, but I am not sure whether a scan will resolve to only one of these values (presumably we have multiple fields today because of some overlap).

The primary use case is that I want to easily filter for Binary and Source code files which are the primary targets for analysis. The secondary use case is to easily filter for chunks like Script or Media files. This will also be important for filtering DeltaCode results to set up alerts/warnings for code files, but ignore or lower the priority of changes to Script or Media files.

I reviewed some scans and noticed many shell script files show up as Text rather than Script so the current identification of Script: true/false is not going to help much.
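
One possible way to handle the overlap is to collapse the boolean fields into a single value using a fixed precedence order. The sketch below assumes each file is a dict with boolean flags named `is_binary`, `is_text`, `is_archive`, `is_media`, `is_source` and `is_script`, mirroring the fields listed above; those key names, the precedence order and both functions are assumptions for illustration, not an agreed design.

```python
from typing import Iterable, List, Optional

# Assumed precedence: when several flags are true, the more specific one wins.
PRECEDENCE = [
    ("is_archive", "Archive"),
    ("is_media", "Media"),
    ("is_script", "Script"),
    ("is_source", "Source"),
    ("is_binary", "Binary"),
    ("is_text", "Text"),
]


def summary_file_type(file_info: dict) -> Optional[str]:
    """Return the first matching label in precedence order, or None."""
    for flag, label in PRECEDENCE:
        if file_info.get(flag):
            return label
    return None


def filter_by_type(files: Iterable[dict], wanted: set) -> List[dict]:
    """Keep only the files whose summary type is in the wanted set."""
    return [f for f in files if summary_file_type(f) in wanted]
```

With this, the primary use case becomes something like `filter_by_type(files, {"Binary", "Source"})`. It does not fix the misidentification noted above: a shell script flagged only as Text would still resolve to Text.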

pombredanne · 2018-03-23T19:38:55Z

Something to consider is ClearlyDefined facets. It would be best to align classifications with these.

See
https://github.com/clearlydefined/website/blob/2639d4ed878d199a2eb381fb3448d1b74875cd1f/src/components/FacetSelect.js#L10 and https://github.com/clearlydefined/clearlydefined/blob/8f58a9a216cf7c129fe2cf6abe1cc6f960535e0b/docs/clearly.md#facets

Also the notion of "scope" for dependencies is closely related. See https://github.com/heremaps/oss-review-toolkit/blob/master/model/src/main/kotlin/Scope.kt#L27

mjherzog · 2019-10-11T16:46:10Z

Some comments:

Fonts may need to be separate from other media/assets - they are almost always licensed.
We probably need to keep this separate from CD facets. Facets are not yet well defined and we do not know how this concept will evolve.
The end goal is to support policies for analysis and conclusions where you have a primary Content Type that you can reference in a policy for what files you want to analyze and/or which licenses for a package are secondary because of the Content Type.

mjherzog · 2019-10-11T16:47:21Z

#1754 Prototype new summary/primary Content Type prototype

viragumathe5 · 2020-03-10T19:33:13Z

@pombredanne I would like to comment on this; here is how I think it could be achieved.

Over the last day I looked at how projects lay out their documentation and other directories. If a directory is for documentation, it should contain mostly Markdown, HTML or YML files. If a directory is for plugins, it should contain mostly script files, plus some tests that also include script files. So we could build a dataset of the number of files of each type found in a directory, and use those counts to decide the type of that directory.
Another way to do this is to check the format of each individual file and map the files one by one to a directory type.
For example, say I am looking for an archive directory: I would write a script that knows all the formats an archive file can have and check each file against them. Even so, it would still be difficult to state the type of the directory.

So the first way may be easier to implement and sounds more practical.
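
A minimal sketch of the first, count-based approach could look like the following. The extension table, the kind labels, the `classify_directory` name and the 50% threshold are all hypothetical choices for illustration; a dataset built from real projects would replace them.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a kind of content.
EXTENSION_KINDS = {
    ".md": "documentation", ".html": "documentation", ".yml": "documentation",
    ".sh": "scripts", ".py": "scripts",
    ".zip": "archives", ".tar": "archives", ".gz": "archives",
}


def classify_directory(file_names, threshold=0.5):
    """Return the dominant kind of a directory, or None if nothing dominates."""
    kinds = Counter(
        EXTENSION_KINDS.get(PurePosixPath(name).suffix.lower(), "other")
        for name in file_names
    )
    total = sum(kinds.values())
    if total == 0:
        return None
    kind, count = kinds.most_common(1)[0]
    # Require a strict majority of known files before deciding.
    if kind == "other" or count / total <= threshold:
        return None
    return kind
```

A directory holding `index.md`, `api.html`, `conf.yml` and `build.sh` would come out as documentation (3 of 4 files), while an evenly split directory yields no decision.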

pombredanne added enhancement file info labels Jan 4, 2017

mjherzog changed the title ~~Proposal: high level files classification~~ Proposal: high level file classification Apr 11, 2017

armijnhemel mentioned this issue Apr 20, 2017

Do not scan certain data-only data file #602

Open

pombredanne mentioned this issue Apr 21, 2017

Incorrect programming language reported for a config.h.in file #605

Open

pombredanne mentioned this issue Jul 26, 2017

Don't mark SVG files as source #702

Merged

pombredanne mentioned this issue Sep 22, 2017

Introduce scan "recipes" #767

Closed

pombredanne added this to the v3.0 milestone Oct 20, 2017

pombredanne mentioned this issue Feb 9, 2018

Need summary file type #873

Closed

pombredanne mentioned this issue Mar 14, 2018

Do not fail to scan broken package manifests (such as broken test files) [was: Successful scan outputs to stderr] #983

Closed

pombredanne mentioned this issue Mar 28, 2018

Proposal: Introduce configurable Rules for scan classification, summarization, refinements and inference #1012

Open

pombredanne mentioned this issue Apr 13, 2018

Scanning NPM packages that list SPDX IDs results in false positives #1032

Closed

pombredanne mentioned this issue Apr 27, 2018

RFC: introduce new "autoscan" mode #1049

Closed

pombredanne mentioned this issue Sep 17, 2018

Implement a license clarity score #1180

Closed

pombredanne modified the milestones: v3.0, v3.1 Nov 5, 2018

pombredanne modified the milestones: v3.1 Documentation, documentation, documentation, v3.2 Feb 16, 2019

pombredanne mentioned this issue Mar 13, 2019

Improve Programming language detection and classification #1445

Open

pombredanne added the summaries label Jun 11, 2019

pombredanne mentioned this issue Oct 14, 2019

Create plugin to determine file categories #1745

Open

pombredanne removed this from the v3.3 milestone Sep 24, 2021

pombredanne added this to the v32.1 milestone Jan 4, 2023

pombredanne mentioned this issue Jan 13, 2024

Meta Issue: File classification and categorization #3639

Open
