New summary/primary Content Type prototype #1754

Open
mjherzog opened this issue Oct 10, 2019 · 2 comments

Comments

@mjherzog
Member

ScanCode currently reports the following 9 fields for determining the content type of a file:

  • MIME Type
  • File Type
  • Language
  • Is Binary
  • Is Text
  • Is Archive
  • Is Media
  • Is Source
  • Script

When organizing a codebase for analysis, it would be useful to consolidate this data into a single field that indicates the level and type of analysis to apply. The focus should be on identifying source and binary "program/code" files that are copyrightable and likely to be licensed. There will be many specialized file types (e.g. those used only by a proprietary software program) that will not be covered by this feature.

In the prototype phase we will use a ScanCode plugin to analyze a Scan and annotate it with a new primary_content_type field that designates the primary Content Type (primary for analysis) in the format: Language-Type.

An initial test could be to report a primary Content Type that distinguishes between SourceCode and Scripts for files written in programming languages (e.g. Python, Ruby) that have both. There should be patterns in the current set of Content Type data that make this distinction possible.
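
As a rough illustration of the Language-Type idea, here is a minimal sketch in plain Python. It is not the plugin from this issue and does not use the ScanCode plugin API. It assumes each scanned file is available as a dict with `programming_language`, `is_source` and `is_script` keys, and the function name `primary_content_type` and the labels `Script` and `SourceCode` are chosen here for illustration only.

```python
def primary_content_type(resource):
    """Return a 'Language-Type' label for one file record, or None.

    `resource` is assumed to be a dict of per-file scan data. The key
    names used below are assumptions made for this sketch.
    """
    language = resource.get("programming_language")
    if not language:
        # No detected language: this sketch does not classify the file.
        return None

    # Check the script flag first, in case a script is also flagged as source.
    if resource.get("is_script"):
        kind = "Script"
    elif resource.get("is_source"):
        kind = "SourceCode"
    else:
        return None

    return f"{language}-{kind}"


if __name__ == "__main__":
    sample = {"programming_language": "Python", "is_source": True, "is_script": True}
    print(primary_content_type(sample))
```

Checking the script flag before the source flag is a design choice for this sketch: it keeps the two labels mutually exclusive even if both flags are set on the same file.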

johnmhoran added a commit that referenced this issue Oct 11, 2019
* Includes LICENSE.txt which seems to have been added during the
  './configure.bat' process.

Signed-off-by: John M. Horan <johnmhoran@gmail.com>
johnmhoran added a commit that referenced this issue Oct 11, 2019
* Added basic directory structure and basic files.
* Initial test code in plugin_primary_content.py.

Signed-off-by: John M. Horan <johnmhoran@gmail.com>
johnmhoran added a commit that referenced this issue Oct 11, 2019
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran
Member

As an initial step to get familiar with contenttype.py and related files/processes, I will:

  1. create a test to verify that .awk files are identified as text files and as scripts, and modify the code as needed to reach that result; and
  2. add an attribute and related test(s) to identify build scripts.
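
One possible shape for the build-script attribute in step 2 is a simple file name lookup. This is only a sketch under assumptions: the thread does not show how the real attribute is implemented, the function name `is_build_script` is hypothetical, and the set of names below is illustrative (only `make.bat` is mentioned in a later commit message on this issue).

```python
from pathlib import PurePosixPath

# Hypothetical, illustrative set of well-known build file names (lowercase).
# Only 'make.bat' comes from this issue; the rest are assumptions.
BUILD_SCRIPT_NAMES = {"makefile", "make.bat", "build.xml", "cmakelists.txt"}


def is_build_script(path):
    """Return True if the file name of `path` is a known build script name.

    `path` is assumed to be a POSIX-style path string such as 'src/Makefile'.
    """
    return PurePosixPath(path).name.lower() in BUILD_SCRIPT_NAMES


if __name__ == "__main__":
    for candidate in ("project/Makefile", "win/make.bat", "src/main.py"):
        print(candidate, is_build_script(candidate))
```

A name-based lookup is cheap and easy to test, but it would miss build scripts with unconventional names, so a content-based check might still be needed.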

johnmhoran added a commit that referenced this issue Oct 14, 2019
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
johnmhoran added a commit that referenced this issue Oct 14, 2019
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
johnmhoran added a commit that referenced this issue Oct 14, 2019
Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@johnmhoran johnmhoran changed the title from "Prototype new summary/primary Content Type prototype" to "New summary/primary Content Type prototype" Oct 15, 2019
johnmhoran added a commit that referenced this issue Oct 15, 2019
* Also added 'make.bat' and related test to 'is_build()' property.

Signed-off-by: John M. Horan <johnmhoran@gmail.com>
@pombredanne
Member

The code behind this issue and branch is now in https://github.com/nexB/typecode but there never was a PR for https://github.com/nexB/scancode-toolkit/compare/1754-primary-content
Should we find a way to migrate this to typecode?
