BUG: pypi dependencies are not consistently identified #1598

Open
DennisClark opened this issue Feb 25, 2025 · 9 comments
Labels: bug (Something isn't working), high priority

Comments

@DennisClark
Member

DennisClark commented Feb 25, 2025

I used the scan_single_package pipeline to scan the package (source code) available at
https://github.com/aboutcode-org/scancode.io/archive/refs/tags/v34.9.5.tar.gz
and SCIO v34.9.5 found 52 dependencies, but many of them are very incomplete: they have what appears to be a valid PURL but no Download URL or License. The problem packages are all from pypi.

It seems very strange that SCIO can identify a specific, valid version of these problem packages, which can be found online, but does not get a Download URL. This suggests that there are special aspects of the pypi repo that it is not handling very well. Please see the attached scan results.

scancodeio_scio-v34.9.5.json

@tdruez
Contributor

tdruez commented Feb 25, 2025

@DennisClark The scan_single_package pipeline is based on the scancode toolkit --package scanner.

    -p, --package     Scan <input> for application package and dependency
                      manifests, lockfiles and related data.

The results are loaded in the SCIO database as provided by the toolkit.

All 52 dependencies are generated from the install_requires option of the setup.cfg: https://github.com/aboutcode-org/scancode.io/blob/main/setup.cfg#L51-L113
Only a name and a version are available there, no download URLs, and no licenses.
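
To illustrate why only a name and a version are available, here is a minimal sketch of what can be derived statically from an install_requires option. It is not the toolkit's actual parser: the function name `pinned_purls_from_setup_cfg` and the sample content (including the made-up `example-package` entry) are assumptions for illustration.

```python
import configparser


def pinned_purls_from_setup_cfg(text):
    """Return pkg:pypi PURLs for the `name==version` entries of install_requires."""
    # Interpolation is disabled so that a literal "%" in a value cannot raise.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    raw = parser.get("options", "install_requires", fallback="")

    purls = []
    for line in raw.splitlines():
        # Drop inline comments first, then environment markers.
        requirement = line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in requirement:
            continue  # empty or unpinned entry: no exact version to report
        name, version = (part.strip() for part in requirement.split("==", 1))
        name = name.split("[", 1)[0]  # drop extras such as "name[extra]"
        # PyPI names in a PURL are lowercased, with "_" replaced by "-".
        purls.append(f"pkg:pypi/{name.lower().replace('_', '-')}@{version}")
    return purls


# Illustrative content only, not the real SCIO setup.cfg.
SAMPLE = """
[options]
install_requires =
    aboutcode-hashid==0.2.0
    Example_Package==1.0 ; python_version >= "3.10"
    unpinned-package>=2
"""

print(pinned_purls_from_setup_cfg(SAMPLE))
```

Nothing in this input carries a download URL or a license, which is why the resulting dependency entries stop at the PURL.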

To get those values, dynamic resolution is required. This is not part of scan_single_package but is available as the resolve_dependencies pipeline.
For Python manifests, the https://github.com/aboutcode-org/python-inspector library is used to collect the data.

The problem is that the SCIO codebase/app uses a setup.cfg file to declare dependencies, which is not supported by the python-inspector tool. I entered #1313 a while ago to raise this problem.

Currently, running the resolve_dependencies pipeline on the SCIO codebase results in a run failure.

@AyanSinhaMahapatra what's your take on this?

@tdruez
Contributor

tdruez commented Feb 25, 2025

As a workaround, I've extracted the list of dependencies into a requirements.txt file, as it should be supported by the python-inspector tool.

Using this file as the input for the resolve_dependencies:DynamicResolver,StaticResolver fails as well:

[RequirementInformation(requirement=<Requirement('lief==0.15.1')>, parent=Candidate(name='rust-inspector', version=<Version('0.1.0')>, extras=set()))]

Traceback:
  File "/opt/scancodeio/aboutcode/pipeline/__init__.py", line 199, in execute
    step(self)
  File "/opt/scancodeio/scanpipe/pipelines/resolve_dependencies.py", line 87, in get_packages_from_manifest
    self.resolved_packages = resolve.get_packages(
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/scanpipe/pipes/resolve.py", line 76, in get_packages
    if packages := resolve_manifest_resources(resource, package_registry):
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/scanpipe/pipes/resolve.py", line 52, in resolve_manifest_resources
    packages = get_packages_from_manifest(resource.location, package_registry) or []
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/scanpipe/pipes/resolve.py", line 176, in get_packages_from_manifest
    resolved_packages = resolver(input_location=input_location)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/scanpipe/pipes/resolve.py", line 195, in resolve_pypi_packages
    inspector_output = resolve_dependencies(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/python_inspector/api.py", line 289, in resolve_dependencies
    resolution, purls = resolve(
                        ^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/python_inspector/api.py", line 353, in resolve
    resolved_dependencies, packages = get_resolved_dependencies(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/python_inspector/api.py", line 394, in get_resolved_dependencies
    resolver_results = resolver.resolve(
                       ^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/resolvelib/resolvers/resolution.py", line 515, in resolve
    state = resolution.resolve(requirements, max_rounds=max_rounds)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/resolvelib/resolvers/resolution.py", line 456, in resolve
    raise ResolutionImpossible(self.state.backtrack_causes)

@tdruez
Contributor

tdruez commented Feb 25, 2025

@TG1999 See the above issues with the python-inspector tool. Can you provide some input?

@tdruez
Contributor

tdruez commented Feb 25, 2025

Similar issue for the DejaCode requirements.txt.

We should not fail the whole pipeline for a single problematic entry in the resolution, but rather continue and log errors as ProjectMessage.

[RequirementInformation(requirement=<Requirement('django-rest-hooks==1.6.1')>, parent=None)]

Cause: Requirements conflict: <Requirement('django-rest-hooks==1.6.1')>

Traceback:
  File "/opt/scancodeio/aboutcode/pipeline/__init__.py", line 199, in execute
    step(self)
  File "/opt/scancodeio/scanpipe/pipelines/resolve_dependencies.py", line 87, in get_packages_from_manifest
    self.resolved_packages = resolve.get_packages(
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/scanpipe/pipes/resolve.py", line 76, in get_packages
    if packages := resolve_manifest_resources(resource, package_registry):
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/scanpipe/pipes/resolve.py", line 52, in resolve_manifest_resources
    packages = get_packages_from_manifest(resource.location, package_registry) or []
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/scanpipe/pipes/resolve.py", line 176, in get_packages_from_manifest
    resolved_packages = resolver(input_location=input_location)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/scanpipe/pipes/resolve.py", line 195, in resolve_pypi_packages
    inspector_output = resolve_dependencies(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/python_inspector/api.py", line 289, in resolve_dependencies
    resolution, purls = resolve(
                        ^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/python_inspector/api.py", line 353, in resolve
    resolved_dependencies, packages = get_resolved_dependencies(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/python_inspector/api.py", line 394, in get_resolved_dependencies
    resolver_results = resolver.resolve(
                       ^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/resolvelib/resolvers/resolution.py", line 515, in resolve
    state = resolution.resolve(requirements, max_rounds=max_rounds)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/scancodeio/.venv/lib/python3.12/site-packages/resolvelib/resolvers/resolution.py", line 390, in resolve
    raise ResolutionImpossible(e.criterion.information) from e
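
A rough sketch of the "continue and log errors" idea mentioned above. Everything here is hypothetical: `resolve_each`, `ResolutionReport`, and `fake_resolver` are illustration names, the plain `errors` list stands in for ProjectMessage entries, and the real resolver call in scanpipe is not shown. Note that resolving entries one at a time is an assumption of this sketch; it would not detect conflicts between requirements.

```python
from dataclasses import dataclass, field


@dataclass
class ResolutionReport:
    packages: list = field(default_factory=list)
    errors: list = field(default_factory=list)  # stand-in for ProjectMessage entries


def resolve_each(requirements, resolve_one):
    """Resolve requirements one by one, recording failures instead of raising."""
    report = ResolutionReport()
    for requirement in requirements:
        try:
            report.packages.extend(resolve_one(requirement))
        except Exception as error:  # broad on purpose: one bad entry must not stop the run
            report.errors.append(f"{requirement}: {type(error).__name__}: {error}")
    return report


def fake_resolver(requirement):
    """Hypothetical resolver that fails on one entry, to exercise the error path."""
    if requirement == "django-rest-hooks==1.6.1":
        raise RuntimeError("Requirements conflict")
    return [f"pkg:pypi/{requirement.replace('==', '@')}"]


report = resolve_each(["aboutcode-hashid==0.2.0", "django-rest-hooks==1.6.1"], fake_resolver)
print(report.packages)
print(report.errors)
```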

@AyanSinhaMahapatra
Member

> Only a name and a version are available there, no download URLs, and no licenses.

@tdruez if there is a valid name and a version, there should be a download URL for any pypi package. As you pointed out, the dependency object only has purl fields and a couple of other things.

> To get those values, a dynamic resolution is required.

Or we can use a static resolver, if the dependencies are pinned, i.e. they have a version.

I'm thinking now that maybe it is better to have an add-on pipeline for dependency resolution than a separate pipeline, so it can be plugged in optionally after all the pipelines that detect package/dependency info (and deprecate the resolve_dependencies pipeline). What do you think?

@tdruez
Contributor

tdruez commented Feb 25, 2025

> if there is a valid name and a version, there should be a download URL for any pypi package, as you pointed out, the dependency object only has purl fields and a couple other things.

@AyanSinhaMahapatra could you expand on that? How do you generate a download URL for a pypi package from a name/version?

@AyanSinhaMahapatra
Member

@tdruez actually it's the source download URL for pypi, see https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/packagedcode/pypi.py#L2259 where we get the data API url, source download URL (repository_download_url) and repo homepage URL.

Note that it could also be possible to write something minimal to get this from the URL field of the API data for a project: https://pypi.org/pypi/scancode-toolkit/32.3.2/json, but this type of processing is currently only done in purldb, so it could be something we support with the enrich_with_purldb pipeline.
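
For reference, "something minimal" could look like the sketch below. It assumes the release JSON exposes a `urls` list whose entries carry `packagetype` and `url` keys; the function names are made up for illustration and this is not how purldb does it. The sample payload is trimmed by hand so the example runs offline, and the wheel URL in it is a placeholder.

```python
import json
from urllib.request import urlopen


def fetch_release(name, version):
    """Fetch the PyPI JSON data for one release (performs a network request)."""
    with urlopen(f"https://pypi.org/pypi/{name}/{version}/json", timeout=30) as response:
        return json.load(response)


def sdist_url_from_release(release_data):
    """Return the source distribution URL from release data, or None if absent."""
    for entry in release_data.get("urls", []):
        if entry.get("packagetype") == "sdist":
            return entry.get("url")
    return None


# Hand-trimmed sample shaped like the API response, for an offline run.
SAMPLE_RELEASE = {
    "urls": [
        {"packagetype": "bdist_wheel", "url": "https://example.invalid/placeholder.whl"},
        {
            "packagetype": "sdist",
            "url": "https://files.pythonhosted.org/packages/b1/6a/"
            "d16cd64a786c3264266d740279af96113f849e9c5110bcc1891553fe5ae0/"
            "aboutcode_hashid-0.2.0.tar.gz",
        },
    ]
}

print(sdist_url_from_release(SAMPLE_RELEASE))
# Online usage: sdist_url_from_release(fetch_release("aboutcode-hashid", "0.2.0"))
```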

@tdruez
Contributor

tdruez commented Feb 25, 2025

> actually it's the source download URL for pypi, see https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/packagedcode/pypi.py#L2259 where we get the data API url, source download URL (repository_download_url) and repo homepage URL.

Let's take the first entry of the discovered dependencies, pkg:pypi/aboutcode-hashid@0.2.0.

>>> from packagedcode.pypi import get_pypi_urls
>>> get_pypi_urls("aboutcode-hashid", "0.2.0")
{'repository_homepage_url': 'https://pypi.org/project/aboutcode-hashid', 'repository_download_url': 'https://pypi.org/packages/source/a/aboutcode-hashid/aboutcode-hashid-0.2.0.tar.gz', 'api_data_url': 'https://pypi.org/pypi/aboutcode-hashid/0.2.0/json'}

>>> import requests
>>> response = requests.get("https://pypi.org/packages/source/a/aboutcode-hashid/aboutcode-hashid-0.2.0.tar.gz")
>>> response
<Response [404]>

I do not think we can generate reliable pypi download URLs from that static data.

> Note that it could also be possible to write something minimal to get this from the URL field of the API data for a project: https://pypi.org/pypi/scancode-toolkit/32.3.2/json, but this type of processing is currently only done in purldb, so it could be something we support with the enrich_with_purldb pipeline.

Isn't it the whole purpose of the python-inspector though?

@AyanSinhaMahapatra
Member

> I do not think we can generate reliable pypi download URLs from that static data.

Actually yeah, I think this code is obsolete because they have changed their URLs a little bit, using the hashes for the download URL.

For example, in https://files.pythonhosted.org/packages/b1/6a/d16cd64a786c3264266d740279af96113f849e9c5110bcc1891553fe5ae0/aboutcode_hashid-0.2.0.tar.gz they use the BLAKE2b-256 hash of that file, b16ad16cd64a786c3264266d740279af96113f849e9c5110bcc1891553fe5ae0, which they now partition to create the URL.

But there is no way to get the URL without querying the API so this is not great.
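
To make the partitioning concrete, here is a small sketch. The 2/2/60 split is inferred from the example URL above, and the function names are only for illustration. The digest still has to come from somewhere, either the API or the file itself, so this does not remove the need for a query.

```python
import hashlib


def blake2b_256_hex(data: bytes) -> str:
    """Return the BLAKE2b-256 hex digest (64 hex characters) of some file content."""
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def hosted_file_url(digest_hex: str, filename: str) -> str:
    """Build a files.pythonhosted.org URL by splitting the digest as 2/2/60 characters."""
    if len(digest_hex) != 64:
        raise ValueError("expected a 64-character BLAKE2b-256 hex digest")
    return (
        "https://files.pythonhosted.org/packages/"
        f"{digest_hex[:2]}/{digest_hex[2:4]}/{digest_hex[4:]}/{filename}"
    )


# Digest and filename taken from the example above.
print(
    hosted_file_url(
        "b16ad16cd64a786c3264266d740279af96113f849e9c5110bcc1891553fe5ae0",
        "aboutcode_hashid-0.2.0.tar.gz",
    )
)
```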

> Isn't it the whole purpose of the python-inspector though?

Yeah, we do get all the metadata associated with Python packages after resolving them, and that contains all the URLs. But we don't have inspectors for all the package ecosystems, and python-inspector is also quite limited in its support of Python package manifests, so that's probably not a great general solution either 😅

tdruez added a commit that referenced this issue Mar 1, 2025