Proposal: Add crawler for Mariner RPM packages #490

Open
tofay opened this issue Oct 12, 2022 · 0 comments

Related to clearlydefined/clearlydefined#156.

I propose adding the base Mariner 2.0 RPM repositories as a new harvest source. Specifically:

These will be used to harvest packages for coordinates of type rpm and provider mariner.

Fetcher

A new Mariner fetcher will be added that is configured with these repositories. The fetcher will cache each repository's repomd.xml metadata file and its packages SQLite database file, similar to the Debian fetcher's package file map caching.
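To illustrate the caching step, here is a minimal sketch of locating the packages database from a repository's repomd.xml. It assumes the standard yum repomd.xml layout, in which the SQLite database is listed as a `data` entry of type `primary_db`. The function name and the sample document are hypothetical and are not taken from the crawler.

```python
import xml.etree.ElementTree as ET

# XML namespace used by repomd.xml in yum/dnf-style repositories.
REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}


def find_metadata_href(repomd_xml, data_type="primary_db"):
    """Return the relative href of the given metadata type, or None if absent."""
    root = ET.fromstring(repomd_xml)
    for data in root.findall("repo:data", REPO_NS):
        if data.get("type") == data_type:
            location = data.find("repo:location", REPO_NS)
            if location is not None:
                return location.get("href")
    return None


# Hypothetical sample; real file names normally carry a checksum prefix.
SAMPLE_REPOMD = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <location href="repodata/example-primary.xml.gz"/>
  </data>
  <data type="primary_db">
    <location href="repodata/example-primary.sqlite.bz2"/>
  </data>
</repomd>"""

if __name__ == "__main__":
    print(find_metadata_href(SAMPLE_REPOMD))
```

The returned href is relative to the repository base URL, so the fetcher would join the two before downloading and caching the database.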

On receipt of a request for cd:rpm/mariner/-/$name/$revision, the fetcher will query each repository package database file for a match. If it finds a match, it will pull the RPM from the repository, and then extract it using rpm2cpio and cpio.

_Note that there may be multiple matches, e.g. "noarch" RPMs present in both the x86_64 and aarch64 repos. In this case the first one encountered is selected._
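The lookup and first-match rule could be sketched as follows. This is an illustration under assumptions, not the proposed implementation: it assumes each cached database exposes a `packages` table with `name`, `arch`, `version`, `release`, `location_href` and `rpm_sourcerpm` columns (as in createrepo-generated primary databases), and that the caller has already split the requested revision into version and release. The helper names and base URLs are hypothetical.

```python
import sqlite3

QUERY = (
    "SELECT name, arch, version, release, location_href, rpm_sourcerpm "
    "FROM packages WHERE name = ? AND version = ? AND release = ?"
)


def find_first_match(repositories, name, version, release):
    """Search repositories in order and return the first matching package.

    `repositories` is an ordered list of (base_url, sqlite3.Connection) pairs.
    """
    for base_url, conn in repositories:
        row = conn.execute(QUERY, (name, version, release)).fetchone()
        if row is not None:
            return {
                "name": row[0],
                "arch": row[1],
                "url": base_url.rstrip("/") + "/" + row[4],
                "source_rpm": row[5],
            }
    return None


def make_demo_db(rows):
    """Build an in-memory stand-in for a cached package database (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE packages (name TEXT, arch TEXT, version TEXT, "
        "release TEXT, location_href TEXT, rpm_sourcerpm TEXT)"
    )
    conn.executemany("INSERT INTO packages VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


# The same noarch package appears in both hypothetical repositories.
DEMO_ROW = ("demo", "noarch", "1.0", "1", "Packages/d/demo-1.0-1.noarch.rpm",
            "demo-1.0-1.src.rpm")
DEMO_REPOS = [
    ("https://example.invalid/x86_64", make_demo_db([DEMO_ROW])),
    ("https://example.invalid/aarch64", make_demo_db([DEMO_ROW])),
]

if __name__ == "__main__":
    print(find_first_match(DEMO_REPOS, "demo", "1.0", "1"))
```

Because the repositories are searched in their configured order, the selection is deterministic as long as that order is stable.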

If the RPM is a binary RPM, the source RPM URL is determined and included in the response, so that the clearlydefined service can use this as the source location URL.

Extracter

A new RPM extracter will be added. It runs the license detection tools over the extracted RPM. For binary RPMs, it also determines the coordinates of the source package.
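As a rough sketch of how source coordinates might be derived for a binary RPM, the snippet below parses a source RPM file name of the conventional `name-version-release.src.rpm` form. It assumes the source package uses the same `rpm` type and `mariner` provider, and that the revision is written as `version-release`. Neither is stated in this proposal, and the function name is hypothetical.

```python
SRC_SUFFIX = ".src.rpm"


def source_coordinates(source_rpm):
    """Map 'name-version-release.src.rpm' to an assumed coordinate string."""
    if not source_rpm.endswith(SRC_SUFFIX):
        raise ValueError(f"not a source RPM file name: {source_rpm!r}")
    nvr = source_rpm[: -len(SRC_SUFFIX)]
    # The name may itself contain hyphens, so split from the right.
    parts = nvr.rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"cannot parse name-version-release: {source_rpm!r}")
    name, version, release = parts
    # Assumption: revision is 'version-release'; type and provider are unchanged.
    return f"cd:rpm/mariner/-/{name}/{version}-{release}"


if __name__ == "__main__":
    print(source_coordinates("python-requests-2.27.1-1.cm2.src.rpm"))
```

Splitting from the right keeps hyphenated package names such as `python-requests` intact.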

Considerations for adding a new harvest source

Discoverability

The packages are organized in a standard yum repository format, so can be discovered using yum/dnf/tdnf or any tool capable of parsing that format. The packages themselves are in RPM format.

Primary Source

Yes, this is the primary source.

Reputability – is this repository operated by a reputable organization? What is the purpose behind running this repository? Is there an identifiable team that can be reached in the event of any issues?

Microsoft runs packages.microsoft.com to host Linux packages. Mariner is our Linux distribution for Azure first-party services. I can find contact details for the packages.microsoft.com admins.

Security – how secure is the repository? Is there a team that is available to handle issues in a timely manner when they arise? How fast do they respond to issues, such as when a security vulnerability is planted as a backdoor in a package?

The external sources are scanned by Mariner prior to building and packaging them, and the Mariner team provides CVE fixes for vulnerabilities discovered in packages.

Automation – does the repository support an API to support pulling of information? If not, is the package index organized in a schematized format that can be programmatically queried over HTTP(S) using the package name and version? When using HTTP to mine data, ClearlyDefined should check for the existence of robots.txt or robot headers that indicate such mining is unacceptable. How much effort is it to automate the process?

There isn’t an HTTP API. Clients have to download the repository metadata files to discover what packages are available, similar to yum/dnf/tdnf.

Relationship – reach out to the organization that maintains the repository to indicate that ClearlyDefined wishes to harvest data from their repository, with an explanation on how harvesting is done, what the data is used for and how much additional traffic this could result in. Identify/Resolve any concerns and provide a contact from ClearlyDefined in the event they need to support in case of an issue.

I'll find contact details for packages.microsoft.com maintainers.
