Related to clearlydefined/clearlydefined#156.
I propose adding the base Mariner 2.0 RPM repositories as a new harvest source. Specifically:
These will be used to harvest packages for coordinates of type rpm and provider mariner.
Fetcher
A new Mariner fetcher will be added that is configured with these repositories. The fetcher will cache each repository's repomd.xml metadata file and its package SQLite database file, similar to the package file map caching used for Debian.
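As a rough sketch of the metadata step, a fetcher could parse repomd.xml to find where the package database lives before caching it. The snippet uses Python purely for illustration and is not the crawler's actual implementation. The function name find_metadata_href and the sample file names are hypothetical; the XML namespace and the primary_db data type follow the standard yum repodata layout.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace used by repomd.xml in standard yum/dnf repositories.
REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}


def find_metadata_href(repomd_xml: str, data_type: str = "primary_db") -> Optional[str]:
    """Return the repository-relative path of one metadata file, or None."""
    root = ET.fromstring(repomd_xml)
    for data in root.findall("repo:data", REPO_NS):
        if data.get("type") == data_type:
            location = data.find("repo:location", REPO_NS)
            if location is not None:
                return location.get("href")
    return None


# Hypothetical, heavily trimmed repomd.xml used only to exercise the function.
SAMPLE_REPOMD = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <location href="repodata/abc-primary.xml.gz"/>
  </data>
  <data type="primary_db">
    <location href="repodata/abc-primary.sqlite.bz2"/>
  </data>
</repomd>"""

primary_db_href = find_metadata_href(SAMPLE_REPOMD)
```

The returned path is relative to the repository base URL, and the database it points to is normally compressed, so it would need to be decompressed before being opened.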
On receipt of a request for cd:rpm/mariner/-/$name/$revision, the fetcher will query each repository's package database file for a match. If it finds a match, it will pull the RPM from the repository and then extract it using rpm2cpio and cpio.
_Note that there may be multiple matches, e.g. "noarch" RPMs present in both the x86_64 and aarch64 repos. In this case the first one encountered is selected._
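The lookup and extraction described above could be sketched roughly as follows, in Python for illustration only. Several things here are assumptions rather than part of the proposal: that $revision has the form version-release, that the cached database uses the usual primary_db packages table (name, version, release, arch, location_href, rpm_sourcerpm), and the names find_first_match and extract_rpm, which are hypothetical.

```python
import os
import sqlite3
import subprocess
from typing import Iterable, Optional, Tuple

# Assumed primary_db column names; "release" is quoted because it is an SQL keyword.
QUERY = (
    "SELECT arch, location_href, rpm_sourcerpm FROM packages "
    'WHERE name = ? AND version = ? AND "release" = ? LIMIT 1'
)


def find_first_match(
    repos: Iterable[Tuple[str, sqlite3.Connection]], name: str, revision: str
) -> Optional[dict]:
    """Return the first match across repositories, in configured order."""
    # Assumption: revision is "<version>-<release>"; RPM forbids "-" in either part.
    version, _, release = revision.partition("-")
    for base_url, db in repos:
        row = db.execute(QUERY, (name, version, release)).fetchone()
        if row is not None:
            arch, href, source_rpm = row
            return {
                "arch": arch,
                "url": base_url.rstrip("/") + "/" + href,
                "source_rpm": source_rpm,
            }
    return None


def extract_rpm(rpm_path: str, dest_dir: str) -> None:
    """Unpack an RPM payload into dest_dir, equivalent to `rpm2cpio x.rpm | cpio -idm`."""
    os.makedirs(dest_dir, exist_ok=True)
    with subprocess.Popen(
        ["rpm2cpio", os.path.abspath(rpm_path)], stdout=subprocess.PIPE
    ) as unpack:
        # cpio: -i extract, -d create directories, -m keep modification times.
        subprocess.run(["cpio", "-idm"], stdin=unpack.stdout, cwd=dest_dir, check=True)
    if unpack.returncode != 0:
        raise subprocess.CalledProcessError(unpack.returncode, unpack.args)
```

Because the repositories are walked in configuration order, a "noarch" package present in several of them resolves to whichever repository is listed first, which matches the selection rule in the note above.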
If the RPM is a binary RPM, the source RPM URL is determined and included in the response, so that the ClearlyDefined service can use it as the source location URL.
Extractor
A new RPM extractor will be added. This runs the license detection tools over the extracted RPM. For binary RPMs, it also determines the coordinates of the source package.
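One way the source package coordinates might be derived is from the source RPM file name recorded for a binary RPM (for example in its SOURCERPM header tag). The sketch below is in Python for illustration only. It assumes the conventional name-version-release.src.rpm naming and that the revision is version-release; neither is specified by this proposal. The function names and the example file name are hypothetical.

```python
from typing import Tuple

SRC_SUFFIX = ".src.rpm"


def parse_source_rpm(filename: str) -> Tuple[str, str, str]:
    """Split '<name>-<version>-<release>.src.rpm' into its three parts."""
    if not filename.endswith(SRC_SUFFIX):
        raise ValueError(f"not a source RPM file name: {filename!r}")
    stem = filename[: -len(SRC_SUFFIX)]
    # Names may contain '-', versions and releases may not, so split from the right.
    parts = stem.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected source RPM file name: {filename!r}")
    name, version, release = parts
    return name, version, release


def source_coordinates(filename: str) -> str:
    """Build coordinates in the cd:rpm/mariner/-/$name/$revision shape."""
    name, version, release = parse_source_rpm(filename)
    # Assumption: the revision is '<version>-<release>'.
    return f"cd:rpm/mariner/-/{name}/{version}-{release}"


coords = source_coordinates("example-lib-1.2.3-4.cm2.src.rpm")
```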
Considerations for adding a new harvest source
Discoverability
The packages are organized in a standard yum repository format, so they can be discovered using yum/dnf/tdnf or any tool capable of parsing that format. The packages themselves are in RPM format.
Primary Source
Yes, this is the primary source.
Reputability – is this repository operated by a reputable organization? What is the purpose behind running this repository? Is there an identifiable team that can be reached in the event of any issues?
Microsoft runs packages.microsoft.com to host Linux packages. Mariner is our Linux distribution for Azure first-party services. I can find contact details for the packages.microsoft.com admins.
Security – how secure is the repository? Is there a team that is available to handle issues in a timely manner when they arise? How fast do they respond to issues, such as when a security vulnerability is planted as a backdoor in a package?
The external sources are scanned by Mariner prior to building and packaging them, and the Mariner team provides CVE fixes for vulnerabilities discovered in packages.
Automation – does the repository support an API to support pulling of information? If not, is the package index organized in a schematized format that can be programmatically queried over HTTP(S) using the package name and version? When using HTTP to mine data, ClearlyDefined should check for the existence of robots.txt or robot headers that indicate such mining is unacceptable. How much effort is it to automate the process?
There isn’t an HTTP API. Clients have to download the repository metadata files to discover what packages are available, similar to yum/dnf/tdnf.
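Since the metadata files are fetched over plain HTTP(S), the robots.txt check mentioned in the Automation question could be sketched with Python's standard urllib.robotparser module. This is an illustration only: the user agent string, the example URLs, and the sample rules are hypothetical, and the real robots.txt would have to be downloaded from the repository host first.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, url: str, user_agent: str = "clearlydefined-crawler") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules used only to exercise the check.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

allowed = may_fetch(SAMPLE_ROBOTS, "https://example.invalid/repo/repodata/repomd.xml")
blocked = may_fetch(SAMPLE_ROBOTS, "https://example.invalid/private/file")
```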
Relationship – reach out to the organization that maintains the repository to indicate that ClearlyDefined wishes to harvest data from their repository, with an explanation of how harvesting is done, what the data is used for and how much additional traffic this could result in. Identify and resolve any concerns, and provide a contact from ClearlyDefined in case they need support with an issue.
I'll find contact details for packages.microsoft.com maintainers.