This document records the develop process including failure and its sollutions. Also, we add what we thought to be useful for further analysis.
To get an overview of the ecosystem, we first develop a demo version to draw the draft of the ecosystem as fast as possible to evaluate the effectiveness of the reserach problem, which is "is there any security issues that have spread through dependencies across the ecosystem".
For scrawlers, we use requests
for getting htmls and BeautifulSoup
for analyse. However, when it requests
try to get "crates.io", it recieves 404, which I can't get rid of. So I changed it to Scrapy
. But it can't get dynamic contents. So I use Scrapy-Splash
to achieve this goal. But it still has technique problems that can't be solved within days.
Problems:
- Splash docker fails: https://blog.csdn.net/whatday/article/details/86762264
- How to use scrapy-splash: https://www.youtube.com/watch?v=mTOXVRao3eA
- Relative path resolution: Unsolved
- Timeout: Unsolved
However, we find other tricy methods. Crates.io database API is open! See https://crates.io/data-access for help.
Finally, we directly used the data from official crates.io.
Now, we have got all the data as csv format. Next, we follow the steps provided by the official to import it into PostgreSQL database. The next question is how to display the structure of the graph and find useful information.
We have constructed 5 steps to Analyse the data.
- Start point(Get Data, Send requiry, etc):
- Get top 1% popular crates(~774 crates) and all crates in Github Adversory Database.
- Analyse how large 1% packages can spread by dependents. And how large can Aversories reach by dependents.
- Find the relationship between team and person.
- etc
- Version match: If you want to get specific information of the crate, you have to give a specific verison or range, which requires version match.
- Version parse: Tell if a specific version is within the given range.
- Version selection: In Cargo dependency, there will only be one version among given range that will finally be used accoring to dependency rules.
- DB query (Or online API): In this step, we have to look up to the database or send quiries through API to get crates metadata.
- Dependency Graph: Through version match, we can form the dependency graph or dependent graph from the start point.
- Paint: Display graph or other forms in a appropriate way.
For fast developing, we use two paths: Python and Rust. The former helps get a demo fast. The later forms a release tool.
Python: We write scripts to roughly do the Version Match, and use postgreDB to access data and get dependency graph in .csv form, which is then loaded to Gephi to get final graph. We select a bug in http, which clearly caused much propagation. For the bug you can refer to https://github.com/advisories/GHSA-x7vr-c387-8w57.
To get complete and accurate results, we reuse components in Cargo project.
-
Version Parse: See project
semver
filetests/test_version_req.rs line 35 fn test_basic()
. You can change codes as the pic below. Thenrun test
and modify to find out how it works. After that, you can just importsemver
library for "Version parse"..
-
Version selection: Before Cargo builds the file, it will first generates
Cargo.lock
which contains specific version of each dependency crate. So we have to find out the logic codes of the generation.-
cargo::core::dependency::Dependency::parse()
: Attempt to create a Dependency from an entry in the manifest. Might help in Version Selection of a single crate. https://docs.rs/cargo/latest/cargo/core/dependency/struct.Dependency.html -
Dependency graph:
Struct cargo::util::graph::Graph
. I think that incargo::core::Resolve
, Graph stores all one-level dependency inGraph
, and each inDependency
. We need pick up logic codes that generate full Resolve. https://docs.rs/cargo/latest/cargo/core/struct.Resolve.html# -
Detailed implementation can be found in
src/cargo/core/resolver/encode.rs
.This module contains all machinery necessary to parse aResolve
from aCargo.lock
as well as serialize aResolve
to aCargo.lock
. Doc: https://docs.rs/cargo/latest/cargo/core/resolver/index.html -
From all above files, finally I find
src/cargo/ops/resolve.rs
andsrc/cargo/ops/cargo_generate_lockfile.rs
, which supportcargo generate-lockfile
cmd. Fromgenerate_lockfile
insrc/cargo/ops/cargo_generate_lockfile.rs
we can learn how a dependency graph is generated and then modify the code to perform the functionality we want. A demo code to get dependecy graph can be find at Demo. -
In order to accelerate the indexing process, we use a local registry instead. For details please refer to rust-lang/cargo#9471. When running we still can see
updating index
, but according toupdate_index
method insrc/cargo/registry/mod.rs
, it says:/// Updates the index. /// /// For a remote registry, this updates the index over the network. Local /// registries only check that the index exists. fn update_index(&mut self) -> CargoResult<()>;
Detail codes please go to rust_deps.
-
However, when using multi-thread to accelerate the resolving process, we find out that there is a package cache lock (see
cargo/src/ops/resolve.rs - resolve_with_previous
), which forces to run resolving in sequence (that’s bad). Since we are using a local index, and we guess no writability is needed to done the dependency resolving, we try to remove the lock. A trace route iscargo/src/ops/resolve.rs - resolve_with_previous
->cargo/src/cargo/core/resolver/mod.rs - resolve
->cargo/src/cargo/core/resolver/mod.rs - activate_deps_loop
->cargo/src/cargo/core/resolver/mod.rs - activate
->cargo/src/cargo/core/resolver/dep_cache.rs - build_deps
->cargo/src/cargo/core/resolver/dep_cache.rs - query
->cargo/src/cargo/core/registry.rs - query
->cargo/src/cargo/core/registry.rs - ensure_loaded
->cargo/src/cargo/core/registry.rs - load
-> …… It’s a little bit hard I think, thus I’m considering some other ways to remove the lock. -
Notice the annotation in
cargo/src/cargo/util/config/mod.rs - acquire_package_cache_lock
, we find out the lock can be read only, which means it can be shared. So I change~/.cargo/.package-cache
to readonly. It seems to work at first, but crashed due to reason I don’t quite understand. Then I start to think: can each thread use it’s own cache? It turns out working. But what’s strange is that it runs even slower, and appears relevance with number of jobs.
-
-
DB query(Or Online API): In
cargo/core/resolver/encode.rs fn into_resolve()
. -
Related Cargo doc:
src/doc/contrib/src/architecture/files.md
. Some tips for you: Search "Cargo.lock" in vscode, you may find related information.
Now, we've got some great tools as follows:
- PostgreSQL(Applied): Database that Cargo used.
- VSCode Plugin(Applied): "PostgreSQL", very helpful.
- Gephi(Applied): Display graphs and networks. It is very powerful and renders awesome pictures.
- Gephi can't run in java 11: Use newest nightly version https://github.com/gephi/gephi
- Alternative solution(not applied): https://linuxanteportas.wordpress.com/2020/12/25/running-gephi-on-fedora-33/
- No graph is shown: open the 'Window' menu and select 'Graph
- Install PostgreSQL(Applied): https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-ubuntu-20-04
- Remote Access to pgAdmin4(Not applied): https://blog.logrocket.com/setting-up-a-remote-postgres-database-server-on-ubuntu-18-04/
- Can't connect to DB in python: https://stackoverflow.com/questions/18664074/getting-error-peer-authentication-failed-for-user-postgres-when-trying-to-ge
- Can't find package after using pip3: try
python3 -m pip install xx
Useful discussion: