Skip to content

Latest commit

 

History

History
104 lines (68 loc) · 8.3 KB

DevelopBlog.md

File metadata and controls

104 lines (68 loc) · 8.3 KB

Develop Records

This document records the develop process including failure and its sollutions. Also, we add what we thought to be useful for further analysis.

Draft of Ecosystem

To get an overview of the ecosystem, we first develop a demo version to draw the draft of the ecosystem as fast as possible to evaluate the effectiveness of the reserach problem, which is "is there any security issues that have spread through dependencies across the ecosystem".

Get data

For scrawlers, we use requests for getting htmls and BeautifulSoup for analyse. However, when it requests try to get "crates.io", it recieves 404, which I can't get rid of. So I changed it to Scrapy. But it can't get dynamic contents. So I use Scrapy-Splash to achieve this goal. But it still has technique problems that can't be solved within days.

Problems:

However, we find other tricy methods. Crates.io database API is open! See https://crates.io/data-access for help.

Finally, we directly used the data from official crates.io.

Steps

Now, we have got all the data as csv format. Next, we follow the steps provided by the official to import it into PostgreSQL database. The next question is how to display the structure of the graph and find useful information.

RustEcoMonitor

We have constructed 5 steps to Analyse the data.

  1. Start point(Get Data, Send requiry, etc):
    1. Get top 1% popular crates(~774 crates) and all crates in Github Adversory Database.
    2. Analyse how large 1% packages can spread by dependents. And how large can Aversories reach by dependents.
    3. Find the relationship between team and person.
    4. etc
  2. Version match: If you want to get specific information of the crate, you have to give a specific verison or range, which requires version match.
    1. Version parse: Tell if a specific version is within the given range.
    2. Version selection: In Cargo dependency, there will only be one version among given range that will finally be used accoring to dependency rules.
  3. DB query (Or online API): In this step, we have to look up to the database or send quiries through API to get crates metadata.
  4. Dependency Graph: Through version match, we can form the dependency graph or dependent graph from the start point.
  5. Paint: Display graph or other forms in a appropriate way.

Implementaion

For fast developing, we use two paths: Python and Rust. The former helps get a demo fast. The later forms a release tool.

Python: We write scripts to roughly do the Version Match, and use postgreDB to access data and get dependency graph in .csv form, which is then loaded to Gephi to get final graph. We select a bug in http, which clearly caused much propagation. For the bug you can refer to https://github.com/advisories/GHSA-x7vr-c387-8w57.

Full Rust Path

To get complete and accurate results, we reuse components in Cargo project.

  1. Version Parse: See project semver file tests/test_version_req.rs line 35 fn test_basic(). You can change codes as the pic below. Then run test and modify to find out how it works. After that, you can just import semver library for "Version parse". .

  2. Version selection: Before Cargo builds the file, it will first generates Cargo.lock which contains specific version of each dependency crate. So we have to find out the logic codes of the generation.

    1. cargo::core::dependency::Dependency::parse(): Attempt to create a Dependency from an entry in the manifest. Might help in Version Selection of a single crate. https://docs.rs/cargo/latest/cargo/core/dependency/struct.Dependency.html

    2. Dependency graph: Struct cargo::util::graph::Graph. I think that in cargo::core::Resolve, Graph stores all one-level dependency in Graph, and each in Dependency. We need pick up logic codes that generate full Resolve. https://docs.rs/cargo/latest/cargo/core/struct.Resolve.html#

    3. Detailed implementation can be found in src/cargo/core/resolver/encode.rs.This module contains all machinery necessary to parse a Resolve from a Cargo.lock as well as serialize a Resolve to a Cargo.lock. Doc: https://docs.rs/cargo/latest/cargo/core/resolver/index.html

    4. From all above files, finally I find src/cargo/ops/resolve.rs and src/cargo/ops/cargo_generate_lockfile.rs , which support cargo generate-lockfile cmd. From generate_lockfile in src/cargo/ops/cargo_generate_lockfile.rs we can learn how a dependency graph is generated and then modify the code to perform the functionality we want. A demo code to get dependecy graph can be find at Demo.

    5. In order to accelerate the indexing process, we use a local registry instead. For details please refer to rust-lang/cargo#9471. When running we still can see updating index, but according to update_index method in src/cargo/registry/mod.rs, it says:

      /// Updates the index.
      ///
      /// For a remote registry, this updates the index over the network. Local
      /// registries only check that the index exists.
      fn update_index(&mut self) -> CargoResult<()>;

      Detail codes please go to rust_deps.

    6. However, when using multi-thread to accelerate the resolving process, we find out that there is a package cache lock (see cargo/src/ops/resolve.rs - resolve_with_previous), which forces to run resolving in sequence (that’s bad). Since we are using a local index, and we guess no writability is needed to done the dependency resolving, we try to remove the lock. A trace route is cargo/src/ops/resolve.rs - resolve_with_previous -> cargo/src/cargo/core/resolver/mod.rs - resolve -> cargo/src/cargo/core/resolver/mod.rs - activate_deps_loop -> cargo/src/cargo/core/resolver/mod.rs - activate -> cargo/src/cargo/core/resolver/dep_cache.rs - build_deps -> cargo/src/cargo/core/resolver/dep_cache.rs - query -> cargo/src/cargo/core/registry.rs - query -> cargo/src/cargo/core/registry.rs - ensure_loaded -> cargo/src/cargo/core/registry.rs - load -> …… It’s a little bit hard I think, thus I’m considering some other ways to remove the lock.

    7. Notice the annotation in cargo/src/cargo/util/config/mod.rs - acquire_package_cache_lock, we find out the lock can be read only, which means it can be shared. So I change ~/.cargo/.package-cache to readonly. It seems to work at first, but crashed due to reason I don’t quite understand. Then I start to think: can each thread use it’s own cache? It turns out working. But what’s strange is that it runs even slower, and appears relevance with number of jobs.

  3. DB query(Or Online API): In cargo/core/resolver/encode.rs fn into_resolve().

  4. Related Cargo doc: src/doc/contrib/src/architecture/files.md. Some tips for you: Search "Cargo.lock" in vscode, you may find related information.

Useful tools

Now, we've got some great tools as follows:

Useful discussion: