Skip to content

Commit 9ded34a

Browse files
committed
Auto merge of #10807 - dtolnay-contrib:sha256, r=weihanglo
Apply GitHub fast path even for partial hashes ### What does this PR try to resolve? As flagged in #10079 (comment), it's not great to assume that git SHAs would always be 40-character hex strings. In the future they will be longer. > Git is on a long-term trajectory to move to SHA256 hashes ([current status](https://lwn.net/SubscriberLink/898522/f267d0e9b4fe9983/)). I suppose when that becomes available/the default it's possible for a 40-digit hex-encoded hash not to be the full hash. Will this fail for that case? The implementation from #10079 fails in that situation because it turns dependencies of the form `{ git = "…", rev = "[…40 hex…]" }` into fetches with a refspec `+[…40 hex…]:refs/commit/[…40 hex…]`. That only works if the 40 hex digits are the *full* long hash of your commit. If it's really a prefix ("short hash") of a 64-hex-digit SHA-256 commit hash, you'd get a failure that resembles: ```console error: failed to get `dependency` as a dependency of package `repro v0.0.0` Caused by: failed to load source for dependency `dependency` Caused by: Unable to update https://github.com/rust-lang/dependency?rev=b30694b4d9b29141298870b7993e9aee10940524 Caused by: revspec 'b30694b4d9b29141298870b7993e9aee10940524' not found; class=Reference (4); code=NotFound (-3) ``` This PR updates the implementation so that Cargo will curl GitHub to get a resolved long commit hash *even if* the `rev` specified for the git dependency in Cargo.toml already looks like a SHA-1 long hash. ### Performance considerations ⛔ This reverses a (questionable, negligible) benefit of #10079 of skipping the curl when `rev` is a long hash and is not already present in the local clone. These curls take 200-250ms on my machine. 🟰 We retain the much larger benefit of #10079 which comes from being able to precisely fetch a single `rev`, instead of fetching all branches and tags in the upstream repo and hoping to find the rev somewhere in there. This accounts for the entire performance difference explained in the summary of that PR. 🟰 We still skip the curl when `rev` is a **long hash** of a commit that is already previously fetched. 🥳 After this PR, we also curl and hit fast path when `rev` is a **short hash** of some upstream commit. For example `{ git = "https://github.com/rust-lang/cargo", rev = "b30694b4d9" }` would previously have done the download-all-branches-and-tags codepath because `b30694b4d9` is not a long hash. After this PR, the curl to GitHub informs us that `b30694b4d9` resolves to the long hash `b30694b4d9b29141298870b7993e9aee10940524`, and we download just that commit instead of all-branches-and-tags. ### How should we test and review this PR? I tested with the following dependency specs, using `/path/to/target/release/cargo generate-lockfile`. ```toml # Before: slow path (fetch all branches and tags; 70K git objects) # After: fast path (20K git objects) cargo = { git = "https://github.com/rust-lang/cargo", rev = "b30694b4d9b2914129" } ``` ```toml # Before and after: fast path cargo = { git = "https://github.com/rust-lang/cargo", rev = "b30694b4d9b29141298870b7993e9aee10940524" } ``` ```toml # Before and after: fast path cargo = { git = "https://github.com/rust-lang/cargo", rev = "refs/heads/rust-1.14.0" } ``` ```toml # Before and after: same error "revspec 'rust-1.14.0' not found" # You are supposed to use `branch = "rust-1.14.0"`, this is not considered a `rev` cargo = { git = "https://github.com/rust-lang/cargo", rev = "rust-1.14.0" } ``` I made sure these all work both with and without `rm -rf ~/.cargo/git/db/cargo-* ~/.cargo/git/checkouts/cargo-*` in between each cargo invocation. FYI `@djc`
2 parents 6da7267 + 36053de commit 9ded34a

File tree

1 file changed

+96
-26
lines changed

1 file changed

+96
-26
lines changed

src/cargo/sources/git/utils.rs

+96-26
Original file line numberDiff line numberDiff line change
@@ -7,14 +7,15 @@ use crate::util::{network, Config, IntoUrl, MetricsCounter, Progress};
77
use anyhow::{anyhow, Context as _};
88
use cargo_util::{paths, ProcessBuilder};
99
use curl::easy::List;
10-
use git2::{self, ErrorClass, ObjectType};
10+
use git2::{self, ErrorClass, ObjectType, Oid};
1111
use log::{debug, info};
1212
use serde::ser;
1313
use serde::Serialize;
1414
use std::env;
1515
use std::fmt;
1616
use std::path::{Path, PathBuf};
1717
use std::process::Command;
18+
use std::str;
1819
use std::time::{Duration, Instant};
1920
use url::Url;
2021

@@ -759,11 +760,15 @@ pub fn fetch(
759760

760761
// If we're fetching from GitHub, attempt GitHub's special fast path for
761762
// testing if we've already got an up-to-date copy of the repository
762-
match github_up_to_date(repo, url, reference, config) {
763-
Ok(true) => return Ok(()),
764-
Ok(false) => {}
765-
Err(e) => debug!("failed to check github {:?}", e),
766-
}
763+
let oid_to_fetch = match github_fast_path(repo, url, reference, config) {
764+
Ok(FastPathRev::UpToDate) => return Ok(()),
765+
Ok(FastPathRev::NeedsFetch(rev)) => Some(rev),
766+
Ok(FastPathRev::Indeterminate) => None,
767+
Err(e) => {
768+
debug!("failed to check github {:?}", e);
769+
None
770+
}
771+
};
767772

768773
// We reuse repositories quite a lot, so before we go through and update the
769774
// repo check to see if it's a little too old and could benefit from a gc.
@@ -793,11 +798,10 @@ pub fn fetch(
793798
}
794799

795800
GitReference::Rev(rev) => {
796-
let is_github = || Url::parse(url).map_or(false, |url| is_github(&url));
797801
if rev.starts_with("refs/") {
798802
refspecs.push(format!("+{0}:{0}", rev));
799-
} else if is_github() && is_long_hash(rev) {
800-
refspecs.push(format!("+{0}:refs/commit/{0}", rev));
803+
} else if let Some(oid_to_fetch) = oid_to_fetch {
804+
refspecs.push(format!("+{0}:refs/commit/{0}", oid_to_fetch));
801805
} else {
802806
// We don't know what the rev will point to. To handle this
803807
// situation we fetch all branches and tags, and then we pray
@@ -994,45 +998,79 @@ fn init(path: &Path, bare: bool) -> CargoResult<git2::Repository> {
994998
Ok(git2::Repository::init_opts(&path, &opts)?)
995999
}
9961000

1001+
enum FastPathRev {
1002+
/// The local rev (determined by `reference.resolve(repo)`) is already up to
1003+
/// date with what this rev resolves to on GitHub's server.
1004+
UpToDate,
1005+
/// The following SHA must be fetched in order for the local rev to become
1006+
/// up to date.
1007+
NeedsFetch(Oid),
1008+
/// Don't know whether local rev is up to date. We'll fetch _all_ branches
1009+
/// and tags from the server and see what happens.
1010+
Indeterminate,
1011+
}
1012+
9971013
/// Updating the index is done pretty regularly so we want it to be as fast as
9981014
/// possible. For registries hosted on GitHub (like the crates.io index) there's
9991015
/// a fast path available to use [1] to tell us that there's no updates to be
10001016
/// made.
10011017
///
10021018
/// This function will attempt to hit that fast path and verify that the `oid`
1003-
/// is actually the current branch of the repository. If `true` is returned then
1004-
/// no update needs to be performed, but if `false` is returned then the
1005-
/// standard update logic still needs to happen.
1019+
/// is actually the current branch of the repository.
10061020
///
10071021
/// [1]: https://developer.github.com/v3/repos/commits/#get-the-sha-1-of-a-commit-reference
10081022
///
10091023
/// Note that this function should never cause an actual failure because it's
10101024
/// just a fast path. As a result all errors are ignored in this function and we
10111025
/// just return a `bool`. Any real errors will be reported through the normal
10121026
/// update path above.
1013-
fn github_up_to_date(
1027+
fn github_fast_path(
10141028
repo: &mut git2::Repository,
10151029
url: &str,
10161030
reference: &GitReference,
10171031
config: &Config,
1018-
) -> CargoResult<bool> {
1032+
) -> CargoResult<FastPathRev> {
10191033
let url = Url::parse(url)?;
10201034
if !is_github(&url) {
1021-
return Ok(false);
1035+
return Ok(FastPathRev::Indeterminate);
10221036
}
10231037

1038+
let local_object = reference.resolve(repo).ok();
1039+
10241040
let github_branch_name = match reference {
10251041
GitReference::Branch(branch) => branch,
10261042
GitReference::Tag(tag) => tag,
10271043
GitReference::DefaultBranch => "HEAD",
10281044
GitReference::Rev(rev) => {
10291045
if rev.starts_with("refs/") {
10301046
rev
1031-
} else if is_long_hash(rev) {
1032-
return Ok(reference.resolve(repo).is_ok());
1047+
} else if looks_like_commit_hash(rev) {
1048+
// `revparse_single` (used by `resolve`) is the only way to turn
1049+
// short hash -> long hash, but it also parses other things,
1050+
// like branch and tag names, which might coincidentally be
1051+
// valid hex.
1052+
//
1053+
// We only return early if `rev` is a prefix of the object found
1054+
// by `revparse_single`. Don't bother talking to GitHub in that
1055+
// case, since commit hashes are permanent. If a commit with the
1056+
// requested hash is already present in the local clone, its
1057+
// contents must be the same as what is on the server for that
1058+
// hash.
1059+
//
1060+
// If `rev` is not found locally by `revparse_single`, we'll
1061+
// need GitHub to resolve it and get a hash. If `rev` is found
1062+
// but is not a short hash of the found object, it's probably a
1063+
// branch and we also need to get a hash from GitHub, in case
1064+
// the branch has moved.
1065+
if let Some(local_object) = local_object {
1066+
if is_short_hash_of(rev, local_object) {
1067+
return Ok(FastPathRev::UpToDate);
1068+
}
1069+
}
1070+
rev
10331071
} else {
10341072
debug!("can't use github fast path with `rev = \"{}\"`", rev);
1035-
return Ok(false);
1073+
return Ok(FastPathRev::Indeterminate);
10361074
}
10371075
}
10381076
};
@@ -1065,18 +1103,50 @@ fn github_up_to_date(
10651103
handle.get(true)?;
10661104
handle.url(&url)?;
10671105
handle.useragent("cargo")?;
1068-
let mut headers = List::new();
1069-
headers.append("Accept: application/vnd.github.3.sha")?;
1070-
headers.append(&format!("If-None-Match: \"{}\"", reference.resolve(repo)?))?;
1071-
handle.http_headers(headers)?;
1072-
handle.perform()?;
1073-
Ok(handle.response_code()? == 304)
1106+
handle.http_headers({
1107+
let mut headers = List::new();
1108+
headers.append("Accept: application/vnd.github.3.sha")?;
1109+
if let Some(local_object) = local_object {
1110+
headers.append(&format!("If-None-Match: \"{}\"", local_object))?;
1111+
}
1112+
headers
1113+
})?;
1114+
1115+
let mut response_body = Vec::new();
1116+
let mut transfer = handle.transfer();
1117+
transfer.write_function(|data| {
1118+
response_body.extend_from_slice(data);
1119+
Ok(data.len())
1120+
})?;
1121+
transfer.perform()?;
1122+
drop(transfer); // end borrow of handle so that response_code can be called
1123+
1124+
let response_code = handle.response_code()?;
1125+
if response_code == 304 {
1126+
Ok(FastPathRev::UpToDate)
1127+
} else if response_code == 200 {
1128+
let oid_to_fetch = str::from_utf8(&response_body)?.parse::<Oid>()?;
1129+
Ok(FastPathRev::NeedsFetch(oid_to_fetch))
1130+
} else {
1131+
// Usually response_code == 404 if the repository does not exist, and
1132+
// response_code == 422 if exists but GitHub is unable to resolve the
1133+
// requested rev.
1134+
Ok(FastPathRev::Indeterminate)
1135+
}
10741136
}
10751137

10761138
fn is_github(url: &Url) -> bool {
10771139
url.host_str() == Some("github.com")
10781140
}
10791141

1080-
fn is_long_hash(rev: &str) -> bool {
1081-
rev.len() == 40 && rev.chars().all(|ch| ch.is_ascii_hexdigit())
1142+
fn looks_like_commit_hash(rev: &str) -> bool {
1143+
rev.len() >= 7 && rev.chars().all(|ch| ch.is_ascii_hexdigit())
1144+
}
1145+
1146+
fn is_short_hash_of(rev: &str, oid: Oid) -> bool {
1147+
let long_hash = oid.to_string();
1148+
match long_hash.get(..rev.len()) {
1149+
Some(truncated_long_hash) => truncated_long_hash.eq_ignore_ascii_case(rev),
1150+
None => false,
1151+
}
10821152
}

0 commit comments

Comments
 (0)