When should we next squash the index? #47
The actual script:

```sh
set -ex

now=`date '+%Y-%m-%d'`

git fetch origin
git reset --hard origin/master
head=`git rev-parse HEAD`

git push -f git@github.com:rust-lang/crates.io-index $head:refs/heads/snapshot-$now

msg=$(cat <<-END
Collapse index into one commit

Previous HEAD was $head, now on the \`snapshot-$now\` branch

More information about this change can be found [online] and on [this issue]

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: https://github.com/rust-lang/crates-io-cargo-teams/issues/47
END
)

new_rev=$(git commit-tree HEAD^{tree} -m "$msg")
git push \
    git@github.com:rust-lang/crates.io-index \
    $new_rev:refs/heads/master \
    --force-with-lease=refs/heads/master:$head
```

Edit: to include the critical `--force-with-lease`.

Squashing only requires push access to the crates.io-index repository, which any admin of the rust-lang GitHub organization has (and probably more).

I think it'd be best to do some measurements here directly correlated with the metrics we care about. The original rationale for squashing was that initial clones took quite a long time downloading so much history. As a result I would suspect that we should establish thresholds along the lines of "how big is the download and how much would we save with a squash?"
I'm also able to do it, and bors of course can (dunno if bors is an admin). I think that's it though.

This was discussed at the crates.io meeting. Here were the key points.
The main unresolved questions, which we'd like to get answers from the Cargo team on, are:
My personal answers to those questions, which do not represent consensus among any team(s), are:
A follow-up to @joshtriplett's suggestion to clone the index as-is: apparently we can get git to do this correctly! (Others should check if they are getting the same results.) The thing I tried (https://github.com/Eh2406/crates.io-index/commit/65419fd5f5b9758b95fa08f207276639b1426e43) is to add a new squash commit on top of the existing one from last time. I did not make a script, I just did it manually. It may be sufficient to just share the same root commit, if someone wants to give that a try.
Looks like it works with the root in common, using The root can be found with
For my own personal takes on some of the unresolved questions:
I don't have any problem with losing communication about this, I don't think it's really all that important especially now that it went so smoothly the first time. I do have a slightly different concern though. I think it would be a failure mode of Cargo if the index were automatically rolled up every day (defeating the purpose of delta updates), and having a fully automated process may cause us to not realize we're getting close to that situation. I am, however, very much in favor of automation. So to allay my concern I would request that a notification of some form be sent out to interested team members when a squash happens. (aka I just want an email of some form)
I would personally measure this in megabytes of data to download rather than either metric you mentioned, but commits are likely a good proxy for the megabytes being downloaded. My ideal metric would be something like "we shave 100MB off a clean download of the index", and the 100 number there is pulled out of thin air and could be more like 50 or something like that.
I think the first index squash went from roughly 90MB to 10MB (ish) for a clean initial download. Along those lines I'd say that we should wait until a squash would save at least 70MB before squashing.
@alexcrichton One question, if git can download a roll up in
AFAIK git just downloads objects and doesn't do any diffing at the fetch layer. Delta updates work because most indexes have a huge shared history. If we roll into one commit frequently there's no shared history so git will keep downloading the entire new history, which would be fresh each time. So to answer your question, I don't believe git can have any sort of delta update when the history is changed and so I would still consider it a failure mode.
For users who already have the latest version of the index, Git will generally see that the tree object for the single squashed commit is identical to the tree object it already has (since it has the same hash), so it will only download the single new commit object.

So another solution may be to always keep, say, the last month's worth of commits in the history, and only squash the bits that are older than one month. All users who have updated in the month before squashing will be able to download deltas, and only users with an even older version of the index will have to redownload it in full. When squashing the old commits, all commits on top of them will have to be rewritten, so users will have to redownload the commit objects. However, commit objects hardly contain any data, and the associated tree objects are identical, so they won't be retransmitted.

I did some experiments for this approach, and got somewhat mixed results with what Git is able to detect, but I believe it is possible to make it work. It would require some work to figure out the details, though.
We had some discussion in the crates.io Discord channel (can't figure out how to permalink it), and things aren't quite as easy as indicated in my previous comment. I may have time to do some experiments later this week, but I don't make any promises.
link to the discussion: https://discordapp.com/channels/442252698964721669/448525639469891595/597888610376613901
We did not have time to discuss this at the Cargo meeting today. So we don't have any new answers for @sgrif.
I was thinking maybe we open an issue on the index repo and have the script add a comment there; then anyone interested (in teams or not) can subscribe to that issue to get notifications. I would want to look into @Nemo157's suggestion for how to get git not to download the history at all well before we start doing a squash every week.
```
> git clone -b master --single-branch https://github.com/rust-lang/crates.io-index.git
...
Receiving objects: 100% (297740/297740), 67.54 MiB | 5.79 MiB/s, done.

> git clone -b master --single-branch https://github.com/smarnach/crates.io-index
Cloning into 'crates.io-index'...
...
Receiving objects: 100% (36539/36539), 14.01 MiB | 5.75 MiB/s, done.
```

So it looks like we save ~54 MiB today. Assuming a linear size per commit, we would hit 70 MiB saved at ~72k commits. So it looks like people's instincts are approximately in the same ballpark.
It sounds like we don't need to keep a window of commits on the main branch, and we just need to archive the squashed-away commits on an archive branch? And since the server has those available it can do deltas from those objects? That sounds perfect.
We discussed this at the Cargo meeting today.
Yes! Several of us would like some form of notification when it happens, but it does not need to be in advance and we do not need to publicize the event.
We realized that it was hard to make a decision due to a bikeshed effect; we all had different opinions, but none strong enough to convince anyone. So we decided: whatever is easiest for you to set up. If you need someone to make a decision: a daily check of whether we are over the commit limit.
After some discussion @ehuss pointed out that it is already noticeable, and @nrc pointed out that we want to have the script do something the first time it runs. We don't want it to break things on some random day in 3 months when we have none of this paged in. So if it is time based then every 6 months; if it is commit based then 50k. Most importantly, we can monitor it and adjust the threshold later if needed.

We had some discussion of whether this will cause existing users to download the full index on each squash day. My understanding from our discussion with @Nemo157 and @smarnach on Discord is that the current plan will not trigger a full download. The GitHub repo will always have a commit referencing all tree objects that the client will have, so GitHub will have what it needs to do a delta even when master has just been squashed. No git gc can remove the tree objects, as they're used by a backup branch. @ehuss wanted to recheck to make sure that this works as hoped.
Will move forward with a prototype that squashes when the commit count is >50k
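A minimal sketch of such a trigger, assuming it runs inside a checkout of the index; the 50k threshold comes from the discussion above, and the squash step itself is a placeholder for the script earlier in the thread:

```shell
# Hedged sketch: fire a squash only once past a commit-count threshold.
# The squash itself is a placeholder for the script discussed above.
threshold=50000
count=$(git rev-list --count HEAD)
if [ "$count" -gt "$threshold" ]; then
  echo "index has $count commits (> $threshold); time to squash"
fi
```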
I've been doing some tests, and Alex's original script seems to work pretty well. I've tried with a copy fetched by cargo that is anywhere from 10 to 1,000 to 10,000 commits old, and it seemed to properly download just the minimum necessary. A fresh download (delete CARGO_HOME) from a squashed index is about a 15MB download, which uses about 16MB of disk space. Compare that to the current size which is about 73MB download using about 79MB of disk space. The only issue I see is that for existing users, it does not release the disk usage. The only way I've determined to delete the old references is to run:
Cargo currently has a heuristic where it automatically runs |
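The exact command was elided above. Purely for illustration (these are standard git commands, not necessarily the ones meant in the thread), dropping references to the pre-squash history and reclaiming disk could look like:

```shell
# Illustrative sketch (standard git, not necessarily the elided command):
# expire reflog entries pointing at the old history, then prune the
# now-unreachable objects to reclaim disk space.
git reflog expire --expire-unreachable=now --all
git gc --prune=now
```

This is destructive: once the reflog entries are expired and the objects pruned, the pre-squash history can no longer be recovered locally.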
I'd be totally down for expanding Cargo's gc commands, and if Cargo can share indexes even across squashes that's even better!
@ehuss looks (https://git-scm.com/docs/git-reflog) like the

@sgrif what is the progress on the prototype?
@sgrif this recently came up again on internals, so I wanted to ping again to see if you've got progress on a prototype? I don't mind running the script manually one more time before we get automation set up again. If I don't hear back from you in a week or so I'll go ahead and do that and we can continue along the automation track!
Previous HEAD was e669e72, now on the `snapshot-2019-10-17` branch

More information about this change can be found online:

* https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
* rust-lang/crates-io-cargo-teams#47
* https://internals.rust-lang.org/t/re-squash-the-crates-io-repository/11121
Ok I briefly talked with @sgrif on IRC and the index has been squashed! We'll be sure to have automation for the next one :)
It looks like the index has grown considerably since the last squash (looks like it is 75MB now, and can be squashed down to about 20MB). @rust-lang/crates-io is there any progress on automating the process? Is there anything I can do to help? If there are barriers to setting up a cron job, can someone run the script manually?
I've re-squashed the index
When you squash the index in the future, are you able to squash, as an example, everything older than 1 week instead of every commit in the repo at the time it's squashed? I only ask because I currently use the commit history as a changes feed for the crates index, and if all commits are squashed one day, I would potentially lose any changes since the last time my automated process checked the commit history. This would give me a week's buffer to run it before losing any information.
I don't think so. A commit with a long history does not have the same hash as a commit with 1 week of history. So if you only walk master, you're just going to see new commits that happen to do the same thing as the old commits but are not equal. The code to handle that may as well be code to walk the backup branches; it feels like the same level of complexity.
If you just compare the trees rather than walking commits it should work fine (e.g. from looking at the code I think
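As a hedged sketch of that tree-comparison idea: even after history is rewritten, the trees of any two commits can be diffed directly, so a changes feed can diff the last-seen tree against the new tip instead of walking commits. The variable names below are illustrative placeholders, not from the thread:

```shell
# Sketch: diff trees instead of walking commits. This works across a
# squash because tree objects are unchanged by history rewriting.
# "$last_seen" and "$new_tip" are illustrative placeholders.
git diff --name-status "$last_seen^{tree}" "$new_tip^{tree}"
```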
Looks like it may be that time once again.
This was last squashed on 2020-08-04, so we will need to automate the squashing if we're looking at doing this every few months.
Previous HEAD was 1b7e17a, now on the `snapshot-2020-11-20` branch

More information about this change can be found [online] and on [this issue]

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
@pietroalbini that was my original plan, but then I remembered that the deployed slug on Heroku doesn't include source/files from the git repo. With some tweaks something like
Previous HEAD was a5dcd84, now on the `snapshot-2021-05-05` branch

More information about this change can be found [online] and on [this issue]

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
Ran a manual squash: rust-lang/crates.io-index@4a44357
This adds a background job that squashes the index into a single commit. The current plan is to manually enqueue this job on a 6 week schedule, roughly aligning with new `rustc` releases. Before deploying this, we will need to make sure that the SSH key is allowed to do a force push to the protected master branch.

This job is derived from a [script] that was periodically run by the cargo team. There are a few minor differences relative to the original script:

* The push of the snapshot branch is no longer forced. The job will fail if run more than once on the same day. (If the first attempt fails before pushing a new root commit upstream, then retries should succeed as long as the snapshot can be fast-forwarded.)
* The push of the new root commit to the origin no longer uses `--force-with-lease` to reject the force push if new commits have been pushed there in parallel. Other than the occasional manual changes to the index (such as deleting crates), background jobs have exclusive write access to the index while running. Given that such manual changes are rare, this job completes quickly, and such manual tasks should be automated too, this is low risk.

The alternative is to shell out to git, because `libgit2` (and thus the `git2` crate) does not yet support this portion of the protocol.

[script]: rust-lang/crates-io-cargo-teams#47 (comment)
In today's crates.io team meeting, the team agreed that in terms of workload/coordination we have no concerns with scheduling an index squash every ~6 weeks. I have an initial implementation migrating the script into a background job at rust-lang/crates.io@a7efdcd. The main open item is working with infra to determine if we want to allow the SSH key used by the service to do a forced push to the repo or if that should be reserved for a special SSH key. Until now, the service has treated the index as fast-forward-only.
Previous HEAD was baed40a, now on the `snapshot-2021-06-23` branch

More information about this change can be found [online] and on [this issue].

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
Previous HEAD was ebab036, now on the `snapshot-2021-06-26` branch

More information about this change can be found [online] and on [this issue].

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
Add a background job for squashing the index

This adds a background job that squashes the index into a single commit. The current plan is to manually enqueue this job on a 6 week schedule, roughly aligning with new `rustc` releases. Before deploying this, we will need to make sure that the SSH key is allowed to do a force push to the protected master branch.

This job is derived from a [script] that was periodically run by the cargo team. Relative to the original script, the push of the snapshot branch is no longer forced. The job will fail if run more than once on the same day. (If the first attempt fails before pushing a new root commit upstream, then retries should succeed as long as the snapshot can be fast-forwarded.)

[script]: rust-lang/crates-io-cargo-teams#47 (comment)
The background job to run the squash has been merged, and was just run. Squashed commit: rust-lang/crates.io-index@3804ec0
Previous HEAD was 4181c62, now on the `snapshot-2021-07-02` branch

More information about this change can be found [online] and on [this issue].

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
The cargo index has been squashed again: rust-lang/crates.io-index@8fe6ce0
Previous HEAD was f954048, now on the `snapshot-2021-09-24` branch

More information about this change can be found [online] and on [this issue].

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
I've started noticing that crates.io index fetching is taking a while again on slow connections/CPUs. It looks like we're at more commits (44k) than before we last squashed (34k). Is it time to schedule a new squash?
Previous HEAD was 94b5429, now on the `snapshot-2021-12-21` branch

More information about this change can be found [online] and on [this issue].

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
Thanks for the reminder, @adamncasey. The index has been squashed.
Previous HEAD was ba5efd5, now on the `snapshot-2022-03-02` branch

More information about this change can be found [online] and on [this issue].

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
The index has been squashed. Previous HEAD was
@jtgeibel I was wondering if you could look at squashing again. I'm not sure if that is in a cron job or if it is still manual. It looks like it has been about 4 months since the last squash. The index is currently 237MB which is about the largest I've ever seen it, which can take a considerable amount of time to clone and unpack.
Previous HEAD was 075e7a6, now on the `snapshot-2022-07-06` branch

More information about this change can be found [online] and on [this issue].

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
Thanks for the ping @ehuss, invoking the squash is still manual. We still need to automate the archiving (to the archive repo) and eventual deletion of the snapshot branches (from the main repo). Previous HEAD was
@jtgeibel Just checking in again to see if we can get another squash. The index is currently over 150MB and 34434 commits and takes about a minute to clone on a fast-ish system.
Previous HEAD was 31a1d8c, now on the `snapshot-2022-08-31` branch

More information about this change can be found [online] and on [this issue].

[online]: https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440
[this issue]: rust-lang/crates-io-cargo-teams#47
Previous HEAD was

This is the next-to-smallest snapshot in terms of commits. I just deleted a temporary branch that was left behind on the main repo, so it is possible we weren't getting optimal compression server side. I plan to remove the snapshot branch from the main repo in about 10 days.
Last (only) time (https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440) we had 100k+ commits and we thought we waited a little too long (given how smoothly it went). Now we have 51k + ~1.5k/week.
The Cargo team discussed this today and we think we should do this soon. Not to interrupt whatever you are working on, but when you have a chance. Who has the permissions to run that script? Is it just @alexcrichton?
As the index grows we should have a policy for when we plan to do the squash. When we have a policy, we should plan to make a bot to ensure we follow it. It is reasonable to say that it is too soon, or we could make a simple policy for now and grow it as we need. The Cargo team discussed a policy like "when we remember, approximately every 3-6 months" or "... approximately at 50k commits" or "... approximately when the squash is half the size of the history"