bootstrap: retry `cargo` invocations if stderr contains a known pattern #134472

marcoieni · 2024-12-18T15:21:06Z

Why

Our CI auto builds sometimes fail for known reasons that are not related to the PRs we are trying to merge.

Most of the times, these errors are hard to understand and fix (or can't be fixed at all), decreasing the success rate of the auto builds for several weeks or months.
The impact is that we loose days of parallel compute time and hours of maintainers time that need to analyze the error message and reschedule the PRs in the merge queue.

Feature

We want to list the the stderr of the known issues we are aware of in the config.toml file that bootstrap uses. These patterns can be expressed as regex.
We want bootstrap to retry cargo invocations up to two times if stderr matches one of the listed patterns.

This would help reduce the failure rate of our CI because it would significantly reduce the percentage of jobs failing due to spurious errors.

The error messages need to be precise enough to avoid retrying cargo invocations over genuine problems.

Known error patterns can be found here. Not all of them can be listed.

As a start, we could just have 1 stderr string in the list (this one doesn't need to be a regex):

ranlib.exe: could not create temporary file whilst writing archive: no more archived files which is discussed in CI: spurious error building LLVM on mingw: ranlib.exe: could not create temporary file whilst writing archive: no more archived files #108227

Questions

Is config.toml the right place to put the known stderr patterns? In Zulip, Jieyou proposed introducing another file: retry-patterns.toml. I'll leave it to the bootstrap team to decide.
Which format do we use to write the stderr patterns in the config.toml file? For example, it can be an array of strings. It could also be an "object" if we want to customize how many times to retry per error message. I'll leave it to the boostrap team to decide.
how do we make sure these patterns are present in the config.toml used for CI? I'm not familiar with how the config.toml for the CI is generated.

Zulip links

idea proposed here
agreement reached here

The text was updated successfully, but these errors were encountered:

jieyouxu · 2024-12-18T15:23:18Z

Is config.toml the right place to put the known stderr patterns? In Zulip, Jieyou proposed introducing another file: retry-patterns.toml. I'll leave it to the bootstrap team to decide.

FTR config.toml is usually for local user config (CI sometimes writes their own config.toml) and that is specifically .gitignore'd, but I haven't thought too hard about the pros and cons of including it in the same config file.

jieyouxu · 2024-12-18T15:24:59Z

Notably, if we decide to pursue this approach, we need to be precise enough in the error patterns to not retry genuine failures.

marcoieni · 2024-12-18T15:27:12Z

Ok. Let us know when the bootstrap team is able to answer the questions listed, so that somebody can proceed with the implementation.
I could have some capacity to work on this in the week of 6 Jan for example 👍

jieyouxu · 2024-12-18T15:34:32Z

cc @rust-lang/bootstrap: what do we think about this?

Kobzol · 2024-12-18T16:03:35Z

This sounds reasonable, it would definitely be useful.

Note that implementing this directly in bootstrap might be tricky. We sometimes don't capture stderr of cargo (in fact I think that in most situations we don't capture it?), but rather redirect it to stderr of the bootstrap. Furthermore, we would probably need to do this on a per-Cargo invocation, and we invoke Cargo on many places. So to avoid doing these restarts on a lot of source locations inside bootstrap, which would be quite crazy, it would be better to somehow thread it through this, which would ideally do the restart automatically inside, without the rest of bootstrap having to think about it.

I wouldn't be opposed against doing this outside of bootstrap, tbh, by just wrapping the top-level run build CI step in a loop that would do rm -rf build && run.sh and then restart the loop N times if the generated stderr contains the invalid substrings. More specifically, I'm not sure if I agree with this comment, given the state of bootstrap's codebase 😆 But this has to be tried out.

Anyway, if we do it inside bootstrap, we can either put it into config.toml or add a CLI flag, that's probably not super important. We can modify config.toml on CI using the --set argument, we already do that on several places.

Btw, it looks like the linked issue is not a spurious failure, but rather a known issue (with a known fix). A better candidate might be the MSVC locking failures.

onur-ozkan · 2024-12-18T18:38:02Z

This could be useful only for CI, right? I think this should be done outside of the bootstrap using a wrapper script that handles this magic. It doesn't seem necessary to include it in the bootstrap as it's unrelated to its core purpose.

Also, how do we plan to handle error reports? If we silently skip these errors, we might forget to fix them which could increase build time for certain runners.

clubby789 · 2024-12-18T21:43:28Z

What if we just have the existing failure/RLA messages with a note that the step is being retried?

onur-ozkan · 2024-12-19T07:22:14Z

What if we just have the existing failure/RLA messages with a note that the step is being retried?

Yeah, that should be enough.

marcoieni · 2024-12-19T10:06:52Z

If we do this outside of bootstrap can we retry at the same level of granularity?

E.g. CI, could retry ./x.py --stage 2 test, but that would take a lot of time.

Instead, bootstrap knows the cargo command that failed, so it can retry only that command, which is faster, right?

jieyouxu · 2024-12-19T10:11:41Z

If we do this outside of bootstrap can we retry at the same level of granularity?

E.g. CI, could retry ./x.py --stage 2 test, but that would take a lot of time.

Instead, bootstrap knows the cargo command that failed, so it can retry only that command, which is faster, right?

If we don't invalidate the build cache (e.g. do it in the same job invocation), then bootstrap (via cargo) does not need to rebuild things that have already successfully built (modulo incompatible-rustflags bugs) (at the crate-level granularity I believe).

marcoieni · 2024-12-19T10:15:28Z

Right! So the cost is not 2x for sure. But we would need to re-run all the other tests that already passed, right? We could encounter the same error in another test that previously passed potentially. 🤔

Of course the final decision is yours btw and retrying outside of bootstrap might already be a good starting point. I'm just giving my 2 cents 👍

Kobzol · 2024-12-19T10:25:43Z

It will be probably quite faster due to the cache already being on disk, and also some tests will probably get skipped if the compiler didn't get rebuilt.

But even if it took the full CI time again, I don't think that's too bad. The important thing is that it would be automated, so people don't have to retry all the time, and that it would save CI resources, because only a single job would be restarted.

What might be problematic regarding longer CI time is the timeout, which is currently probably too short for a retry mechanism that reruns the full CI.

jieyouxu · 2024-12-19T10:27:00Z

But we would need to re-run all the other tests that already passed, right?

Not quite either! For compiletest-managed tests under tests/*, if the previous results were successful (or ignored), then the test will not be rerun if itself and its dependencies have not changed (modulo test dependency check bugs). Let's call this "lazy-rerun" property (I don't recall what a better term is on top of my head).

I believe other cargo test-managed tests (crate tests, tool tests) have similar lazy-rerun properties.

(Of course, they can be forced to rerun via --force-rerun.)

We could encounter the same error in another test that previously passed potentially.

That's indicative of a flaky test, which is going to be problematic anyway (in the previous lazy-rerun scheme, this violates the assumption that a test is expected to be reproducible).

Of course the final decision is yours btw and retrying outside of bootstrap might already be a good starting point. I'm just giving my 2 cents

@marcoieni to be clear, I'm not against giving it a shot inside bootstrap. I have not tried to implement this myself, so you may find it to be more tricky than expected / run into other issues that we have not and could not have predicted at this present time. You may also want to also give the non-bootstrap approach a rough experiment -- to compare which approach seems more maintainable/reliable. FWIW, I'm happy to review and discuss a spurious build retry implementation if you want to give it a shot. Without any concrete experiments, all we can do is roughly speculate what problems either approach may have.

marcoieni · 2024-12-19T10:34:50Z

Ok, thanks everyone!

It seems that the safest approach for now is experimenting in CI retrying ./x.py test if we see some known patterns 👍

Than we could move this mechanism to bootstrap itself if we think that there are some major benefits.

marcoieni added A-bootstrap-config Area: bootstrap `config.toml` and the config system T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Dec 18, 2024

rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 18, 2024

jieyouxu added E-hard Call for participation: Hard difficulty. Experience needed to fix: A lot. C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Dec 18, 2024

onur-ozkan added T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. A-CI Area: Our Github Actions CI labels Dec 19, 2024

marcoieni mentioned this issue Jan 14, 2025

Move CI large runners to free runners rust-lang/infra-team#189

Closed

10 tasks

jieyouxu mentioned this issue Jan 6, 2025

Meta tracking issue for spurious CI failures #133959

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bootstrap: retry `cargo` invocations if stderr contains a known pattern #134472

bootstrap: retry `cargo` invocations if stderr contains a known pattern #134472

marcoieni commented Dec 18, 2024

jieyouxu commented Dec 18, 2024

Uh oh!

jieyouxu commented Dec 18, 2024

Uh oh!

marcoieni commented Dec 18, 2024

Uh oh!

jieyouxu commented Dec 18, 2024

Uh oh!

Kobzol commented Dec 18, 2024

Uh oh!

onur-ozkan commented Dec 18, 2024

Uh oh!

clubby789 commented Dec 18, 2024

Uh oh!

onur-ozkan commented Dec 19, 2024

Uh oh!

marcoieni commented Dec 19, 2024

Uh oh!

jieyouxu commented Dec 19, 2024 •

edited

Loading

Uh oh!

marcoieni commented Dec 19, 2024

Uh oh!

Kobzol commented Dec 19, 2024

Uh oh!

jieyouxu commented Dec 19, 2024 •

edited

Loading

Uh oh!

marcoieni commented Dec 19, 2024

Uh oh!

bootstrap: retry cargo invocations if stderr contains a known pattern #134472

bootstrap: retry cargo invocations if stderr contains a known pattern #134472

Comments

marcoieni commented Dec 18, 2024

Why

Feature

Questions

Zulip links

jieyouxu commented Dec 18, 2024

Uh oh!

jieyouxu commented Dec 18, 2024

Uh oh!

marcoieni commented Dec 18, 2024

Uh oh!

jieyouxu commented Dec 18, 2024

Uh oh!

Kobzol commented Dec 18, 2024

Uh oh!

onur-ozkan commented Dec 18, 2024

Uh oh!

clubby789 commented Dec 18, 2024

Uh oh!

onur-ozkan commented Dec 19, 2024

Uh oh!

marcoieni commented Dec 19, 2024

Uh oh!

jieyouxu commented Dec 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcoieni commented Dec 19, 2024

Uh oh!

Kobzol commented Dec 19, 2024

Uh oh!

jieyouxu commented Dec 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcoieni commented Dec 19, 2024

Uh oh!

bootstrap: retry `cargo` invocations if stderr contains a known pattern #134472

bootstrap: retry `cargo` invocations if stderr contains a known pattern #134472

jieyouxu commented Dec 19, 2024 •

edited

Loading

jieyouxu commented Dec 19, 2024 •

edited

Loading