Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

bootstrap: retry cargo invocations if stderr contains a known pattern #134472

Open
marcoieni opened this issue Dec 18, 2024 · 14 comments
Open

bootstrap: retry cargo invocations if stderr contains a known pattern #134472

marcoieni opened this issue Dec 18, 2024 · 14 comments
Labels
A-bootstrap-config Area: bootstrap `config.toml` and the config system A-CI Area: Our Github Actions CI C-enhancement Category: An issue proposing an enhancement or a PR with one. E-hard Call for participation: Hard difficulty. Experience needed to fix: A lot. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@marcoieni
Copy link
Member

Why

Our CI auto builds sometimes fail for known reasons that are not related to the PRs we are trying to merge.

Most of the times, these errors are hard to understand and fix (or can't be fixed at all), decreasing the success rate of the auto builds for several weeks or months.
The impact is that we loose days of parallel compute time and hours of maintainers time that need to analyze the error message and reschedule the PRs in the merge queue.

Feature

We want to list the the stderr of the known issues we are aware of in the config.toml file that bootstrap uses. These patterns can be expressed as regex.
We want bootstrap to retry cargo invocations up to two times if stderr matches one of the listed patterns.

This would help reduce the failure rate of our CI because it would significantly reduce the percentage of jobs failing due to spurious errors.

The error messages need to be precise enough to avoid retrying cargo invocations over genuine problems.

Known error patterns can be found here. Not all of them can be listed.

As a start, we could just have 1 stderr string in the list (this one doesn't need to be a regex):

Questions

  • Is config.toml the right place to put the known stderr patterns? In Zulip, Jieyou proposed introducing another file: retry-patterns.toml. I'll leave it to the bootstrap team to decide.
  • Which format do we use to write the stderr patterns in the config.toml file? For example, it can be an array of strings. It could also be an "object" if we want to customize how many times to retry per error message. I'll leave it to the boostrap team to decide.
  • how do we make sure these patterns are present in the config.toml used for CI? I'm not familiar with how the config.toml for the CI is generated.

Zulip links

  • idea proposed here
  • agreement reached here
@marcoieni marcoieni added A-bootstrap-config Area: bootstrap `config.toml` and the config system T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Dec 18, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 18, 2024
@jieyouxu jieyouxu added E-hard Call for participation: Hard difficulty. Experience needed to fix: A lot. C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Dec 18, 2024
@jieyouxu
Copy link
Member

Is config.toml the right place to put the known stderr patterns? In Zulip, Jieyou proposed introducing another file: retry-patterns.toml. I'll leave it to the bootstrap team to decide.

FTR config.toml is usually for local user config (CI sometimes writes their own config.toml) and that is specifically .gitignore'd, but I haven't thought too hard about the pros and cons of including it in the same config file.

@jieyouxu
Copy link
Member

Notably, if we decide to pursue this approach, we need to be precise enough in the error patterns to not retry genuine failures.

@marcoieni
Copy link
Member Author

Ok. Let us know when the bootstrap team is able to answer the questions listed, so that somebody can proceed with the implementation.
I could have some capacity to work on this in the week of 6 Jan for example 👍

@jieyouxu
Copy link
Member

cc @rust-lang/bootstrap: what do we think about this?

@Kobzol
Copy link
Contributor

Kobzol commented Dec 18, 2024

This sounds reasonable, it would definitely be useful.

Note that implementing this directly in bootstrap might be tricky. We sometimes don't capture stderr of cargo (in fact I think that in most situations we don't capture it?), but rather redirect it to stderr of the bootstrap. Furthermore, we would probably need to do this on a per-Cargo invocation, and we invoke Cargo on many places. So to avoid doing these restarts on a lot of source locations inside bootstrap, which would be quite crazy, it would be better to somehow thread it through this, which would ideally do the restart automatically inside, without the rest of bootstrap having to think about it.

I wouldn't be opposed against doing this outside of bootstrap, tbh, by just wrapping the top-level run build CI step in a loop that would do rm -rf build && run.sh and then restart the loop N times if the generated stderr contains the invalid substrings. More specifically, I'm not sure if I agree with this comment, given the state of bootstrap's codebase 😆 But this has to be tried out.

Anyway, if we do it inside bootstrap, we can either put it into config.toml or add a CLI flag, that's probably not super important. We can modify config.toml on CI using the --set argument, we already do that on several places.

Btw, it looks like the linked issue is not a spurious failure, but rather a known issue (with a known fix). A better candidate might be the MSVC locking failures.

@onur-ozkan
Copy link
Member

This could be useful only for CI, right? I think this should be done outside of the bootstrap using a wrapper script that handles this magic. It doesn't seem necessary to include it in the bootstrap as it's unrelated to its core purpose.

Also, how do we plan to handle error reports? If we silently skip these errors, we might forget to fix them which could increase build time for certain runners.

@clubby789
Copy link
Contributor

What if we just have the existing failure/RLA messages with a note that the step is being retried?

@onur-ozkan
Copy link
Member

What if we just have the existing failure/RLA messages with a note that the step is being retried?

Yeah, that should be enough.

@marcoieni
Copy link
Member Author

If we do this outside of bootstrap can we retry at the same level of granularity?

E.g. CI, could retry ./x.py --stage 2 test, but that would take a lot of time.

Instead, bootstrap knows the cargo command that failed, so it can retry only that command, which is faster, right?

@jieyouxu
Copy link
Member

jieyouxu commented Dec 19, 2024

If we do this outside of bootstrap can we retry at the same level of granularity?

E.g. CI, could retry ./x.py --stage 2 test, but that would take a lot of time.

Instead, bootstrap knows the cargo command that failed, so it can retry only that command, which is faster, right?

If we don't invalidate the build cache (e.g. do it in the same job invocation), then bootstrap (via cargo) does not need to rebuild things that have already successfully built (modulo incompatible-rustflags bugs) (at the crate-level granularity I believe).

@marcoieni
Copy link
Member Author

Right! So the cost is not 2x for sure. But we would need to re-run all the other tests that already passed, right? We could encounter the same error in another test that previously passed potentially. 🤔

Of course the final decision is yours btw and retrying outside of bootstrap might already be a good starting point. I'm just giving my 2 cents 👍

@Kobzol
Copy link
Contributor

Kobzol commented Dec 19, 2024

It will be probably quite faster due to the cache already being on disk, and also some tests will probably get skipped if the compiler didn't get rebuilt.

But even if it took the full CI time again, I don't think that's too bad. The important thing is that it would be automated, so people don't have to retry all the time, and that it would save CI resources, because only a single job would be restarted.

What might be problematic regarding longer CI time is the timeout, which is currently probably too short for a retry mechanism that reruns the full CI.

@jieyouxu
Copy link
Member

jieyouxu commented Dec 19, 2024

But we would need to re-run all the other tests that already passed, right?

Not quite either! For compiletest-managed tests under tests/*, if the previous results were successful (or ignored), then the test will not be rerun if itself and its dependencies have not changed (modulo test dependency check bugs). Let's call this "lazy-rerun" property (I don't recall what a better term is on top of my head).

I believe other cargo test-managed tests (crate tests, tool tests) have similar lazy-rerun properties.

(Of course, they can be forced to rerun via --force-rerun.)

We could encounter the same error in another test that previously passed potentially.

That's indicative of a flaky test, which is going to be problematic anyway (in the previous lazy-rerun scheme, this violates the assumption that a test is expected to be reproducible).

Of course the final decision is yours btw and retrying outside of bootstrap might already be a good starting point. I'm just giving my 2 cents

@marcoieni to be clear, I'm not against giving it a shot inside bootstrap. I have not tried to implement this myself, so you may find it to be more tricky than expected / run into other issues that we have not and could not have predicted at this present time. You may also want to also give the non-bootstrap approach a rough experiment -- to compare which approach seems more maintainable/reliable. FWIW, I'm happy to review and discuss a spurious build retry implementation if you want to give it a shot. Without any concrete experiments, all we can do is roughly speculate what problems either approach may have.

@marcoieni
Copy link
Member Author

Ok, thanks everyone!

It seems that the safest approach for now is experimenting in CI retrying ./x.py test if we see some known patterns 👍

Than we could move this mechanism to bootstrap itself if we think that there are some major benefits.

@onur-ozkan onur-ozkan added T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. A-CI Area: Our Github Actions CI labels Dec 19, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
A-bootstrap-config Area: bootstrap `config.toml` and the config system A-CI Area: Our Github Actions CI C-enhancement Category: An issue proposing an enhancement or a PR with one. E-hard Call for participation: Hard difficulty. Experience needed to fix: A lot. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants