-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
bootstrap: retry cargo
invocations if stderr contains a known pattern
#134472
Comments
FTR |
Notably, if we decide to pursue this approach, we need to be precise enough in the error patterns to not retry genuine failures. |
Ok. Let us know when the bootstrap team is able to answer the questions listed, so that somebody can proceed with the implementation. |
cc @rust-lang/bootstrap: what do we think about this? |
This sounds reasonable, it would definitely be useful. Note that implementing this directly in bootstrap might be tricky. We sometimes don't capture stderr of cargo (in fact I think that in most situations we don't capture it?), but rather redirect it to stderr of the bootstrap. Furthermore, we would probably need to do this on a per-Cargo invocation, and we invoke Cargo on many places. So to avoid doing these restarts on a lot of source locations inside bootstrap, which would be quite crazy, it would be better to somehow thread it through this, which would ideally do the restart automatically inside, without the rest of bootstrap having to think about it. I wouldn't be opposed against doing this outside of bootstrap, tbh, by just wrapping the top-level Anyway, if we do it inside bootstrap, we can either put it into Btw, it looks like the linked issue is not a spurious failure, but rather a known issue (with a known fix). A better candidate might be the MSVC locking failures. |
This could be useful only for CI, right? I think this should be done outside of the bootstrap using a wrapper script that handles this magic. It doesn't seem necessary to include it in the bootstrap as it's unrelated to its core purpose. Also, how do we plan to handle error reports? If we silently skip these errors, we might forget to fix them which could increase build time for certain runners. |
What if we just have the existing failure/RLA messages with a note that the step is being retried? |
Yeah, that should be enough. |
If we do this outside of bootstrap can we retry at the same level of granularity? E.g. CI, could retry Instead, bootstrap knows the cargo command that failed, so it can retry only that command, which is faster, right? |
If we don't invalidate the build cache (e.g. do it in the same job invocation), then bootstrap (via cargo) does not need to rebuild things that have already successfully built (modulo incompatible-rustflags bugs) (at the crate-level granularity I believe). |
Right! So the cost is not 2x for sure. But we would need to re-run all the other tests that already passed, right? We could encounter the same error in another test that previously passed potentially. 🤔 Of course the final decision is yours btw and retrying outside of bootstrap might already be a good starting point. I'm just giving my 2 cents 👍 |
It will be probably quite faster due to the cache already being on disk, and also some tests will probably get skipped if the compiler didn't get rebuilt. But even if it took the full CI time again, I don't think that's too bad. The important thing is that it would be automated, so people don't have to retry all the time, and that it would save CI resources, because only a single job would be restarted. What might be problematic regarding longer CI time is the timeout, which is currently probably too short for a retry mechanism that reruns the full CI. |
Not quite either! For compiletest-managed tests under I believe other (Of course, they can be forced to rerun via
That's indicative of a flaky test, which is going to be problematic anyway (in the previous lazy-rerun scheme, this violates the assumption that a test is expected to be reproducible).
@marcoieni to be clear, I'm not against giving it a shot inside bootstrap. I have not tried to implement this myself, so you may find it to be more tricky than expected / run into other issues that we have not and could not have predicted at this present time. You may also want to also give the non-bootstrap approach a rough experiment -- to compare which approach seems more maintainable/reliable. FWIW, I'm happy to review and discuss a spurious build retry implementation if you want to give it a shot. Without any concrete experiments, all we can do is roughly speculate what problems either approach may have. |
Ok, thanks everyone! It seems that the safest approach for now is experimenting in CI retrying Than we could move this mechanism to bootstrap itself if we think that there are some major benefits. |
Why
Our CI auto builds sometimes fail for known reasons that are not related to the PRs we are trying to merge.
Most of the times, these errors are hard to understand and fix (or can't be fixed at all), decreasing the success rate of the auto builds for several weeks or months.
The impact is that we loose days of parallel compute time and hours of maintainers time that need to analyze the error message and reschedule the PRs in the merge queue.
Feature
We want to list the the stderr of the known issues we are aware of in the
config.toml
file that bootstrap uses. These patterns can be expressed as regex.We want
bootstrap
to retry cargo invocations up to two times if stderr matches one of the listed patterns.This would help reduce the failure rate of our CI because it would significantly reduce the percentage of jobs failing due to spurious errors.
The error messages need to be precise enough to avoid retrying cargo invocations over genuine problems.
Known error patterns can be found here. Not all of them can be listed.
As a start, we could just have 1 stderr string in the list (this one doesn't need to be a regex):
ranlib.exe: could not create temporary file whilst writing archive: no more archived files
which is discussed in CI: spurious error building LLVM on mingw:ranlib.exe: could not create temporary file whilst writing archive: no more archived files
#108227Questions
config.toml
the right place to put the known stderr patterns? In Zulip, Jieyou proposed introducing another file:retry-patterns.toml
. I'll leave it to the bootstrap team to decide.config.toml
file? For example, it can be an array of strings. It could also be an "object" if we want to customize how many times to retry per error message. I'll leave it to the boostrap team to decide.config.toml
used for CI? I'm not familiar with how theconfig.toml
for the CI is generated.Zulip links
The text was updated successfully, but these errors were encountered: