runtime: frequent timeouts in {build,runBuilt,run}TestProg #44422
I've been looking at this problem on various systems. If I run 'go tool dist test go_test:runtime' by itself, the time varies: it sometimes times out on a power8, but never on a power9, which is faster. One mysterious result is that if I set GOMAXPROCS to a smaller value, like 2, 4, or 8, the run is fast enough on the power8 that it doesn't time out. I thought these VMs only had 2 processors each, which I assumed was equivalent to running with GOMAXPROCS=2. I can only get it to time out if the number of processors is > 150. |
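To check whether a builder really behaves as if GOMAXPROCS=2, one option is to print what the Go runtime itself sees. This is a minimal sketch, not something taken from the builders; the helper name `schedulerInfo` is made up for illustration. It relies only on `runtime.NumCPU` and on `runtime.GOMAXPROCS(0)`, which reports the current setting without changing it.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerInfo (hypothetical helper) returns the number of logical CPUs
// usable by this process and the current GOMAXPROCS value.
// Passing 0 to runtime.GOMAXPROCS queries the setting without modifying it.
func schedulerInfo() (numCPU, maxProcs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	n, p := schedulerInfo()
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", n, p)
}
```

Running this on the VM with and without GOMAXPROCS set in the environment would show whether the default really matches the assumed processor count.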
I don't know if this is related, but a while back I had a bunch of timeouts in my personal OSU power9 system. I asked them to increase it from 2 to 4 cores and 4 to 8GB RAM and now I have zero timeouts there. Maybe these builders are timing out because they have too little CPU time allocated to them in the hypervisor. @bradfitz can correct me if I'm wrong, but IIRC, the power8 builders are 2 core, 4GB RAM only. |
From golang.org/x/build/env/linux-ppc64le/osuosl/NOTES:
* go-le-bionic-1: (20 GB RAM, 50 GB disk, 10 cores, POWER9)
Linux go-le-bionic-1 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:08:54
UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
* go-le-bionic-2: (20 GB RAM, 50 GB disk, 10 cores, POWER8)
Linux go-le-bionic-2 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:08:54
UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
|
@dmitshur I think we need to increase the GO_TEST_TIMEOUT_SCALE for all the linux ppc64* builders. I can make the change but don't know how those are tested. |
Change https://golang.org/cl/300870 mentions this issue: |
This sets the GO_TEST_TIMEOUT_SCALE=2 for ppc64 builders. The runtime tests have been timing out intermittently on ppc64le power8; it has happened on ppc64 and ppc64le power9 but very rarely.

Updates golang/go#44422

Change-Id: I663f3f211a368a59e38fbff9ce43c925c6c7a209
Reviewed-on: https://go-review.googlesource.com/c/build/+/300870
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
This may be the same underlying issue as #45885. |
This is affecting at least five different builders, with many failures per month. I think that makes it a release-blocker via #11811. All of the builders need to be passing tests reliably — if that means skipping more tests on non-longtest builders, we may need to prioritize which tests to run, or perhaps move some of the slower ones out to some package that won't also be run by users as part of |
Looking at the That makes me wonder whether these timeouts are actually deadlocks — perhaps related to the ones observed in #48789? |
#49614 also shows an apparent deadlock with processes blocked in |
It's interesting to note that in all the recent Solaris failures the runtime test times out while building a test with build flags, either from |
Change https://golang.org/cl/364654 mentions this issue: |
That is a strong pattern but not universal. The stack traces for the failures listed in #49614 seem to be at a point where the test is running the compiled program (and has been for over a minute), not waiting on a build process. |
Permit a test whose program is already built to run immediately, rather than waiting for another test to complete its build.

For #44422

Change-Id: I2d1b35d055ee4c4251f4caef3b52dccc82b71a1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/364654
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
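The idea in that commit message can be sketched roughly as follows. This is not the code from CL 364654; `buildCache`, `entry`, and `get` are hypothetical names. The sketch assumes one `sync.Once` per target, with the shared mutex held only for the map lookup, so a caller whose program is already built never waits on an unrelated build.

```go
package main

import (
	"fmt"
	"sync"
)

// entry records the result of building one target (hypothetical type).
type entry struct {
	once sync.Once
	exe  string
	err  error
}

// buildCache builds each target at most once. Callers asking for different
// targets do not block each other, and callers asking for a target that is
// already built return without waiting on any other build.
type buildCache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func (c *buildCache) get(key string, build func() (string, error)) (string, error) {
	c.mu.Lock() // held only for the map lookup, never during a build
	if c.entries == nil {
		c.entries = make(map[string]*entry)
	}
	e := c.entries[key]
	if e == nil {
		e = new(entry)
		c.entries[key] = e
	}
	c.mu.Unlock()

	// Only callers that want this particular target wait here.
	e.once.Do(func() { e.exe, e.err = build() })
	return e.exe, e.err
}

func main() {
	var c buildCache
	builds := 0
	build := func() (string, error) {
		builds++
		return "/tmp/testprog.exe", nil // placeholder path
	}
	c.get("testprog", build)
	exe, err := c.get("testprog", build) // second call reuses the result
	fmt.Println(exe, err, builds)
}
```

The contrast is with a single lock held across the whole build, where a test for an already-built program queues behind whichever build happens to be running.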
After CL 364654 the time required to run the runtime tests on a Solaris system I have access to drops from 70 seconds to 51 seconds. CL 364755, if submitted, will drop it further. I don't know if timeouts are the problem here, but they may be part of it; there may also be a different problem. |
Seems like a good start, at least. We can watch the builders for a while and see whether it is empirically fixed. |
The last timeout in these functions was before the aforementioned changes, so I'm provisionally marking this as fixed.
|
Nope, still timing out occasionally:
|
In the 2021-11-22 timeout the only tests still running are two waiting for |
On a Solaris system I have access to, clearing the cache and running This is consistent with the builder failure just being slow and running out of time, especially since cmd/dist will be running other tests in parallel. |
2021-12-02T22:06:27-06dbf61/solaris-amd64-oraclerel All of these timeouts also appear to be actively compiling in Perhaps some of these slower tests should be skipped in short mode? |
Since we don't have a long builder for these, this seems like it would lose important test coverage.
This seems feasible. As you pointed out, this would give it its own timeout. It would require some nontrivial refactoring to pull out the testprog infrastructure, but we could do that. |
2022-01-06T23:39:43-042548b/solaris-amd64-oraclerel (building |
A much simpler thing we could do is to increase the timeout for the runtime test. We already do this for the cmd/go tests [1] and I don't think that's any more artificial than factoring the cgo prog tests out into their own test. [1] Or, at least, we try to. The code is weird. |
If increasing the timeout fixes the builder flakiness, I think it's a reasonable approach — Go users who run That said, if the problem is just the timeout length, should we raise the scale factors in (It would be nice if we could eliminate the |
The timeouts are from building a variant of threadprogcgo with |
Change https://golang.org/cl/379294 mentions this issue: |
Use an environment variable rather than a build tag to control starting a busy loop thread when testprogcgo starts. This lets us skip another build that invokes the C compiler and linker, which should avoid timeouts running the runtime tests.

Fixes golang#44422

Change-Id: I516668d71a373da311d844990236566ff63e6d72
Reviewed-on: https://go-review.googlesource.com/c/go/+/379294
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
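The trade-off described in that commit can be illustrated with a small sketch. It is written in Go for brevity and is not the testprogcgo change itself; the variable name `EXAMPLE_TESTPROG_BUSY_LOOP` and the helper `wantBusyLoop` are invented for this example. The point is that one binary decides at startup whether to enable the extra behavior, so no second build with a different tag is needed.

```go
package main

import (
	"fmt"
	"os"
)

// busyLoopEnv is a hypothetical variable name chosen for this sketch;
// it is not the name used by the real test program.
const busyLoopEnv = "EXAMPLE_TESTPROG_BUSY_LOOP"

// wantBusyLoop reports whether the optional behavior was requested.
// It takes the lookup function as a parameter so it is easy to test.
func wantBusyLoop(getenv func(string) string) bool {
	return getenv(busyLoopEnv) == "1"
}

func main() {
	enabled := wantBusyLoop(os.Getenv)
	if enabled {
		// One binary serves both modes: the decision is made at run
		// time instead of by compiling a second variant.
		go func() {
			for {
				// spin until the program exits
			}
		}()
	}
	fmt.Println("busy loop enabled:", enabled)
}
```

A test harness would then set the variable in the child process environment for the one test that needs it, and reuse the already-built program for everything else.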
2021-02-19T00:04:42-1c659f2/linux-ppc64le-buildlet
2021-02-19T00:04:30-b6379f1/linux-ppc64le-buildlet
2021-02-19T00:04:22-09e059a/linux-ppc64le-buildlet
2021-02-19T00:02:06-01f05d8/linux-ppc64le-buildlet
2021-02-19T00:01:17-e7ee3c1/linux-ppc64le-buildlet
2021-02-16T22:18:37-8482559/linux-ppc64le-buildlet
2020-12-15T21:45:05-731bb54/linux-ppc64le-buildlet
2020-12-15T21:04:49-129bb19/linux-ppc64le-buildlet
2020-12-15T21:01:37-685a322/linux-ppc64le-buildlet
2020-12-15T20:58:17-3d64678/linux-ppc64le-buildlet
2020-12-15T20:55:01-7cdc84a/linux-ppc64le-buildlet
2020-12-14T22:39:04-663cd86/linux-ppc64le-buildlet
2020-12-14T21:09:33-d06794d/linux-ppc64le-buildlet
2020-12-14T18:06:06-828746e/linux-ppc64le-buildlet
It's not clear to me whether the test is hung or just slow. Perhaps we should set a GO_TEST_TIMEOUT_SCALE on this builder and see if that helps?

CC @golang/release @bradfitz @laboger @ceseo