Workaround docker is not running bug #590
Conversation
Locally, the failure seems to be due to OOM (~54 GB compilation on stable); it seems to be fixed on nightly, though not on beta.
I think I know why there's not more information: I was looking at the crater server, but I should've been looking at agent logs. I'll try to track down the cause of the failure and actually fix it.
@bors r=pietroalbini
📌 Commit 2b49049 has been approved by pietroalbini
The stable compiler currently allocates too much memory (56+ GB). Nightly (1.59) is fixed with regard to this issue; 1.56.1 seems to have the same problem.
This spins indefinitely if docker is down, preventing total experiment failure in that case. For details on why this strategy was chosen, see the comment added.
@bors r=pietroalbini
📌 Commit 0abcb8c has been approved by pietroalbini
💔 Test failed - checks-actions
@bors r+
📌 Commit 139d094 has been approved by
💔 Test failed - checks-actions
@bors r+
📌 Commit 0818083 has been approved by
☀️ Test successful - checks-actions
Currently, if an individual agent reports an error during execution (e.g., docker is not running, or one of its worker threads ended with an error), that job is marked as failed in its entirety. Since we currently have a transient agent (crater-gcp-1), which is sometimes down because it runs on a spot instance, it can be hard to complete a Crater job when the GCP-1 instance kills jobs midway through.

It should be noted that in theory these errors shouldn't happen in the first place. In practice, "docker is not running" appears to be the primary cause of failure, and it is relatively hard to investigate: logs for the relevant time period appear to be absent. This PR restructures the code that detects docker's absence so that it spins until docker *is* up. Re-organizing the crater code to handle worker failure well, likely by re-assigning jobs to a live worker, looks considerably more difficult, though that is likely the better long-term solution.
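As a rough illustration of the retry-instead-of-fail shape described above, here is a minimal sketch. The `docker_running` and `wait_for_docker` helpers and the `docker info` probe are assumptions for this example, not crater's actual detection or logging code:

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

/// Returns true if the docker daemon answers a trivial command.
/// (Hypothetical helper; crater's real detection logic may differ.)
fn docker_running() -> bool {
    Command::new("docker")
        .args(["info", "--format", "{{.ServerVersion}}"])
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

/// Spin until docker is up instead of reporting a fatal error.
/// A stalled agent can resume once docker returns, whereas an agent
/// that reports a fatal error currently fails the whole experiment.
fn wait_for_docker() {
    while !docker_running() {
        eprintln!("docker does not appear to be running; retrying in 30s");
        sleep(Duration::from_secs(30));
    }
}
```

The interval and logging here are placeholders; the point is only that the agent blocks and retries rather than propagating a job-killing error.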