Workaround docker is not running bug #590
Conversation
Locally, the failure seems to be due to OOM (~54 GB compilation on stable); it seems to be fixed on nightly, though not on beta.
I think I know why there's not more information: I was looking at the crater server, but I should've been looking at agent logs. I'll try to track down the cause of the failure and actually fix it.
@bors r=pietroalbini
📌 Commit 2b49049 has been approved by pietroalbini
The stable compiler currently allocates too much memory (56+ GB). Nightly (1.59) is fixed with regard to this issue; 1.56.1 seems to have the same problem.
This spins indefinitely if docker is down, preventing total experiment failure in that case. For details on why this strategy was chosen, see the comment added.
@bors r=pietroalbini
📌 Commit 0abcb8c has been approved by pietroalbini
💔 Test failed - checks-actions
@bors r+
📌 Commit 139d094 has been approved by
💔 Test failed - checks-actions
@bors r+
📌 Commit 0818083 has been approved by
☀️ Test successful - checks-actions
Currently, if an individual agent reports an error during execution (e.g., docker is not running, or one of its worker threads ended with an error), that job is marked as failed in its entirety. Since we currently have a transient agent (crater-gcp-1), which is sometimes down because it runs on a spot instance, it can be hard to complete a Crater job when the GCP-1 instance kills jobs midway through.

It should be noted that in theory these errors shouldn't happen in the first place. In practice, "docker is not running" appears to be the primary cause of failure, and it is relatively hard to investigate: logs for the relevant time period appear to be absent. This PR restructures the code that detects docker's absence so that it spins until docker *is* up. Re-organizing the crater code to handle worker failure well, likely by re-assigning jobs to a live worker, looks considerably more difficult, though that is likely the better long-term solution.
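As a rough illustration of the retry-instead-of-fail shape described above, here is a minimal sketch. The `docker_running` and `wait_for_docker` helpers and the `docker info` probe are assumptions for this example, not crater's actual detection or logging code:

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

/// Returns true if the docker daemon answers a trivial command.
/// (Hypothetical helper; crater's real detection logic may differ.)
fn docker_running() -> bool {
    Command::new("docker")
        .args(["info", "--format", "{{.ServerVersion}}"])
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

/// Spin until docker is up instead of reporting a fatal error.
/// A stalled agent can resume once docker returns, whereas an agent
/// that reports a fatal error currently fails the whole experiment.
fn wait_for_docker() {
    while !docker_running() {
        eprintln!("docker does not appear to be running; retrying in 30s");
        sleep(Duration::from_secs(30));
    }
}
```

The interval and logging here are placeholders; the point is only that the agent blocks and retries rather than propagating a job-killing error.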