ref(server): Make health_check fully async #3567

jan-auer · 2024-05-07T15:34:05Z

Moves actual health checks into a continuous background process so that
calls to the health check service always succeed instantly. If the
background status has not been updated for longer than a threshold, the
health check is assumed unhealthy.

This PR keeps the individual timeouts around the four probes (only three
of which can actually time out). In a follow-up, we can consider splitting
the probes up further and keeping a separate state for each.

relay-server/src/services/health_check.rs

Dav1dde

Since the status is now just read from a RwLock/watch, we can do the same as we do for global configs: give the endpoint a HealthCheckHandle that reads the status directly from the watch. That makes checking the health status actually sync, without any message queues.
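
A minimal sketch of the handle idea suggested above, for illustration only. It uses `std::sync::RwLock` and a plain thread in place of the tokio watch channel so that it stays self-contained. `Status`, `spawn_checker`, and the fields of `HealthCheckHandle` are hypothetical and do not come from the PR.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

/// Hypothetical status type; the real service may track more detail.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Healthy,
    Unhealthy,
}

/// Hypothetical stand-in for the suggested `HealthCheckHandle`.
/// All clones share the same status slot, so cloning is cheap.
#[derive(Clone)]
struct HealthCheckHandle {
    status: Arc<RwLock<Status>>,
}

impl HealthCheckHandle {
    /// Reads the latest status synchronously; no message is sent to a service.
    fn status(&self) -> Status {
        *self.status.read().unwrap()
    }
}

/// Runs the probe once, then keeps refreshing it on a background thread.
/// Assumption: a plain thread replaces the async background task here.
fn spawn_checker(probe: fn() -> Status, interval: Duration) -> HealthCheckHandle {
    // The first probe runs inline so that readers never see an unset status.
    let status = Arc::new(RwLock::new(probe()));
    let writer = Arc::clone(&status);

    thread::spawn(move || loop {
        thread::sleep(interval);
        *writer.write().unwrap() = probe();
    });

    HealthCheckHandle { status }
}
```

An endpoint would hold a clone of the handle and call `handle.status()` on each request, which only takes a read lock and never waits for a probe to finish.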

jan-auer · 2024-05-07T17:17:04Z

relay-server/src/services/health_check.rs

-            Memory {
-                used: self.system.used_memory(),
-                total: self.system.total_memory(),
+                tokio::time::sleep(check_interval).await;


Note that in many other places we instead create a tokio::time::interval and then use tokio::select! with the shutdown handle and the interval. I opted for a different solution here because we don't need to stick to strict intervals, and this version ended up being less code. The alternative would look like this:

    let mut ticker = tokio::time::interval(check_interval);
    let mut shutdown = Controller::shutdown_handle();

    loop {
        let _ = update_tx.send(StatusUpdate::new(relay_statsd::metric!(
            timer(RelayTimers::HealthCheckDuration),
            type = "readiness",
            { self.check_readiness().await }
        )));

        tokio::select! {
            biased;

            _ = ticker.tick() => (),
            _ = shutdown.notified() => break,
        }
    }

jjbayer · 2024-05-08T07:17:48Z

relay-server/src/services/health_check.rs

With the join!, if one health check times out, other unrelated health checks will be delayed as well IIUC, unless the timeout is always lower than the check interval.

We could make this a sleep_until(next_check_time) to be a little more robust against timeouts spilling over the check interval.
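
One way to read the `sleep_until(next_check_time)` suggestion, as a rough sketch: compute the next deadline from the start of the previous check instead of sleeping a fixed interval after it finishes. The helpers `next_check_time` and `remaining_sleep` are hypothetical names, and the sketch uses only `std::time` so that the arithmetic can be checked without a runtime.

```rust
use std::time::{Duration, Instant};

/// Deadline for the next check, measured from the start of the previous one.
/// Hypothetical helper, not taken from the PR.
fn next_check_time(previous_start: Instant, interval: Duration) -> Instant {
    previous_start + interval
}

/// Time left to sleep until `deadline`, or zero if the check overran it.
fn remaining_sleep(deadline: Instant, now: Instant) -> Duration {
    deadline.saturating_duration_since(now)
}
```

With a 10 second interval, a check that takes 3 seconds leaves 7 seconds of sleep, while a check that takes 12 seconds leaves none and the next check starts immediately. A fixed `sleep(check_interval)` after each check would instead add the full interval on top of the check duration.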

We can use the pattern from other places if you prefer, but please check #3567 (comment)

The validity timeout is the interval + poll timeout, so we'd never time out prematurely.
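
The reasoning can be sketched as follows, under stated assumptions: two consecutive status updates are at most one sleep interval plus one probe timeout apart, so a status younger than that sum is still fresh. The fields of `StatusUpdate` and the functions `validity_timeout` and `is_healthy` are hypothetical and only illustrate the arithmetic.

```rust
use std::time::{Duration, Instant};

/// Hypothetical snapshot of the last background check result.
struct StatusUpdate {
    healthy: bool,
    checked_at: Instant,
}

/// Validity window as described above: check interval plus probe timeout.
fn validity_timeout(check_interval: Duration, probe_timeout: Duration) -> Duration {
    check_interval + probe_timeout
}

/// A status counts as healthy only if the check passed and is not stale.
fn is_healthy(update: &StatusUpdate, validity: Duration, now: Instant) -> bool {
    let age = now.saturating_duration_since(update.checked_at);
    update.healthy && age <= validity
}
```

For example, with a 3 second interval and a 1 second probe timeout, a passing status is trusted for up to 4 seconds. A slow but successful check therefore cannot make the endpoint report unhealthy on its own.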

* master:
  ref(aws): Remove the aws extension (#3568)
  ref(metrics): Change MetricHour data category to MetricSecond (#3558)
  chore(self-hosted): Mark e2e test check as required (#3557)
  feat(spans): Extracts messaging.message.id for queue spans (#3556)

jan-auer added 3 commits May 7, 2024 17:32

ref(server): Make health_check fully async

52b3fe0

fix: Add safety margin to timeouts

df6c1b6

ref: Remove indirection

0f21a04

jan-auer commented May 7, 2024

jan-auer added 4 commits May 7, 2024 17:46

ref: Simplify

6f56e5f

fix: Reintroduce probing for sys mem

ed223de

meta: Changelog

23c9245

fix: Lint

14708f4

jan-auer self-assigned this May 7, 2024

jan-auer marked this pull request as ready for review May 7, 2024 15:55

jan-auer requested a review from a team as a code owner May 7, 2024 15:55

Dav1dde reviewed May 7, 2024

jan-auer added 4 commits May 7, 2024 18:45

fix: Undo unintended change

e5d7d43

ref: Check network outages concurrently

7f643da

ref: Treat shutdown differently

1d9c513

ref: Simplify

322cbc8

jan-auer commented May 7, 2024

ref: Rename refresh interval option

ac5fccc

jan-auer commented May 8, 2024

Dav1dde approved these changes May 8, 2024

jjbayer approved these changes May 8, 2024

jan-auer added 6 commits May 8, 2024 09:36

ref: Review comments

2bcf7d6

fix: Tests

e0c233b

fix: Tests

2f6acc3

fix: Tests for real

d4f64c5

Merge branch 'master' into ref/healthcheck-async

3ce7f33

jan-auer merged commit 2c65db4 into master May 13, 2024
22 checks passed

jan-auer deleted the ref/healthcheck-async branch May 13, 2024 07:22
