ref(server): Make health_check fully async #3567

Merged: 18 commits on May 13, 2024

Member

@jan-auer jan-auer commented May 7, 2024

Moves actual health checks into a continuous background process so that
calls to the health check service always succeed instantly. If the
background status has not been updated for longer than a threshold, the
health check is assumed unhealthy.
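
A minimal sketch of the staleness rule described above, using only the standard library. The field names, the `validity_timeout` helper, and `is_healthy` are hypothetical and not taken from the PR; the window of check interval plus probe timeout follows what is stated later in this conversation.

```rust
use std::time::{Duration, Instant};

/// Result of the most recent background probe run (hypothetical shape).
#[derive(Clone, Copy, Debug)]
struct StatusUpdate {
    healthy: bool,
    updated_at: Instant,
}

/// Validity window: the check interval plus the probe timeout, so a slow
/// but still valid probe run is not flagged as stale.
fn validity_timeout(check_interval: Duration, probe_timeout: Duration) -> Duration {
    check_interval + probe_timeout
}

/// Healthy only if the last probe run passed and the update is still fresh.
fn is_healthy(update: StatusUpdate, now: Instant, validity: Duration) -> bool {
    update.healthy && now.saturating_duration_since(update.updated_at) <= validity
}
```

With a 5 s interval and a 2 s probe timeout, an update that is 6 s old still counts, while one that is 8 s old is treated as unhealthy.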

This PR keeps the individual timeouts around the four probes, of which only
three can actually time out. In a follow-up, we can consider splitting each
probe out further and keeping separate state per probe.

@jan-auer jan-auer self-assigned this May 7, 2024
@jan-auer jan-auer marked this pull request as ready for review May 7, 2024 15:55
@jan-auer jan-auer requested a review from a team as a code owner May 7, 2024 15:55
Member

@Dav1dde Dav1dde left a comment

Since the status is now just read from a RwLock/watch, we can do the same as we do for global configs: give the endpoint a HealthCheckHandle that reads the status directly from the watch. That makes checking the health status actually sync, without any message queues.
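
A rough sketch of what such a handle could look like, using `Arc<RwLock<..>>` from the standard library in place of a tokio watch channel. Only the name `HealthCheckHandle` comes from this comment; the `Status` enum, the `publish` and `status` methods, and the choice to start as unhealthy are assumptions made for illustration.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical status type; the real service may expose something richer.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Healthy,
    Unhealthy,
}

/// Cheaply cloneable handle that shares the latest status.
#[derive(Clone)]
struct HealthCheckHandle {
    status: Arc<RwLock<Status>>,
}

impl HealthCheckHandle {
    /// Starts as unhealthy until the first probe run publishes a result
    /// (an assumption of this sketch).
    fn new() -> Self {
        Self { status: Arc::new(RwLock::new(Status::Unhealthy)) }
    }

    /// Called by the background task after each probe run.
    fn publish(&self, status: Status) {
        *self.status.write().unwrap() = status;
    }

    /// Called by the endpoint: a synchronous read, no message queue involved.
    fn status(&self) -> Status {
        *self.status.read().unwrap()
    }
}
```

The endpoint would hold a clone of the handle and call `status()` directly, while the background task keeps its own clone for `publish()`.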

relay-server/src/services/health_check.rs
Memory {
used: self.system.used_memory(),
total: self.system.total_memory(),
tokio::time::sleep(check_interval).await;
Member Author

Note that in many other places we instead create a tokio::time::interval and then use tokio::select! with the shutdown handle and the interval. I opted for a different solution here because we don't need to stick to strict intervals and this version ended up being less code. The alternative would look like this:

let mut ticker = tokio::time::interval(check_interval);
let mut shutdown = Controller::shutdown_handle();

loop {
    let _ = update_tx.send(StatusUpdate::new(relay_statsd::metric!(
        timer(RelayTimers::HealthCheckDuration),
        type = "readiness",
        { self.check_readiness().await }
    )));

    tokio::select! {
        biased;
        _ = ticker.tick() => (),
        _ = shutdown.notified() => break,
    }
}

relay-server/src/services/health_check.rs
Memory {
used: self.system.used_memory(),
total: self.system.total_memory(),
tokio::time::sleep(check_interval).await;
Member

IIUC, with the join!, if one health check times out, the other, unrelated health checks are delayed as well, unless the timeout is always lower than the check interval.

We could make this a sleep_until(next_check_time) to be a little more robust against timeouts spilling over the check interval.
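
To illustrate the idea, here is a blocking, standard-library-only sketch of deadline-based scheduling. The names `remaining_sleep` and `run_checks` are hypothetical; in the async service the equivalent would presumably be `tokio::time::sleep_until` with a `tokio::time::Instant` deadline.

```rust
use std::time::{Duration, Instant};

/// Time left until the next scheduled check; zero if the check overran.
fn remaining_sleep(next_check_time: Instant, now: Instant) -> Duration {
    next_check_time.saturating_duration_since(now)
}

/// Runs `check` a fixed number of times, scheduling each round relative to
/// the start of the previous one rather than to its end.
fn run_checks(check_interval: Duration, rounds: usize, mut check: impl FnMut()) {
    for _ in 0..rounds {
        // Fix the deadline before running the check, so time spent in a slow
        // (or timed-out) check is subtracted from the following sleep.
        let next_check_time = Instant::now() + check_interval;
        check();
        std::thread::sleep(remaining_sleep(next_check_time, Instant::now()));
    }
}
```

With a 5 s interval, a check that takes 3 s is followed by a 2 s sleep, and a check that takes 7 s is followed by no sleep at all, so an overrun does not push later rounds back by a full interval.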

Member Author

We can use the pattern from other places if you prefer, but please check #3567 (comment)

The validity timeout is the interval plus the poll timeout, so we'd never time out prematurely.

* master:
  ref(aws): Remove the aws extension (#3568)
  ref(metrics): Change MetricHour data category to MetricSecond (#3558)
  chore(self-hosted): Mark e2e test check as required (#3557)
  feat(spans): Extracts messaging.message.id for queue spans (#3556)
@jan-auer jan-auer merged commit 2c65db4 into master May 13, 2024
22 checks passed
@jan-auer jan-auer deleted the ref/healthcheck-async branch May 13, 2024 07:22