ref(server): Make health_check fully async #3567
Conversation
Since the status is now just read from a `RwLock`/`watch`, we can do the same thing we do for global configs and give the endpoint a `HealthCheckHandle`, which can read the status directly from the `watch`. That makes checking the health check actually sync, without any message queues.
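As a rough sketch of that shape, assuming the status is published through a `tokio::sync::watch` channel (the `Status` enum and the wiring here are illustrative, not the actual Relay types):

```rust
use std::time::Duration;
use tokio::sync::watch;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Healthy,
    Unhealthy,
}

/// Cheap, cloneable handle the HTTP endpoint can hold on to.
#[derive(Clone)]
struct HealthCheckHandle {
    rx: watch::Receiver<Status>,
}

impl HealthCheckHandle {
    /// Synchronous read of the latest status; no message queue involved.
    fn status(&self) -> Status {
        *self.rx.borrow()
    }
}

fn spawn_health_check() -> HealthCheckHandle {
    let (tx, rx) = watch::channel(Status::Unhealthy);
    tokio::spawn(async move {
        loop {
            // Placeholder for the real probes; publish whatever they yield.
            if tx.send(Status::Healthy).is_err() {
                break; // all handles dropped
            }
            tokio::time::sleep(Duration::from_secs(5)).await;
        }
    });
    HealthCheckHandle { rx }
}
```

The endpoint handler can then call `handle.status()` without awaiting anything.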
```rust
Memory {
    used: self.system.used_memory(),
    total: self.system.total_memory(),
```

```rust
tokio::time::sleep(check_interval).await;
```
Note that in many other places we instead create a `tokio::time::interval` and then use `tokio::select!` with the shutdown handle and the interval. I opted for a different solution here, as we don't need to stick to strict intervals and this version ended up as less code. The alternative would look like this:
```rust
let mut ticker = tokio::time::interval(check_interval);
let mut shutdown = Controller::shutdown_handle();

loop {
    let _ = update_tx.send(StatusUpdate::new(relay_statsd::metric!(
        timer(RelayTimers::HealthCheckDuration),
        type = "readiness",
        { self.check_readiness().await }
    )));

    tokio::select! {
        biased;

        _ = ticker.tick() => (),
        _ = shutdown.notified() => break,
    }
}
```
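For a sense of the trade-off, here is a standalone toy (not Relay code; the 30 ms "probe" and the periods are made up) contrasting the two cadences: `interval` aims at fixed tick times, while a plain `sleep` drifts by however long the work took:

```rust
use std::time::Duration;
use tokio::time::{interval, sleep, Instant};

#[tokio::main]
async fn main() {
    let period = Duration::from_millis(100);
    let work = Duration::from_millis(30); // stand-in for a health probe
    let start = Instant::now();

    // interval-based: ticks target start + n * period, so iterations stay
    // ~100 ms apart as long as the work fits within the period.
    let mut ticker = interval(period);
    for _ in 0..3 {
        ticker.tick().await;
        sleep(work).await;
        println!("interval iteration at {:?}", start.elapsed());
    }

    // sleep-based: each iteration waits a full period *after* the work,
    // so iterations drift to ~130 ms apart. Fine when a strict cadence
    // doesn't matter, and it needs less state.
    for _ in 0..3 {
        sleep(work).await;
        sleep(period).await;
        println!("sleep iteration at {:?}", start.elapsed());
    }
}
```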
With the `join!`, if one health check times out, other unrelated health checks will be delayed as well, IIUC. Unless the timeout is always lower than the check interval.

We could make this a `sleep_until(next_check_time)` to be a little more robust against timeouts spilling over the check interval.
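A minimal sketch of that suggestion (`run_checks` is a hypothetical stand-in for the joined probes, not a function from the PR): anchoring each deadline with `sleep_until` means a round that overruns only eats into its own gap instead of shifting every later round:

```rust
use std::time::Duration;
use tokio::time::{sleep_until, Instant};

async fn check_loop(check_interval: Duration) {
    let mut next_check_time = Instant::now() + check_interval;
    loop {
        run_checks().await; // may overrun its timeout
        // Returns immediately if the deadline has already passed, so an
        // overrun delays at most the current round, not the cadence itself.
        sleep_until(next_check_time).await;
        next_check_time += check_interval;
    }
}

async fn run_checks() {
    // Hypothetical stand-in for the joined health probes.
}
```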
We can use the pattern from other places if you prefer, but please check #3567 (comment). The validity timeout is the interval plus the poll timeout, so we'd never time out prematurely.
Moves actual health checks into a continuous background process so that
calls to the health check service always succeed instantly. If the
background status has not been updated for longer than a threshold, the
health check is assumed unhealthy.
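As a sketch of that staleness rule (the names and shapes here are assumptions, not the PR's actual types), the endpoint-facing check stays instant and purely local:

```rust
use std::time::{Duration, Instant};

struct TimedStatus {
    healthy: bool,
    updated_at: Instant,
}

impl TimedStatus {
    /// The background loop refreshes this after every probe round.
    fn update(&mut self, healthy: bool) {
        self.healthy = healthy;
        self.updated_at = Instant::now();
    }

    /// Instant, non-blocking check used by the endpoint. `validity` would
    /// be the check interval plus the probe timeout, so a slow-but-running
    /// background loop never trips it prematurely.
    fn is_healthy(&self, validity: Duration) -> bool {
        self.healthy && self.updated_at.elapsed() <= validity
    }
}
```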
This PR keeps the individual timeouts around the four probes (where only three of them can actually time out). In a follow-up, we can consider splitting each of the probes down further and keeping separate states.