Background

This is a follow-up to several incidents where the failure described in #9609 was the root cause.
Proposal
In the outage situation described in the linked ticket above, there is currently no easy way in Consul to reduce the write load on the servers without external coordination like shutting down consumers or reconfiguring rate limits on all clients.
It would be incredibly useful to have a "break glass" emergency setting that lets operators introduce a rate limit on accepting new writes into raft. To be effective, this must be reloadable without restarting the leader.
We have other investigations underway into the best way to implement automatic backpressure, which would improve Consul server stability in general. There are numerous factors to take into account there, making it a much more complex problem to research and design for, so this simple mechanism seems important as a way to recover without full downtime if this situation does occur. This proposal is not a replacement for that more general work, but it is a quicker path to improving recovery options for operators during incidents.
The idea would be to have a hot-reloadable `rate.Limit` that could be configured on the leader to simply fail any call to `raftApply` with a "try again later" error, equivalent to an HTTP 429. For requests that originate from client HTTP requests, we should ensure that our standard HTTP handler converts those errors into a 429. We have one example of this already in the Connect CA certificate signing RPC endpoint, which has a similar back-pressure mechanism.
In use, we'd accept that some writes would fail and some clients would see errors, but this is often preferable to the only other option in this situation: shutting down the whole cluster to recover.
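As a rough illustration (not Consul's actual internals), here is a minimal sketch of the mechanism using golang.org/x/time/rate. The names `Server`, `writeLimiter`, `SetRaftWriteLimit` and `ErrRateLimited` are hypothetical stand-ins, and the `raftApply` signature is simplified:

```go
// Sketch only: a reloadable "break glass" limiter in front of raft writes.
package main

import (
	"errors"
	"fmt"

	"golang.org/x/time/rate"
)

// ErrRateLimited is the "try again later" error, analogous to HTTP 429.
var ErrRateLimited = errors.New("raft write rate limit exceeded, try again later")

// Server is a simplified stand-in for the Consul server struct.
type Server struct {
	// writeLimiter defaults to rate.Inf (no limiting) and is only lowered
	// by an operator during an incident.
	writeLimiter *rate.Limiter
}

func NewServer() *Server {
	return &Server{writeLimiter: rate.NewLimiter(rate.Inf, 1)}
}

// SetRaftWriteLimit would be driven by a hot config reload; rate.Limiter
// supports changing the limit in place, so the leader never restarts.
func (s *Server) SetRaftWriteLimit(l rate.Limit) {
	s.writeLimiter.SetLimit(l)
}

// raftApply is a simplified stand-in for the real apply path.
func (s *Server) raftApply(msg interface{}) error {
	// Fail the write up front if the emergency limit is exceeded.
	if !s.writeLimiter.Allow() {
		return ErrRateLimited
	}
	// ... hand msg off to raft as today ...
	return nil
}

func main() {
	s := NewServer()
	// Emergency: cap raft writes at roughly 100/sec without a restart.
	s.SetRaftWriteLimit(rate.Limit(100))
	fmt.Println(s.raftApply("example write"))
}
```

The token-bucket limiter's `SetLimit` is what makes the setting hot-reloadable: the limit can be changed in place on the leader without recreating anything or restarting the process.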
Smoothing
We could wait for a short period before returning the error if the rate limit is exceeded, which allows small spikes above the limit to be handled a little more fairly. See `consul/agent/consul/server_connect.go` lines 195 to 200 (at a432730); more info also in lines 64 to 69 of the same file.
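Continuing the sketch above, and only loosely modelled on the Connect CA code referenced here, the smoothing variant could use `Limiter.Wait` with a bounded context rather than failing immediately:

```go
// Smoothing sketch: absorb brief spikes above the limit by waiting a short,
// bounded time for a token before giving up with the 429-style error.
// Reuses the hypothetical Server and ErrRateLimited from the sketch above,
// plus the "context" and "time" packages.
func (s *Server) raftApplySmoothed(msg interface{}) error {
	// Allow up to ~500ms of queueing before returning the error; the exact
	// window would be a tuning choice, not something specified here.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	// Wait returns an error when a token can't be obtained before the
	// deadline, so callers still get fast feedback under sustained overload.
	if err := s.writeLimiter.Wait(ctx); err != nil {
		return ErrRateLimited
	}
	// ... proceed with the raft apply as before ...
	return nil
}
```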
Possible complications

There may be internal code paths within Consul that write to raft and might not tolerate this error nicely, for example internal leader goroutines that modify state and currently treat a failure to write to raft as fatal, forcing a leader stepdown or similar. We'd need to check that carefully and possibly add an internal flag that forces internal writes to bypass the limit.
Alternatively, we could enforce the limit only higher up, in the server RPC handlers, so internal writes would always work.
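A sketch of the bypass-flag option, again with hypothetical names, so leader-internal writes can never be rejected:

```go
// Bypass sketch: only non-internal writes are subject to the emergency
// limit, so leader-internal goroutines never see ErrRateLimited and can't
// be tripped into a stepdown by it.
func (s *Server) raftApplyWithOptions(msg interface{}, internal bool) error {
	if !internal && !s.writeLimiter.Allow() {
		return ErrRateLimited
	}
	// ... apply to raft ...
	return nil
}
```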
Proposed Use
We should document the intended usage. It can be used to relieve write load on the cluster without taking it offline by limiting the overall write throughput. Operators could slowly lower the limit until the write throughput is low enough to allow unhealthy followers to catch up. If this causes a significant error rate in downstream services, the impact can be minimised by only applying the rate limit for the time it takes to get a restarted follower caught up. Each follower can then be restarted in turn, with the rate lowered only for the time needed for it to catch up. After a few such restarts, the increased `raft_trailing_logs` should take effect on one of the restarted servers and remove the need for further rate limiting or errors.
While this is not a perfect process, it's much more controlled than, and preferable to, the unknown whole-cluster downtime that is currently required, since you can't keep a quorum of servers healthy while restarting them to change the config to one that will work.
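For completeness, a sketch of how a hot reload might apply the setting, assuming a hypothetical config field (say `raft_write_rate_limit`, in writes per second, with 0 meaning unlimited):

```go
// Reload sketch: applied whenever the agent config is hot-reloaded on the
// leader. The config field name and its zero-means-unlimited convention are
// assumptions for illustration.
func (s *Server) applyReloadedWriteLimit(writesPerSecond float64) {
	if writesPerSecond <= 0 {
		// No emergency limit configured: disable limiting entirely.
		s.writeLimiter.SetLimit(rate.Inf)
		return
	}
	// During an incident an operator lowers this value in steps until
	// followers can catch up, then raises it back (or removes it) afterwards.
	s.writeLimiter.SetLimit(rate.Limit(writesPerSecond))
}
```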
We're currently pursuing the approach to rate-limiting discussed in #12593 ⭐
That said, having a way to control the overall write rate would be super useful (particularly for the "trailing logs" problem described in this issue) so we may well revisit this in the future!