
Add emergency server write rate limit to allow controlled recovery from replication failure #9624

Open
banks opened this issue Jan 22, 2021 · 1 comment
banks (Member) commented Jan 22, 2021

Background

This is a follow-up to several incidents where the failure described in #9609 was the root cause.

Proposal

In the outage situation described in the linked ticket above, there is currently no easy way in Consul to reduce the write load on the servers without external coordination, such as shutting down consumers or reconfiguring rate limits on all clients.

It would be incredibly useful to have a "break glass" emergency setting that allows operators to introduce a rate limit on accepting new writes into raft. To be effective, this must be reloadable without restarting the leader.

We have other investigations underway into the best way to implement automatic backpressure that would improve Consul server stability in general. There are numerous factors to take into account there that make it a much more complex problem to research and design a solution for, so having this simple mechanism seems important: it provides a way to recover without full downtime if this situation does occur. This proposal is not a replacement for that more general work, but it is a quicker path to improving recovery options for operators during incidents.

The idea would be to have a hot-reloadable rate.Limit that could be configured on the leader to error any call to raftApply with a "try again later" error, equivalent to an HTTP 429. For requests that originate from client HTTP requests, we should ensure that our standard HTTP handler converts those errors into a 429. We already have one example of this in the Connect CA certificate signing RPC endpoint, which has a similar back-pressure mechanism.
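As a rough sketch of how that could look (the names writeLimiter, errRateLimited, the simplified raftApply signature, and the HTTP helper below are illustrative assumptions, not existing Consul identifiers):

package ratelimitsketch

import (
    "errors"
    "net/http"

    "golang.org/x/time/rate"
)

// Illustrative sentinel error; the HTTP layer maps it to a 429.
var errRateLimited = errors.New("rate limit reached, try again later")

// Hypothetical hot-reloadable limiter guarding raft writes. rate.Inf means
// "no limit", so the default behaves exactly as today.
var writeLimiter = rate.NewLimiter(rate.Inf, 1)

// raftApply sketch: reject the write before it ever reaches raft.
func raftApply(msg []byte) ([]byte, error) {
    if !writeLimiter.Allow() {
        return nil, errRateLimited
    }
    // ... existing path that proposes msg to raft and waits for the FSM ...
    return nil, nil
}

// writeHTTPError sketch: the standard HTTP handler would translate the
// sentinel error into 429 Too Many Requests.
func writeHTTPError(w http.ResponseWriter, err error) {
    if errors.Is(err, errRateLimited) {
        http.Error(w, err.Error(), http.StatusTooManyRequests)
        return
    }
    http.Error(w, err.Error(), http.StatusInternalServerError)
}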

In use, we'd accept that some writes would fail and some clients would see errors, but that is often preferable to the only other option available today: shutting down the whole cluster to recover.

Smoothing

We could wait for a short period before returning the error when the rate limit is exceeded, which handles small spikes above the limit a little more fairly. See:

// Wait up to the small threshold we allow for a token.
ctx, cancel := context.WithTimeout(context.Background(), csrLimitWait)
defer cancel()
if lim.Wait(ctx) != nil {
    return nil, ErrRateLimited
}

More info also in:

// No limiter yet, or limit changed in CA config, reconfigure a new limiter.
// We use burst of 1 for a hard limit. Note that either bursting or waiting is
// necessary to get expected behavior in the face of random arrival times, but we
// don't need both and we use Wait with a small delay to smooth noise. See
// https://github.com/banks/sim-rate-limit-backoff/blob/master/README.md.
l.csrRateLimiter = rate.NewLimiter(limit, 1)

Possible complications

There may be internal code paths within Consul that write to raft and would not tolerate this error nicely. For example, internal leader goroutines that modify state might currently treat a failure to write to raft as fatal and force a leader stepdown or similar. We'd need to check that carefully and possibly add an internal flag that forces internal writes to bypass the limit.

Alternatively, we could enforce the limit higher up, only for server RPCs, so that internal writes always work.
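A minimal sketch of the bypass-flag option, extending the earlier sketch (applyOptions and raftApplyWithOptions are hypothetical names, not existing Consul code):

type applyOptions struct {
    // internal marks writes issued by leader-internal goroutines that must
    // not fail because of the emergency rate limit.
    internal bool
}

func raftApplyWithOptions(msg []byte, opts applyOptions) ([]byte, error) {
    if !opts.internal && !writeLimiter.Allow() {
        return nil, errRateLimited
    }
    // ... existing apply path ...
    return nil, nil
}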

Proposed Use

We should document the intended usage. The limit can be used to relieve write load on the cluster without taking it offline by capping overall write throughput. Operators could slowly lower the limit until write throughput is low enough to allow unhealthy followers to catch up. If this causes a significant error rate in downstream services, the impact can be minimised by only applying the rate limit for the time it takes a restarted follower to catch up. Each follower can then be restarted in turn, with the rate dropped only for the time needed for it to catch up. After a few such restarts, the increased raft_trailing_logs should take effect on one of the restarted servers and remove the need for further rate limiting or errors.
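To illustrate the hot-reload side of that workflow (the reload hook and its parameter name are assumptions; only rate.Limiter.SetLimit is real API), lowering or lifting the limit would just adjust the shared limiter in place:

// Hypothetical hook called when the leader reloads its configuration.
// A value of 0 or less means "no limit" and maps back to rate.Inf.
func reloadWriteRateLimit(writesPerSecond float64) {
    newLimit := rate.Inf
    if writesPerSecond > 0 {
        newLimit = rate.Limit(writesPerSecond)
    }
    // SetLimit is safe for concurrent use, so in-flight raftApply calls pick
    // up the new value without a leader restart.
    writeLimiter.SetLimit(newLimit)
}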

While this is not a perfect process, it's much more controlled and preferable to the unknown whole-cluster downtime that is currently required, since you can't keep a quorum of servers healthy while restarting them to change the config to one that will work.

boxofrad (Contributor) commented
We're currently pursuing the approach to rate-limiting discussed in #12593.

That said, having a way to control the overall write rate would be super useful (particularly for the "trailing logs" problem described in this issue) so we may well revisit this in the future!
