storage: Large committed raft log leads to OOM #27804
Comments
The solution looks something like this:
This should be enough to recover a cluster that has gotten into this state. A longer-term fix needs testing, and the memory limit should be passed in from the raft.Config struct.
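The patch referred to above is not reproduced in this thread. As a rough illustration only of what a configurable memory limit on committed entries could look like, here is a minimal Go sketch. `Entry`, `Config`, `MaxCommittedBytes`, and `limitSize` are all hypothetical names chosen for this example; they are not claimed to match the etcd raft API or the actual fix.

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for a raft log entry.
type Entry struct {
	Index uint64
	Data  []byte
}

// Config is a hypothetical stand-in for a raft configuration struct.
type Config struct {
	// MaxCommittedBytes is an assumed knob: the byte budget for one
	// batch of committed entries handed to the application.
	MaxCommittedBytes uint64
}

// limitSize returns the longest prefix of ents whose payload fits in
// maxBytes. It always returns at least one entry when ents is non-empty,
// so progress is possible even if a single entry exceeds the budget.
func limitSize(ents []Entry, maxBytes uint64) []Entry {
	if len(ents) == 0 {
		return ents
	}
	size := uint64(len(ents[0].Data))
	n := 1
	for ; n < len(ents); n++ {
		size += uint64(len(ents[n].Data))
		if size > maxBytes {
			break
		}
	}
	return ents[:n]
}

func main() {
	cfg := Config{MaxCommittedBytes: 10}
	log := []Entry{
		{Index: 1, Data: make([]byte, 4)},
		{Index: 2, Data: make([]byte, 4)},
		{Index: 3, Data: make([]byte, 4)},
	}
	batch := limitSize(log, cfg.MaxCommittedBytes)
	fmt.Println(len(batch)) // prints 2: the third entry would exceed the 10-byte budget
}
```

Returning at least one entry is a deliberate choice in this sketch: without it, a single entry larger than the budget would stall the log forever.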
Next in this saga: Once the large raft log has been committed, if the range is underreplicated (as it is in the cluster that experienced this), the nodes will try to add a new replica using a snapshot containing the entire raft log (without pagination). This will crash the sending node faster than it has a chance to truncate its raft log. Currently trying to mitigate with this patch:
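The mitigation patch is likewise not included here. To illustrate what "pagination" means in this context, the following Go sketch sends log entries in bounded pages instead of one unbounded batch. `LogEntry` and `sendInPages` are hypothetical names, and for brevity the entries sit in an in-memory slice; a real implementation would presumably read each page from storage so that the whole log is never resident at once.

```go
package main

import "fmt"

// LogEntry is a hypothetical stand-in for a raft log entry.
type LogEntry struct {
	Index uint64
	Data  []byte
}

// sendInPages calls send once per page. Each page holds at most pageBytes
// of payload, except that a single oversized entry is sent on its own.
func sendInPages(ents []LogEntry, pageBytes int, send func(page []LogEntry) error) error {
	start := 0
	for start < len(ents) {
		end := start + 1
		size := len(ents[start].Data)
		// Grow the page while the next entry still fits in the budget.
		for end < len(ents) && size+len(ents[end].Data) <= pageBytes {
			size += len(ents[end].Data)
			end++
		}
		if err := send(ents[start:end]); err != nil {
			return err
		}
		start = end
	}
	return nil
}

func main() {
	ents := make([]LogEntry, 5)
	for i := range ents {
		ents[i] = LogEntry{Index: uint64(i + 1), Data: make([]byte, 3)}
	}
	pages := 0
	err := sendInPages(ents, 7, func(page []LogEntry) error {
		pages++
		fmt.Printf("page %d: %d entries\n", pages, len(page))
		return nil
	})
	if err != nil {
		fmt.Println("send failed:", err)
	}
}
```

With five 3-byte entries and a 7-byte page budget, this produces three pages of 2, 2, and 1 entries.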
Up next: if a replica is rebalanced away while its raft log is large, the replica GC process will run out of memory. Again, this patch is a quick hack and a proper fix would involve finally replacing the
Picks up etcd-io/etcd#9982 and etcd-io/etcd#9985 (and no other changes to packages we use).
Fixes cockroachdb#27983
Fixes cockroachdb#27804
Release note (bug fix): Additional fixes for out-of-memory errors caused by very large raft logs.
Release note (performance improvement): Greatly improved performance when catching up followers that are behind when raft logs are large.
28511: vendor: Update etcd r=tschottdorf a=bdarnell
Picks up etcd-io/etcd#9982 and etcd-io/etcd#9985 (and no other changes to packages we use).
Fixes #27983
Fixes #27804
Release note (bug fix): Additional fixes for out-of-memory errors caused by very large raft logs.
Release note (performance improvement): Greatly improved performance when catching up followers that are behind when raft logs are large.
Co-authored-by: Ben Darnell <ben@bendarnell.com>
We have previously fixed issues (#26946) caused by large uncommitted raft logs. This has led to a new failure mode, as it is now possible for a very large raft log to become committed. This can cause OOM in a different part of the raft system. Fixing this requires upstream changes to raft.