storage: Large committed raft log leads to OOM #27804

Closed
bdarnell opened this issue Jul 20, 2018 · 3 comments · Fixed by #28511
Assignee: bdarnell
Labels: A-kv-replication (Relating to Raft, consensus, and coordination) · C-bug (Code not up to spec/doc; specs & docs deemed correct; solution expected to change code/behavior) · S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)
Milestone: 2.1

Comments

@bdarnell (Contributor)

We have previously fixed issues (#26946) caused by large uncommitted raft logs. That fix has led to a new failure mode: it is now possible for a very large raft log to become committed, which can then cause an OOM in a different part of the raft system. Fixing this requires upstream changes to raft.
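
For context, here is a rough sketch of the generic etcd/raft Ready loop (not CockroachDB's actual code; saveToStorage, send, and applyEntry are hypothetical placeholders). Every committed-but-unapplied entry comes back in a single rd.CommittedEntries slice, so a huge committed log gets materialized in memory all at once:

package raftloop

import (
	"github.com/coreos/etcd/raft"
	pb "github.com/coreos/etcd/raft/raftpb"
)

// runReadyLoop is a minimal driver loop for an etcd/raft Node, following the
// pattern documented in the raft package. Illustrative only.
func runReadyLoop(
	n raft.Node,
	done <-chan struct{},
	saveToStorage func(pb.HardState, []pb.Entry, pb.Snapshot),
	send func([]pb.Message),
	applyEntry func(pb.Entry),
) {
	for {
		select {
		case rd := <-n.Ready():
			// Persist new log entries and state before acting on the Ready.
			saveToStorage(rd.HardState, rd.Entries, rd.Snapshot)
			send(rd.Messages)
			// With no size limit, rd.CommittedEntries contains every
			// committed-but-unapplied entry; this is where a very large
			// committed log blows up memory.
			for _, ent := range rd.CommittedEntries {
				applyEntry(ent)
			}
			n.Advance()
		case <-done:
			return
		}
	}
}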

bdarnell self-assigned this on Jul 20, 2018
@bdarnell (Contributor, Author)

The solution looks something like this:

diff --git a/github.com/coreos/etcd/raft/log.go b/github.com/coreos/etcd/raft/log.go
index c3036d3..7a81eb3 100644
--- a/github.com/coreos/etcd/raft/log.go
+++ b/github.com/coreos/etcd/raft/log.go
@@ -139,7 +139,7 @@ func (l *raftLog) unstableEntries() []pb.Entry {
 func (l *raftLog) nextEnts() (ents []pb.Entry) {
        off := max(l.applied+1, l.firstIndex())
        if l.committed+1 > off {
-               ents, err := l.slice(off, l.committed+1, noLimit)
+               ents, err := l.slice(off, l.committed+1, 100*1024*1024)
                if err != nil {
                        l.logger.Panicf("unexpected error when getting unapplied entries (%v)", err)
                }
diff --git a/github.com/coreos/etcd/raft/node.go b/github.com/coreos/etcd/raft/node.go
index f3ba250..c5cd3b1 100644
--- a/github.com/coreos/etcd/raft/node.go
+++ b/github.com/coreos/etcd/raft/node.go
@@ -517,6 +517,12 @@ func newReady(r *raft, prevSoftSt *SoftState, prevHardSt pb.HardState) Ready {
                rd.SoftState = softSt
        }
        if hardSt := r.hardState(); !isHardStateEqual(hardSt, prevHardSt) {
+               if len(rd.CommittedEntries) > 0 {
+                       lastCommit := rd.CommittedEntries[len(rd.CommittedEntries)-1]
+                       if hardSt.Commit > lastCommit.Index {
+                               hardSt.Commit = lastCommit.Index
+                       }
+               }
                rd.HardState = hardSt
        }
        if r.raftLog.unstable.snapshot != nil {

This should be enough to recover a cluster that has gotten into this state. A longer-term fix needs testing, and the memory limit should be passed in from the raft.Config struct rather than hard-coded.
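
A rough sketch of what that plumbing might look like, purely for illustration (the Config field name MaxCommittedSizePerReady and the raftLog field maxNextEntsSize are assumptions here, not the actual upstream API):

// Hypothetical addition to raft.Config:
//
//	// MaxCommittedSizePerReady limits the total byte size of the
//	// committed entries returned in a single Ready.
//	MaxCommittedSizePerReady uint64
//
// The value would be copied onto raftLog when the node is constructed
// (defaulting to noLimit when zero), and nextEnts would use it in place of
// the hard-coded 100 MB above:

func (l *raftLog) nextEnts() (ents []pb.Entry) {
	off := max(l.applied+1, l.firstIndex())
	if l.committed+1 > off {
		ents, err := l.slice(off, l.committed+1, l.maxNextEntsSize)
		if err != nil {
			l.logger.Panicf("unexpected error when getting unapplied entries (%v)", err)
		}
		return ents
	}
	return nil
}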

petermattis added the C-bug, A-kv-replication, and S-1-stability labels on Jul 21, 2018
tbg added this to the 2.1 milestone on Jul 22, 2018
@bdarnell (Contributor, Author)

Next in this saga: once the large raft log has been committed, if the range is underreplicated (as it is in the cluster that experienced this), the nodes will try to add a new replica using a snapshot containing the entire raft log (without pagination). This will crash the sending node before it has a chance to truncate its raft log. Currently trying to mitigate with this patch:

diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index 6f0d90f..e6769a9 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -3712,10 +3712,15 @@ func sendSnapshot(
        firstIndex := header.State.TruncatedState.Index + 1
        endIndex := snap.RaftSnap.Metadata.Index + 1
        logEntries := make([][]byte, 0, endIndex-firstIndex)
+       logEntriesSize := 0
 
        scanFunc := func(kv roachpb.KeyValue) (bool, error) {
                bytes, err := kv.Value.GetBytes()
                if err == nil {
+                       logEntriesSize += len(bytes)
+                       if logEntriesSize > 100 * 1024 * 1024 {
+                               return false, errors.New("aborting snapshot because raft log too large")
+                       }
                        logEntries = append(logEntries, bytes)
                }
                return false, err
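
A longer-term alternative to simply aborting would be the pagination mentioned above: flush raft log entries to the snapshot stream in size-capped batches instead of buffering them all in one slice. A rough, purely illustrative sketch of that pattern follows; the names logEntryBatcher and sendBatch, and the choice of batch size, are assumptions, not CockroachDB's actual snapshot protocol.

// logEntryBatcher accumulates raft log entries and hands them off in
// size-capped batches instead of building one unbounded slice.
type logEntryBatcher struct {
	maxBatchBytes int
	sendBatch     func([][]byte) error // e.g. a write to the snapshot stream
	batch         [][]byte
	batchBytes    int
}

// add appends one encoded log entry, flushing the current batch first if
// adding the entry would exceed the size cap.
func (b *logEntryBatcher) add(entry []byte) error {
	if b.batchBytes+len(entry) > b.maxBatchBytes && len(b.batch) > 0 {
		if err := b.flush(); err != nil {
			return err
		}
	}
	b.batch = append(b.batch, entry)
	b.batchBytes += len(entry)
	return nil
}

// flush sends any buffered entries and resets the batch.
func (b *logEntryBatcher) flush() error {
	if len(b.batch) == 0 {
		return nil
	}
	err := b.sendBatch(b.batch)
	b.batch, b.batchBytes = nil, 0
	return err
}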

@bdarnell (Contributor, Author)

Up next: if a replica is rebalanced away while its raft log is large, the replica GC process will run out of memory. Again, this patch is a quick hack, and a proper fix would involve finally replacing the rditer.MakeAllKeyRanges interface (a rough sketch of what that could look like follows the diff).

diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index ba63fc7..6bea87d 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -783,11 +783,14 @@ func (r *Replica) destroyDataRaftMuLocked(
        defer batch.Close()
 
        ms := r.GetMVCCStats()
+       r.mu.Lock()
+       logCount := int64(r.mu.lastIndex - r.mu.state.TruncatedState.Index)
+       r.mu.Unlock()
 
        // NB: this uses the local descriptor instead of the consistent one to match
        // the data on disk.
        desc := r.Desc()
-       if err := clearRangeData(ctx, desc, ms.KeyCount, r.store.Engine(), batch); err != nil {
+       if err := clearRangeData(ctx, desc, ms.KeyCount, logCount, r.store.Engine(), batch); err != nil {
                return err
        }
        clearTime := timeutil.Now()
diff --git a/pkg/storage/replica_raftstorage.go b/pkg/storage/replica_raftstorage.go
index 95d4393..ef80aee 100644
--- a/pkg/storage/replica_raftstorage.go
+++ b/pkg/storage/replica_raftstorage.go
@@ -638,7 +638,7 @@ const (
 func clearRangeData(
        ctx context.Context,
        desc *roachpb.RangeDescriptor,
-       keyCount int64,
+       keyCount, logCount int64,
        eng engine.Engine,
        batch engine.Batch,
 ) error {
@@ -653,20 +653,30 @@ func clearRangeData(
        // perhaps we should fix RocksDB to handle large numbers of tombstones in an
        // sstable better.
        const clearRangeMinKeys = 64
-       const metadataRanges = 2
-       for i, keyRange := range rditer.MakeAllKeyRanges(desc) {
-               // Metadata ranges always have too few keys to justify ClearRange (see
-               // above), but the data range's key count needs to be explicitly checked.
-               var err error
-               if i >= metadataRanges && keyCount >= clearRangeMinKeys {
-                       err = batch.ClearRange(keyRange.Start, keyRange.End)
-               } else {
-                       err = batch.ClearIterRange(iter, keyRange.Start, keyRange.End)
-               }
-               if err != nil {
-                       return err
+       keyRanges := rditer.MakeAllKeyRanges(desc)
+
+       clearKeyRange := func(kr rditer.KeyRange, count int64) error {
+               if count > clearRangeMinKeys {
+                       return batch.ClearRange(kr.Start, kr.End)
                }
+               return batch.ClearIterRange(iter, kr.Start, kr.End)
        }
+       // rangeid key span, mainly holds the raft log (and the abort span,
+       // not accounted for here)
+       if err := clearKeyRange(keyRanges[0], logCount); err != nil {
+               return err
+       }
+       // range-local keys span, mainly holds transaction records. No good
+       // accounting for this, so we assume it is too small to need
+       // ClearRange.
+       if err := clearKeyRange(keyRanges[1], 0); err != nil {
+               return err
+       }
+       // data key span.
+       if err := clearKeyRange(keyRanges[2], keyCount); err != nil {
+               return err
+       }
+
        return nil
 }
 
@@ -688,6 +698,7 @@ func (r *Replica) applySnapshot(
        r.mu.RLock()
        replicaID := r.mu.replicaID
        keyCount := r.mu.state.Stats.KeyCount
+       logCount := int64(r.mu.lastIndex - r.mu.state.TruncatedState.Index)
        r.mu.RUnlock()
 
        snapType := inSnap.snapType
@@ -764,7 +775,7 @@ func (r *Replica) applySnapshot(
        // Delete everything in the range and recreate it from the snapshot.
        // We need to delete any old Raft log entries here because any log entries
        // that predate the snapshot will be orphaned and never truncated or GC'd.
-       if err := clearRangeData(ctx, s.Desc, keyCount, r.store.Engine(), batch); err != nil {
+       if err := clearRangeData(ctx, s.Desc, keyCount, logCount, r.store.Engine(), batch); err != nil {
                return err
        }
        stats.clear = timeutil.Now()
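
On that last point: the quick hack above has to index keyRanges[0], keyRanges[1], and keyRanges[2] by position and explain each in a comment. A replacement for rditer.MakeAllKeyRanges of the kind alluded to might return named spans instead, so callers like clearRangeData could attach per-span key counts without relying on slice ordering. This is a hypothetical sketch only, reusing the rditer.KeyRange and roachpb.RangeDescriptor types from the patch above; the type and constructor names are not real CockroachDB APIs.

// replicaKeySpans is an illustrative, named-field alternative to the
// positional slice returned by rditer.MakeAllKeyRanges.
type replicaKeySpans struct {
	// RangeIDLocal covers RangeID-local keys, mainly the raft log
	// (and the abort span).
	RangeIDLocal rditer.KeyRange
	// RangeLocal covers range-local keys, mainly transaction records.
	RangeLocal rditer.KeyRange
	// Data covers the user data span.
	Data rditer.KeyRange
}

// makeReplicaKeySpans adapts the existing interface; a real fix would
// presumably construct the named spans directly.
func makeReplicaKeySpans(desc *roachpb.RangeDescriptor) replicaKeySpans {
	kr := rditer.MakeAllKeyRanges(desc)
	return replicaKeySpans{RangeIDLocal: kr[0], RangeLocal: kr[1], Data: kr[2]}
}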

bdarnell added a commit to bdarnell/cockroach that referenced this issue on Aug 13, 2018, carrying the same commit message as #28511 below.
craig bot pushed a commit that referenced this issue on Aug 13, 2018
28511: vendor: Update etcd r=tschottdorf a=bdarnell

Picks up etcd-io/etcd#9982 and etcd-io/etcd#9985 (and no other changes
to packages we use).

Fixes #27983
Fixes #27804

Release note (bug fix): Additional fixes for out-of-memory errors
caused by very large raft logs.

Release note (performance improvement): Greatly improved performance
when catching up followers that are behind when raft logs are large.

Co-authored-by: Ben Darnell <ben@bendarnell.com>
craig bot closed this as completed in #28511 on Aug 13, 2018