142239: raft: introduce term cache r=pav-kv,tbg a=hakuuww
This PR adds a new sub-structure, `termCache`, to `raftLog`. It stores a suffix of the `raftLog` in a compressed representation and helps look up the term of a particular raft entry.
`termCache` integrates tightly with `raftLog`: it relies on many invariants that `raftLog` guarantees, which allows a concise implementation.
---
First of all, this is an example of what a raftLog may look like:
entryID: term/index
`[t5/10, t5/11, t5/12, t5/13, t6/14, t6/15, t6/16, t6/17, t6/18, t6/19, t7/20, t7/21, t7/22, t7/23, t7/24, t7/25, t7/26, t7/27, t8/28, t8/29, t8/30, t8/31, t10/32, t10/33, t10/34]`
Properties of a raftLog:
- `entryID.index` values are strictly increasing and contiguous.
- `entryID.term` values are monotonically non-decreasing, and can have gaps in between.
---
Those properties allow us to use term change points to express a long, continuous raftLog.
The above example raftLog can be expressed as the following in our `termCache` representation:
(each entry is a term change point)
`[t5/10, t6/14, t7/20, t8/28, t10/32]`
In practice, a raftLog may be hundreds of entries long, but with only a few term changes in between.
So this compressed representation allows us to represent a long raftLog's entryIDs cheaply.
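The compression described above can be sketched in a few lines of Go. This is an illustration only, not the code in this PR: the `entryID` field layout and the `compress` and `exampleLog` helpers are assumptions made for the example.

```go
package main

import "fmt"

// entryID identifies a raft log entry by term and index. The name follows the
// PR description; the field layout is an assumption for this sketch.
type entryID struct {
	term  uint64
	index uint64
}

// compress keeps the first entry and every entry at which the term changes.
// It assumes indices are contiguous and terms never decrease.
func compress(log []entryID) []entryID {
	var points []entryID
	for _, e := range log {
		if len(points) == 0 || points[len(points)-1].term != e.term {
			points = append(points, e)
		}
	}
	return points
}

// exampleLog rebuilds the 25-entry log from the description above.
func exampleLog() []entryID {
	runs := []struct{ term, first, last uint64 }{
		{5, 10, 13}, {6, 14, 19}, {7, 20, 27}, {8, 28, 31}, {10, 32, 34},
	}
	var log []entryID
	for _, r := range runs {
		for i := r.first; i <= r.last; i++ {
			log = append(log, entryID{term: r.term, index: i})
		}
	}
	return log
}

func main() {
	// Prints the five term change points of the example log.
	for _, p := range compress(exampleLog()) {
		fmt.Printf("t%d/%d ", p.term, p.index)
	}
	fmt.Println()
}
```

For the example log, 25 entries collapse into the 5 change points listed above.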
---
One immediate benefit is that there should no longer be any [raftEntry cache accesses or pebble calls when we want to know the term of an entry persisted in storage](https://github.com/cockroachdb/cockroach/blob/e587879be8cd0f1ace03952decf6dda2573f0b56/pkg/kv/kvserver/logstore/logstore.go#L614-L639). (This assumes term flips are rare. We can still hit pebble if we want the term of a very early entry that is more than `termCacheSize` terms old.)
This helps avoid:
- unhelpful evictions on the raftEntry cache
- pebble access
Currently, neither of the above scenarios incurs a big cost, but we can still save a few
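A minimal sketch of how a term lookup against the change points might work, assuming the cache is a slice sorted by index and that the caller supplies the log's last index. `termAt` and its signature are hypothetical and are not the API added by this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// entryID is a term/index pair; the layout is assumed for this sketch.
type entryID struct{ term, index uint64 }

// termAt returns the term of the entry at index using only the term change
// points. ok is false when the index is older than the first cached change
// point or newer than lastIndex; a caller would then fall back to storage.
func termAt(points []entryID, lastIndex, index uint64) (term uint64, ok bool) {
	if len(points) == 0 || index < points[0].index || index > lastIndex {
		return 0, false
	}
	// Find the first change point that starts after index; the one just
	// before it covers index. i >= 1 because points[0].index <= index.
	i := sort.Search(len(points), func(i int) bool {
		return points[i].index > index
	})
	return points[i-1].term, true
}

func main() {
	points := []entryID{{5, 10}, {6, 14}, {7, 20}, {8, 28}, {10, 32}}
	for _, idx := range []uint64{9, 17, 33} {
		term, ok := termAt(points, 34, idx)
		fmt.Println(idx, term, ok)
	}
}
```

With the example log, index 17 resolves to term 6 and index 33 to term 10 without touching the entry cache or pebble, while index 9 is a miss because it precedes the first cached change point.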
---
A second benefit: since we now keep a compressed representation of a suffix of the raftLog, we can use it to carry more information when the raft leader probes a follower.
Currently, a raft message MsgAppResp{reject = true} from the follower only carries a single hintIndex and hintTerm.
With the term cache, we can include more information about the follower's raftLog in its MsgAppResp with relatively low overhead, which can reduce the number of rtts involved in the leader/follower probing process.
Assuming we keep a few (say 4) term change points in the `termCache`, we can attach all 4 of those data points to our raft RPC messages, which should be enough to cover the whole raftLog of a raft node.
The term cache covers entryIDs in the following range:
`[raftLog.first, raftLog.last]`
or something like:
`[entryID at committed index, raftLog.lastIndex]`
(In a real implementation we also need to attach a lastIndex, which the term cache doesn't keep, but which is kept in unstable/raftLog.)
When receiving this `termCache` information from a `MsgAppResp{reject=true}` or `MsgVoteResp`, the leader can immediately determine the exact fork point from which to send the next MsgApp, instead of spending a few probing rtts to find it.
(Our current probing algorithm may take 2-3 rtts between leader and follower to find a fork point in a bad case involving multiple leadership changes and partitions.)
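To make the probing idea concrete, here is a rough sketch of how a leader could compute the fork point from two sets of term change points. This is not part of this PR or of the current raft implementation: `logSummary`, `runEnd`, and `forkPoint` are hypothetical names, and the follower log in `main` is made up for the example.

```go
package main

import "fmt"

// entryID is a term/index pair; the layout is assumed for this sketch.
type entryID struct{ term, index uint64 }

// logSummary is a hypothetical payload: term change points plus the last
// index, which the term cache itself does not keep.
type logSummary struct {
	points    []entryID
	lastIndex uint64
}

// runEnd returns the last index covered by the i-th change point.
func (s logSummary) runEnd(i int) uint64 {
	if i+1 < len(s.points) {
		return s.points[i+1].index - 1
	}
	return s.lastIndex
}

// forkPoint returns the highest entry on which both summaries agree, i.e.
// the highest index at which both logs hold the same term. It walks both
// change point lists from the newest term to the oldest.
func forkPoint(a, b logSummary) (entryID, bool) {
	i, j := len(a.points)-1, len(b.points)-1
	for i >= 0 && j >= 0 {
		pa, pb := a.points[i], b.points[j]
		switch {
		case pa.term > pb.term:
			i-- // b has no entries at this term or newer
		case pa.term < pb.term:
			j-- // a has no entries at this term or newer
		default:
			lo, hi := pa.index, a.runEnd(i)
			if pb.index > lo {
				lo = pb.index
			}
			if e := b.runEnd(j); e < hi {
				hi = e
			}
			if lo <= hi {
				return entryID{term: pa.term, index: hi}, true
			}
			i--
			j--
		}
	}
	return entryID{}, false
}

func main() {
	leader := logSummary{[]entryID{{5, 10}, {6, 14}, {7, 20}, {8, 28}, {10, 32}}, 34}
	// Made-up follower that diverged at index 24 under a term 9 leader.
	follower := logSummary{[]entryID{{5, 10}, {6, 14}, {7, 20}, {9, 24}}, 26}
	fp, ok := forkPoint(leader, follower)
	fmt.Println(fp.term, fp.index, ok)
}
```

In this made-up case the logs agree up to t7/23, so the leader could send its next MsgApp starting at index 24 after a single rejection.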
Part of #136296
Epic: None
Release note: None
143127: kvserver: add per-operation lock reliability settings r=yuzefovich a=stevendanna
Preserving unreplicated locks during splits, merges, and lease transfers involves different trade-offs. For instance, during a split all lock updates are done in memory without any new replicated writes, whereas merges and lease transfers require replicating locks through raft.
Here, we put the different operations under different settings since we may want to ship different defaults for the different operations.
Epic: none
Release note: None
Co-authored-by: Anthony Xu <anthony.xu@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>