RIP-67 jRaft-Controller Implementation
- Current State: Accepted
- Authors: yulangz
- Shepherds: zhouxinyu,rongtong,fuyou
- Mailing List discussion: dev@rocketmq.apache.org
- Pull Request: https://github.com/apache/rocketmq/pull/7301
- Released: <released_version>
- Google doc: https://docs.google.com/document/d/1mpzTv1vnWxQwPGsHj6Ng2fK9aL9f6MZFw7ZgvW5284o/edit
- Will we add a new module?
No new modules will be added, but a new implementation will be added for the Controller interface.
- Will we add new APIs?
No client-level or admin-tool APIs will be added or modified. Some new internal interfaces and APIs will be introduced.
- Will we add a new feature?
No. The JRaft Controller is a new implementation, not a new feature.
- Are there any problems of our current project?
Yes, there are some issues with the current DLedger Controller:
DLedger was designed as a Raft-based storage library specifically for the RocketMQ CommitLog, and it therefore makes several design choices that are specialized for the CommitLog. For example, DLedger does not implement snapshot-based log truncation; instead, it relies on an expiration mechanism that simply discards logs once they exceed their retention time. This works well for a CommitLog store, where expired logs can safely be dropped. However, it is not well suited to providing distributed consensus and consistency guarantees for an upper-layer state machine: after an unexpected machine failure, restoring the in-memory state machine requires replaying the Raft logs one by one, so all logs must be retained and the expiration-based deletion cannot be enabled. Without a snapshot interface, DLedger's logs grow without bound and eventually exceed the machine's disk capacity, and the time needed for fault recovery also grows without bound, which is unacceptable.
In addition, as shown in the following figure, the design of the DLedger Controller does not satisfy linearizability:
The core function of the Controller is to manage the liveness status of the nodes and the SyncStateSet in order to achieve automatic election of the Master.
Let's describe the workflow of the DLedger Controller using the example of an AlterSyncStateSet request:
1. The Master Broker generates an AlterSyncStateSet request, which includes the desired SyncStateSet to switch to.
2. DLedgerController queries the current SyncStateSet from ReplicasInfoManager and generates a response action based on it (e.g., adding/removing nodes from the SyncStateSet).
3. DLedgerController submits this response action (event) to DLedger Raft. DLedger is responsible for replicating this event to the other Controller nodes. Once consensus is reached, the event is submitted to the DLedgerControllerStateMachine.
4. DLedgerControllerStateMachine modifies the data in ReplicasInfoManager based on the event.
5. The Broker's heartbeat reaches the BrokerLifecycleListener through a separate link that is not routed through Raft, and is detected by ReplicasInfoManager.
The above workflow has a significant issue: it does not satisfy linearizability. Because a request is processed before it goes through Raft, the response action may be generated from outdated data. Let's illustrate this with an example (a minimal sketch of the flawed pattern follows the example):
- Suppose there is a Broker Master A that triggers two consecutive AlterSyncStateSet requests.
- The initial SyncStateSet in ReplicasInfoManager is {A, B}.
- Of the two AlterSyncStateSet requests, the first is {A, B, C} and the second is {A, B} (removing node C).
- Assume that the first request completes step 2, generating an event to insert node C into the SyncStateSet. It is currently in the middle of Raft replication (step 3) and has not reached step 4 yet.
- At this point, the second request arrives at the Controller. Since the SyncStateSet is still {A, B}, the Controller assumes that the SyncStateSet has not changed and directly returns a Fail response to the requesting client; it does not submit anything to Raft (based on the code logic).
- Finally, the first request completes step 3, the data is broadcast to all Controller nodes, and step 4 eventually completes by inserting node C into the SyncStateSet.
- As a result, the final state of the SyncStateSet is {A, B, C}, while the expected state is {A, B}.
The issue of inconsistent metadata within the Controller (see apache/rocketmq Pull Request #4442, "[Summer of code] Let controller become role state after append initial logs" by hzh0425) stems from this problem.
Similarly, the heartbeat management, which operates independently of the Raft link, can also encounter problems.
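To make the race concrete, here is a minimal, hypothetical sketch of the flawed pattern. The class, types, and encoding are illustrative assumptions and not the actual DLedgerController code; the point is only that the current SyncStateSet is read and the response is decided before the change has been ordered by Raft, so two concurrent requests can observe the same stale state.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch (not actual DLedgerController code) of the flawed ordering:
// the decision is made from the in-memory SyncStateSet BEFORE Raft consensus,
// so a request can be rejected based on state that is about to change.
public class NonLinearizableAlterFlow {

    // In-memory metadata, analogous to ReplicasInfoManager (only updated in step 4).
    private final Set<String> syncStateSet = new HashSet<>(Arrays.asList("A", "B"));

    // Stand-in for the Raft layer: replicates the event and applies it later (steps 3 and 4).
    private final Consumer<Set<String>> raftAppend;

    public NonLinearizableAlterFlow(Consumer<Set<String>> raftAppend) {
        this.raftAppend = raftAppend;
    }

    public boolean alterSyncStateSet(Set<String> requested) {
        // Step 2: read the CURRENT (possibly stale) SyncStateSet outside of Raft.
        if (syncStateSet.equals(requested)) {
            // The second request {A, B} in the example takes this branch and fails,
            // even though the first request {A, B, C} is still in flight inside Raft.
            return false;
        }
        // Step 3: only now is the already-decided change handed to Raft;
        // it reaches the state machine (step 4) some time later.
        raftAppend.accept(requested);
        return true;
    }
}
```

Because the check (step 2) and the apply (step 4) are not ordered by Raft, the outcome depends on request interleaving. The JRaft Controller avoids this by moving the decision into the state machine, as described later in this proposal.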
- What can we benefit from the proposed changes?
Through this proposal, users can use the JRaft Controller in place of the DLedger Controller. The JRaft Controller implements the snapshot function: it periodically takes snapshots of the state machine and truncates the Raft logs, avoiding unbounded log growth. In addition, the JRaft Controller was refactored during its design to avoid the linearizability issue described above.
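For reference, below is a minimal sketch of how state-machine snapshots can be implemented on top of SOFAJRaft's StateMachineAdapter. It is an illustration under simplifying assumptions (the Controller metadata is reduced to a single string), not the actual RocketMQ implementation.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.alipay.sofa.jraft.Closure;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;
import com.alipay.sofa.jraft.error.RaftError;
import com.alipay.sofa.jraft.storage.snapshot.SnapshotReader;
import com.alipay.sofa.jraft.storage.snapshot.SnapshotWriter;

// Illustrative sketch: the real JRaft Controller serializes the full Controller
// metadata (replica info, SyncStateSet, liveness data); here the state is a
// single string so the example stays self-contained.
public class ControllerStateMachineSketch extends StateMachineAdapter {

    private static final String SNAPSHOT_FILE = "controller_metadata";

    // Stand-in for the Controller's in-memory metadata.
    private volatile String metadata = "";

    @Override
    public void onSnapshotSave(SnapshotWriter writer, Closure done) {
        try {
            // Dump the in-memory state into the snapshot directory.
            Files.write(Paths.get(writer.getPath(), SNAPSHOT_FILE),
                metadata.getBytes(StandardCharsets.UTF_8));
            // Register the file so jRaft can ship it to followers and use it on restart;
            // once the snapshot is durable, the logs before it can be truncated.
            if (writer.addFile(SNAPSHOT_FILE)) {
                done.run(Status.OK());
            } else {
                done.run(new Status(RaftError.EIO, "Failed to add snapshot file"));
            }
        } catch (IOException e) {
            done.run(new Status(RaftError.EIO, "Failed to save snapshot: %s", e.getMessage()));
        }
    }

    @Override
    public boolean onSnapshotLoad(SnapshotReader reader) {
        if (reader.getFileMeta(SNAPSHOT_FILE) == null) {
            return false;
        }
        try {
            // Rebuild the in-memory state; only logs written after this snapshot
            // need to be replayed on recovery.
            File file = new File(reader.getPath(), SNAPSHOT_FILE);
            metadata = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

With a snapshot interval configured (jRaftSnapshotIntervalSecs in the example configuration later in this proposal), jRaft triggers onSnapshotSave periodically, which bounds both disk usage and recovery time.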
- What problem is this proposal designed to solve?
  - The unbounded growth of Raft logs caused by the incomplete design of DLedger.
  - The linearizability issues caused by the incomplete design of the DLedger Controller.
- What problem is this proposal NOT designed to solve?
This proposal does not introduce a new multi-replica storage mechanism; it is an improvement on the existing architecture.
An example Controller configuration for enabling the jRaft Controller:
# Controller type: jRaft or DLedger; defaults to DLedger
controllerType=jRaft
# jRaft-related settings
# Election timeout, default 1 second
jRaftElectionTimeoutMs=1000
# Snapshot interval, default 1 hour; a longer interval such as 1 day or 3 days is recommended
jRaftSnapshotIntervalSecs=3600
# Group id
jRaftGroupId=jRaft-Controller
# Address of the local jRaft node
jRaftServerId=localhost:9880
# Addresses of the jRaft group
jRaftInitConf=localhost:9880,localhost:9881,localhost:9882
# In the jRaft Controller, jRaft and the RPCService provided to Brokers do not share socket resources.
# The setting below is the address on which each Controller listens for Broker RPCs. Note that the IPs
# and ports must correspond one-to-one with jRaftInitConf.
jRaftControllerRPCAddr=localhost:9770,localhost:9771,localhost:9772
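Assuming the standalone Controller startup from RocketMQ 5.x, each of the three Controller nodes would be started with its own copy of such a file (with jRaftServerId adjusted per node), for example via `sh bin/mqcontroller -c controller.conf`, and Brokers running in controller mode would point their controllerAddr at the ports listed in jRaftControllerRPCAddr (9770-9772 in this example) rather than at the jRaft ports.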
In the JRaft Controller, all request handling and response generation are pushed down to the state machine layer. In addition, the Broker's heartbeat is treated as a request: it goes through Raft replication and broadcasting before being applied to the state machine.
This design ensures two points (a minimal sketch follows the list below):
- All requests are processed, and responses are generated, at the state machine layer, eliminating the possibility of generating response actions from outdated data and ensuring linearizability.
- The liveness status of the nodes is reported to the state machine through Raft, allowing it to be recovered from the Raft logs.

By incorporating these mechanisms, JRaftController ensures that both request processing and node liveness status are handled consistently and reliably within the system.
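A minimal sketch of this pattern on top of SOFAJRaft is shown below. The event encoding and the AlterSyncStateSet handling are illustrative assumptions rather than the actual JRaftController code; the point is that the current SyncStateSet is read and the response is decided inside onApply, i.e., after Raft has already ordered the request.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;

// Illustrative sketch: every AlterSyncStateSet request is replicated through Raft
// first and only evaluated here, inside onApply, against the state that all
// replicas agree on. Events are encoded as a comma-separated broker list purely
// to keep the example self-contained.
public class LinearizableControllerStateMachine extends StateMachineAdapter {

    // Stand-in for ReplicasInfoManager's SyncStateSet.
    private final Set<String> syncStateSet = new HashSet<>(Arrays.asList("A", "B"));

    @Override
    public void onApply(Iterator iter) {
        while (iter.hasNext()) {
            Set<String> requested = decode(iter.getData());

            // The decision is made HERE, after Raft ordering, so two concurrent
            // AlterSyncStateSet requests are applied one after the other and the
            // second one always sees the effect of the first.
            boolean changed = !syncStateSet.equals(requested);
            if (changed) {
                syncStateSet.clear();
                syncStateSet.addAll(requested);
            }

            // On the leader, iter.done() is the closure attached to the submitted
            // Task; it carries the response back to the caller.
            if (iter.done() != null) {
                iter.done().run(changed ? Status.OK()
                    : new Status(-1, "SyncStateSet not changed"));
            }
            iter.next();
        }
    }

    private static Set<String> decode(ByteBuffer data) {
        String s = StandardCharsets.UTF_8.decode(data).toString();
        return new HashSet<>(Arrays.asList(s.split(",")));
    }
}
```

On the leader side, each request (including heartbeats) would be wrapped into a Task carrying the serialized event plus a response closure and submitted with Node.apply(task), which is the standard SOFAJRaft write path.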
In the implementation of heartbeats, JRaftController fixes the heartbeat timestamp at the RaftBrokerLifecycleListener instead of taking the time inside the StateMachine when the heartbeat is applied. This ensures that every Controller node records the same heartbeat time for a given Broker.
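A small sketch of this idea is shown below. RaftBrokerLifecycleListener is named in this proposal, but the method signature and the event encoding here are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.alipay.sofa.jraft.Node;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.entity.Task;

// Illustrative sketch: the heartbeat timestamp is taken once, at the listener,
// and embedded into the event that is replicated through Raft. Every replica
// then applies exactly the same timestamp instead of each node reading its own
// clock when the event reaches its state machine.
public class RaftBrokerLifecycleListenerSketch {

    private final Node raftNode; // the Controller's jRaft node

    public RaftBrokerLifecycleListenerSketch(Node raftNode) {
        this.raftNode = raftNode;
    }

    public void onBrokerHeartbeat(String brokerAddr) {
        // Fix the timestamp HERE, before Raft replication.
        long heartbeatTimeMillis = System.currentTimeMillis();

        // Hypothetical encoding: "brokerAddr,timestamp".
        byte[] payload = (brokerAddr + "," + heartbeatTimeMillis)
            .getBytes(StandardCharsets.UTF_8);

        Task task = new Task();
        task.setData(ByteBuffer.wrap(payload));
        task.setDone((Status status) -> {
            // A real implementation would propagate a failed status back to the Broker.
        });
        raftNode.apply(task);
    }
}
```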
Perhaps other Raft libraries could be used to implement the Controller.
Same as the JRaft Controller.
This depends on the Raft library actually used; different Raft libraries have different characteristics.
JRaft is mature enough, has undergone large-scale production testing, and is capable of providing the functions we need.