
functional-tester: clean up, handle Operation_SIGQUIT_ETCD_AND_REMOVE_DATA #9548

Merged
merged 16 commits into etcd-io:master from functional-tester
Apr 9, 2018

Conversation

gyuho
Contributor

@gyuho gyuho commented Apr 9, 2018

Adding Operation_SIGQUIT_ETCD_AND_REMOVE_DATA for membership reconfiguration tests #9150.

In follow-up PRs, I will be adding:

// SIGQUIT_AND_REMOVE_ONE_FOLLOWER stops a randomly chosen follower
// (non-leader), deletes its data directories on disk, and removes
// this member from cluster (membership reconfiguration). On recovery,
// tester adds a new member, and this member joins the existing cluster
// with fresh data. It waits "failure-delay-ms" before recovering this
// failure. This simulates destroying one follower machine, where operator
// needs to add a new member from a fresh machine.
// The expected behavior is that a new member joins the existing cluster,
// and then each member continues to process client requests.
SIGQUIT_AND_REMOVE_ONE_FOLLOWER = 10;

// SIGQUIT_AND_REMOVE_ONE_FOLLOWER_UNTIL_TRIGGER_SNAPSHOT stops a randomly
// chosen follower, deletes its data directories on disk, and removes
// this member from cluster (membership reconfiguration). On recovery,
// tester adds a new member, and this member joins the existing cluster
// with fresh data. On member remove, cluster waits until most up-to-date
// node (leader) applies the snapshot count of entries since the stop
// operation. This simulates destroying one follower machine, where
// operator needs to add a new member from a fresh machine.
// The expected behavior is that a new member joins the existing cluster,
// and receives a snapshot from the active leader. As always, after
// recovery, each member must be able to process client requests.
SIGQUIT_AND_REMOVE_ONE_FOLLOWER_UNTIL_TRIGGER_SNAPSHOT = 11;

// SIGQUIT_AND_REMOVE_LEADER stops the active leader node, deletes its
// data directories on disk, and removes this member from cluster.
// On recovery, tester adds a new member, and this member joins the
// existing cluster with fresh data. It waits "failure-delay-ms" before
// recovering this failure. This simulates destroying a leader machine,
// where operator needs to add a new member from a fresh machine.
// The expected behavior is that a new member joins the existing cluster,
// and then each member continues to process client requests.
SIGQUIT_AND_REMOVE_LEADER = 12;

// SIGQUIT_AND_REMOVE_LEADER_UNTIL_TRIGGER_SNAPSHOT stops the active leader,
// deletes its data directories on disk, and removes this member from
// cluster (membership reconfiguration). On recovery, tester adds a new
// member, and this member joins the existing cluster with fresh data. On member
// remove, cluster waits until most up-to-date node (new leader) applies
// the snapshot count of entries since the stop operation. This simulates
// destroying a leader machine, where operator needs to add a new member
// from a fresh machine.
// The expected behavior is that on member remove, cluster elects a new
// leader, and a new member joins the existing cluster and receives a
// snapshot from the newly elected leader. As always, after recovery, each
// member must be able to process client requests.
SIGQUIT_AND_REMOVE_LEADER_UNTIL_TRIGGER_SNAPSHOT = 13;

// SIGQUIT_AND_REMOVE_QUORUM_AND_ALL first stops a majority of nodes and
// deletes their data directories on disk, making the whole cluster
// inoperable. The tester then cannot even remove the stopped members,
// since quorum is lost.
// Let's assume a 3-node cluster of nodes A, B, and C. One day, nodes A
// and B are destroyed and all their data is gone. The only viable
// solution is to recover from C's latest snapshot. When nodes A and B
// become unavailable, tester also kills the last node C, creates a
// single-node cluster from scratch, and adds more nodes to establish a
// multi-node cluster.
// The expected behavior is that etcd successfully recovers from such a
// disastrous situation where only 1 node survives out of a 3-node
// cluster, new members join the existing cluster, and previous data from
// the snapshot is still preserved after the recovery process. As always,
// after recovery, each member must be able to process client requests.
SIGQUIT_AND_REMOVE_QUORUM_AND_ALL = 14;
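
For reference, a minimal sketch (not part of this PR) of what one such round might look like from the tester side, assuming direct access to the member's process and a clientv3 connection to a healthy member. The function name sigquitAndRemoveFollower and its inputs (followerPID, followerID, dataDir, peerURL) are hypothetical, and the clientv3 import path depends on the etcd version:

package tester

import (
	"context"
	"os"
	"syscall"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// sigquitAndRemoveFollower sketches a single SIGQUIT_AND_REMOVE_ONE_FOLLOWER
// round: SIGQUIT the follower process, wipe its data directories, remove the
// member from the cluster, then add a fresh member in its place.
// followerPID, followerID, dataDir, and peerURL are hypothetical inputs.
func sigquitAndRemoveFollower(cli *clientv3.Client, followerPID int, followerID uint64, dataDir, peerURL string) error {
	// 1. Stop the follower and delete its data directories on disk.
	if err := syscall.Kill(followerPID, syscall.SIGQUIT); err != nil {
		return err
	}
	if err := os.RemoveAll(dataDir); err != nil {
		return err
	}

	// 2. Membership reconfiguration: remove the dead member from the cluster.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if _, err := cli.MemberRemove(ctx, followerID); err != nil {
		return err
	}

	// 3. On recovery, add a new member; the new process would then be started
	//    with --initial-cluster-state=existing so it joins the existing
	//    cluster with fresh data.
	_, err := cli.MemberAdd(ctx, []string{peerURL})
	return err
}

The *_UNTIL_TRIGGER_SNAPSHOT variants would additionally wait, after the member-remove step, until the leader has applied at least --snapshot-count more entries, so that the rejoining member is forced to receive a snapshot instead of replaying the log.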

gyuho added 7 commits April 7, 2018 10:00
gyuho added 2 commits April 9, 2018 10:16
@gyuho gyuho merged commit 10a51a3 into etcd-io:master Apr 9, 2018
@gyuho gyuho deleted the functional-tester branch April 9, 2018 17:20
gyuho added 7 commits April 9, 2018 10:22
For later "runner" cleanup

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>