[Bug]: Etcd Pods Entering CrashLoopBackOff After Restart, Leading to Milvus Cluster Degradation #40575

Open
1 task done
aashishsingh2803 opened this issue Mar 11, 2025 · 4 comments
Assignees
Labels
area/deployment any issues related to deployment component/etcd kind/bug Issues or changes related to a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@aashishsingh2803

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.5.5
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

While testing the reliability of the Milvus cluster, I forcefully terminated all five etcd pods at the same time. After the termination, new pods were automatically recreated; however, most of them entered a continuous CrashLoopBackOff state, resulting in cluster degradation.

Upon investigating the etcd cluster state, we found that four out of five nodes were no longer part of the etcd cluster, causing significant instability.

Please investigate this issue and provide guidance on resolving it. If additional details are required, let us know.

Expected Behavior

No response

Steps To Reproduce

1. Kill all etcd pods at the same time using k9s (Ctrl + K).
2. Wait for new etcd pods to be created.
3. Observe that most of the new pods enter a CrashLoopBackOff state.
4. Check etcd cluster health (`etcdctl endpoint status --cluster`, `etcdctl member list`).
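
For reference, a minimal health-check sequence, assuming the pod names from the attached logs and the `milvus` namespace visible in the client URLs (adjust to your deployment):

      # run etcdctl from inside one of the etcd pods; --cluster queries every member
      kubectl -n milvus exec etcd-milvusetcdupgrade1-0 -- etcdctl endpoint status --cluster -w table
      kubectl -n milvus exec etcd-milvusetcdupgrade1-0 -- etcdctl member list -w table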

Milvus Log

Attached are the logs of the etcd cluster:
etcd-milvusetcdupgrade1-0.log
etcd-milvusetcdupgrade1-1.log
etcd-milvusetcdupgrade1-2.log
etcd-milvusetcdupgrade1-3.log
etcd-milvusetcdupgrade1-4.log

Anything else?

No response

@aashishsingh2803 aashishsingh2803 added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 11, 2025

dosubot bot commented Mar 11, 2025

@aashishsingh2803 Hello there! I'm here to assist you with any bugs, questions, or contribution-related matters. Let's work together to resolve the issue you're facing.

A similar issue was previously reported, where etcd pods entered a CrashLoopBackOff state after a forced restart, leading to Milvus cluster degradation. Unfortunately, no solution was found in the comments of that issue [1].

To resolve this issue, you can follow these steps:

  1. If multiple pods keep crashing while etcd-0 is still running:

    • Scale down the etcd stateful set to one replica:
      kubectl scale sts <etcd-sts> --replicas=1
    • Delete the persistent volume claims (PVC) for etcd-1 and etcd-2.
    • Scale the etcd stateful set back to three replicas:
      kubectl scale sts <etcd-sts> --replicas=3
  2. If all pods crash (a consolidated sketch of this path follows these steps):

    • Copy the /bitnami/etcd/data/member/snap/db file.
    • Use https://github.com/etcd-io/bbolt to modify the database data.
    • Back up the data in the key bucket.
    • Run the following commands:
      kubectl scale sts <etcd-sts> --replicas=0
      # delete the pvc for etcd-0, etcd-1, etcd-2
      kubectl scale sts <etcd-sts> --replicas=1
      # restore the backup data

Additionally, ensure that you follow the proper shutdown sequence in the future: stop the Milvus components first, and then stop the etcd pods. This allows Milvus to properly persist its final state to etcd before etcd is shut down [2][3].
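
A minimal sketch of that shutdown order, assuming a Helm-managed cluster whose Milvus Deployments carry the standard `app.kubernetes.io/instance` label (`my-release` is a placeholder, not taken from this issue):

      # stop the Milvus components first so they can flush their final state to etcd
      kubectl -n milvus scale deployment -l app.kubernetes.io/instance=my-release --replicas=0
      # only then stop the etcd StatefulSet
      kubectl -n milvus scale sts etcd-milvusetcdupgrade1 --replicas=0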

To continue talking to Dosu, mention @dosu.



@dosubot dosubot bot added area/deployment any issues that related to deployment component/etcd severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Mar 11, 2025
@xiaofan-luan
Collaborator

"level":"warn","ts":"2025-03-11T12:17:08.160747Z","caller":"etcdserver/server.go:1148","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2025-03-11T12:17:08.160777Z","caller":"etcdserver/server.go:1149","msg":"data-dir used by this member must be removed"}
{"level":"warn","ts":"2025-03-11T12:17:08.160801Z","caller":"etcdserver/server.go:2139","msg":"failed to publish local member to cluster through raft","local-member-id":"6453044468ed4c6f","local-member-attributes":"{Name:etcd-milvusetcdupgrade1-2 ClientURLs:[http://etcd-milvusetcdupgrade1-2.etcd-milvusetcdupgrade1-headless.milvus.svc.cluster.local:2379]}","request-path":"/0/members/6453044468ed4c6f/attributes","publish-timeout":"10s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2025-03-11T12:17:08.160824Z","caller":"etcdserver/server.go:2129","msg":"stopped publish because server is stopped","local-member-id":"6453044468ed4c6f","local-member-attributes":"{Name:etcd-milvusetcdupgrade1-2 ClientURLs:[http://etcd-milvusetcdupgrade1-2.etcd-milvusetcdupgrade1-headless.milvus.svc.cluster.local:2379]}","publish-timeout":"10s","error":"etcdserver: server stopped"}

@xiaofan-luan
Collaborator

@aashishsingh2803
Can you make sure the etcd data dir is still there?
My guess is that the etcd data dir has been moved or removed for some reason.
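
One way to check, assuming the Bitnami data path from the comment above and the pod names from the attached logs:

      # list the member dir on each etcd pod; an empty or missing dir means the data was wiped
      for i in 0 1 2 3 4; do
        kubectl -n milvus exec etcd-milvusetcdupgrade1-$i -- ls -la /bitnami/etcd/data/member || echo "pod $i: data dir missing or pod not running"
      done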

@yanliang567
Contributor

/assign @aashishsingh2803
/unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2025