[Bug]: Etcd Pods Entering CrashLoopBackOff After Restart, Leading to Milvus Cluster Degradation #40575

Open
1 task done
aashishsingh2803 opened this issue Mar 11, 2025 · 4 comments
Assignees
Labels
area/deployment any issues related to deployment component/etcd kind/bug Issues or changes related to a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@aashishsingh2803

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.5.5
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

While testing the reliability of the Milvus cluster, I forcefully terminated all five etcd pods at the same time. After the termination, new pods were automatically recreated; however, most of them entered a continuous CrashLoopBackOff state, resulting in cluster degradation.

Upon investigating the etcd cluster state, we found that four out of five nodes were no longer part of the etcd cluster, causing significant instability.

Please investigate this issue and provide guidance on resolving it. If additional details are required, let us know.

Expected Behavior

No response

Steps To Reproduce

1. Kill all etcd pods at the same time using k9s (Ctrl + K).
2. Wait for new etcd pods to be created.
3. Observe that most of the new pods enter a CrashLoopBackOff state.
4. Check etcd cluster health (`etcdctl endpoint status --cluster`, `etcdctl member list`).
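
For reference, a minimal health-check sequence, assuming the pod names from the attached logs and the `milvus` namespace visible in the client URLs (adjust to your deployment):

      # run etcdctl from inside one of the etcd pods; --cluster queries every member
      kubectl -n milvus exec etcd-milvusetcdupgrade1-0 -- etcdctl endpoint status --cluster -w table
      kubectl -n milvus exec etcd-milvusetcdupgrade1-0 -- etcdctl member list -w table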

Milvus Log

Attached are the logs of the etcd cluster:
etcd-milvusetcdupgrade1-0.log
etcd-milvusetcdupgrade1-1.log
etcd-milvusetcdupgrade1-2.log
etcd-milvusetcdupgrade1-3.log
etcd-milvusetcdupgrade1-4.log

Anything else?

No response

@aashishsingh2803 aashishsingh2803 added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 11, 2025

dosubot bot commented Mar 11, 2025

@aashishsingh2803 Hello there! I'm here to assist you with any bugs, questions, or contribution-related matters. Let's work together to resolve the issue you're facing.

A similar issue was previously reported, where etcd pods entered a CrashLoopBackOff state after a forced restart, leading to Milvus cluster degradation. Unfortunately, no solution was found in the comments of that issue [1].

To resolve this issue, you can follow these steps:

  1. If multiple pods keep crashing while etcd-0 is still running:

    • Scale down the etcd stateful set to one replica:
      kubectl scale sts <etcd-sts> --replicas=1
    • Delete the persistent volume claims (PVC) for etcd-1 and etcd-2.
    • Scale the etcd stateful set back to three replicas:
      kubectl scale sts <etcd-sts> --replicas=3
  2. If all pods crash (a consolidated sketch of this path follows these steps):

    • Copy the /bitnami/etcd/data/member/snap/db file.
    • Use https://github.com/etcd-io/bbolt to modify the database data.
    • Back up the data in the key bucket.
    • Run the following commands:
      kubectl scale sts <etcd-sts> --replicas=0
      # delete the pvc for etcd-0, etcd-1, etcd-2
      kubectl scale sts <etcd-sts> --replicas=1
      # restore the backup data

Additionally, ensure that you follow the proper shutdown sequence in the future: stop the Milvus components first, and then stop the etcd pods. This allows Milvus to properly persist its final state to etcd before etcd is shut down [2][3].
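
A minimal sketch of that shutdown order, assuming a Helm-managed cluster whose Milvus Deployments carry the standard `app.kubernetes.io/instance` label (`my-release` is a placeholder, not taken from this issue):

      # stop the Milvus components first so they can flush their final state to etcd
      kubectl -n milvus scale deployment -l app.kubernetes.io/instance=my-release --replicas=0
      # only then stop the etcd StatefulSet
      kubectl -n milvus scale sts etcd-milvusetcdupgrade1 --replicas=0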

To continue talking to Dosu, mention @dosu.



@dosubot dosubot bot added area/deployment any issues that related to deployment component/etcd severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Mar 11, 2025
@xiaofan-luan
Collaborator

"level":"warn","ts":"2025-03-11T12:17:08.160747Z","caller":"etcdserver/server.go:1148","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2025-03-11T12:17:08.160777Z","caller":"etcdserver/server.go:1149","msg":"data-dir used by this member must be removed"}
{"level":"warn","ts":"2025-03-11T12:17:08.160801Z","caller":"etcdserver/server.go:2139","msg":"failed to publish local member to cluster through raft","local-member-id":"6453044468ed4c6f","local-member-attributes":"{Name:etcd-milvusetcdupgrade1-2 ClientURLs:[http://etcd-milvusetcdupgrade1-2.etcd-milvusetcdupgrade1-headless.milvus.svc.cluster.local:2379]}","request-path":"/0/members/6453044468ed4c6f/attributes","publish-timeout":"10s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2025-03-11T12:17:08.160824Z","caller":"etcdserver/server.go:2129","msg":"stopped publish because server is stopped","local-member-id":"6453044468ed4c6f","local-member-attributes":"{Name:etcd-milvusetcdupgrade1-2 ClientURLs:[http://etcd-milvusetcdupgrade1-2.etcd-milvusetcdupgrade1-headless.milvus.svc.cluster.local:2379]}","publish-timeout":"10s","error":"etcdserver: server stopped"}

@xiaofan-luan
Collaborator

@aashishsingh2803
Can you make sure the etcd data dir is still there?
My guess is that the etcd data dir has been moved or removed for some reason.
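
One way to check, assuming the Bitnami data path from the comment above and the pod names from the attached logs:

      # list the member dir on each etcd pod; an empty or missing dir means the data was wiped
      for i in 0 1 2 3 4; do
        kubectl -n milvus exec etcd-milvusetcdupgrade1-$i -- ls -la /bitnami/etcd/data/member || echo "pod $i: data dir missing or pod not running"
      done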

@yanliang567
Contributor

/assign @aashishsingh2803
/unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2025