[Bug]: After Milvus recovers from etcd pod failure chaos, the querynode got crash during verification test #37765

Open
1 task done
zhuwenxing opened this issue Nov 18, 2024 · 6 comments
Assignees
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20241116-f7c7ac51-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-11-16T08:59:16.807Z] [2024-11-16 08:53:14 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ['int64 > 0', ['int64'], None, 180], kwargs: {'partition_name': '_default'} (api_request.py:62)

[2024-11-16T08:59:16.807Z] [2024-11-16 08:53:35 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=503, message=failed to query: node offline[node=16]: channel not available[channel=by-dev-rootcoord-dml_5_453966585423805532v0])>, <Time:{'RPC start': '2024-11-16 08:53:14.099525', 'RPC error': '2024-11-16 08:53:35.115113'}> (decorators.py:140)

[2024-11-16T08:59:16.807Z] [2024-11-16 08:53:35 - ERROR - ci_test]: Traceback (most recent call last):

[2024-11-16T08:59:16.807Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-11-16T08:59:16.807Z]     res = func(*args, **_kwargs)

[2024-11-16T08:59:16.807Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-11-16T08:59:16.807Z]     return func(*arg, **kwargs)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 1076, in query

[2024-11-16T08:59:16.807Z]     return conn.query(

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler

[2024-11-16T08:59:16.807Z]     raise e from e

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 137, in handler

[2024-11-16T08:59:16.807Z]     return func(*args, **kwargs)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 176, in handler

[2024-11-16T08:59:16.807Z]     return func(self, *args, **kwargs)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 116, in handler

[2024-11-16T08:59:16.807Z]     raise e from e

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler

[2024-11-16T08:59:16.807Z]     return func(*args, **kwargs)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1542, in query

[2024-11-16T08:59:16.807Z]     check_status(response.status)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 63, in check_status

[2024-11-16T08:59:16.807Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-11-16T08:59:16.807Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to query: node offline[node=16]: channel not available[channel=by-dev-rootcoord-dml_5_453966585423805532v0])>

[2024-11-16T08:59:16.807Z]  (api_request.py:45)

[2024-11-16T08:59:16.807Z] [2024-11-16 08:53:35 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to query: node offline[node=16]: channel not available[channel=by-dev-rootcoord-dml_5_453966585423805532v0])> (api_request.py:46)

[2024-11-16T08:51:23.677Z] [2024-11-16 08:50:56 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=503, message=failed to search: service internal error: target version mismatch, collection: 453966585423805532, channel: by-dev-rootcoord-dml_6_453966585423805532v1,  current target version: 1731746590603413177, leader version: 0: channel not available[channel=by-dev-rootcoord-dml_6_453966585423805532v1])>, <Time:{'RPC start': '2024-11-16 08:50:35.438513', 'RPC error': '2024-11-16 08:50:56.459598'}> (decorators.py:140)

[2024-11-16T08:51:23.677Z] [2024-11-16 08:50:56 - ERROR - ci_test]: Traceback (most recent call last):

[2024-11-16T08:51:23.677Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-11-16T08:51:23.677Z]     res = func(*args, **_kwargs)

[2024-11-16T08:51:23.677Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-11-16T08:51:23.677Z]     return func(*arg, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 801, in search

[2024-11-16T08:51:23.677Z]     resp = conn.search(

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler

[2024-11-16T08:51:23.677Z]     raise e from e

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 137, in handler

[2024-11-16T08:51:23.677Z]     return func(*args, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 176, in handler

[2024-11-16T08:51:23.677Z]     return func(self, *args, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 116, in handler

[2024-11-16T08:51:23.677Z]     raise e from e

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler

[2024-11-16T08:51:23.677Z]     return func(*args, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 806, in search

[2024-11-16T08:51:23.677Z]     return self._execute_search(request, timeout, round_decimal=round_decimal, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 747, in _execute_search

[2024-11-16T08:51:23.677Z]     raise e from e

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 736, in _execute_search

[2024-11-16T08:51:23.677Z]     check_status(response.status)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 63, in check_status

[2024-11-16T08:51:23.677Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-11-16T08:51:23.677Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: service internal error: target version mismatch, collection: 453966585423805532, channel: by-dev-rootcoord-dml_6_453966585423805532v1,  current target version: 1731746590603413177, leader version: 0: channel not available[channel=by-dev-rootcoord-dml_6_453966585423805532v1])>

[2024-11-16T08:51:23.677Z]  (api_request.py:45)

pod info at 2024-11-16T08:49:59.699Z, before verification


[2024-11-16T08:49:59.698Z] + kubectl get pods -o wide

[2024-11-16T08:49:59.699Z] + grep etcd-pod-failure-18570

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-0                                          1/1     Running            2 (8m47s ago)      32m     10.104.24.80    4am-node29   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-1                                          1/1     Running            2 (8m47s ago)      32m     10.104.15.206   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-2                                          1/1     Running            2 (8m47s ago)      32m     10.104.16.128   4am-node21   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-kafka-0                                    2/2     Running            1 (31m ago)        32m     10.104.15.207   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-kafka-1                                    2/2     Running            1 (31m ago)        32m     10.104.24.83    4am-node29   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-kafka-2                                    2/2     Running            1 (31m ago)        32m     10.104.16.131   4am-node21   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-kafka-exporter-65d959849b-f4g8c            1/1     Running            4 (31m ago)        32m     10.104.15.194   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-datanode-7986f669c-659qd            1/1     Running            8 (10m ago)        32m     10.104.13.193   4am-node16   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-datanode-7986f669c-dknzg            1/1     Running            8 (10m ago)        32m     10.104.20.205   4am-node22   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-pzspb          1/1     Running            8 (10m ago)        32m     10.104.15.195   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-vl22l          1/1     Running            8 (11m ago)        32m     10.104.30.67    4am-node38   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-zl2q6          1/1     Running            8 (11m ago)        32m     10.104.24.76    4am-node29   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-mixcoord-666bfbdf6b-mmw59           1/1     Running            8 (11m ago)        32m     10.104.9.69     4am-node14   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-proxy-868d8d5966-mkqcm              1/1     Running            8 (10m ago)        32m     10.104.20.204   4am-node22   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-9vv7t           1/1     Running            8 (11m ago)        32m     10.104.30.68    4am-node38   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv           0/1     CrashLoopBackOff   8 (4m39s ago)      32m     10.104.19.35    4am-node28   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-xxxqw           1/1     Running            8 (11m ago)        32m     10.104.9.70     4am-node14   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-minio-0                                    1/1     Running            0                  32m     10.104.24.81    4am-node29   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-minio-1                                    1/1     Running            0                  32m     10.104.34.243   4am-node37   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-minio-2                                    1/1     Running            0                  32m     10.104.15.208   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-minio-3                                    1/1     Running            0                  32m     10.104.16.132   4am-node21   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-zookeeper-0                                1/1     Running            0                  32m     10.104.15.203   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-zookeeper-1                                1/1     Running            0                  32m     10.104.16.124   4am-node21   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-zookeeper-2                                1/1     Running            0                  32m     10.104.32.237   4am-node39   <none>           <none>

pod info after verification

+ kubectl get pods -o wide
 + grep etcd-pod-failure-18570
 etcd-pod-failure-18570-0                                          1/1     Running            2 (18m ago)        41m     10.104.24.80    4am-node29   <none>           <none>
 etcd-pod-failure-18570-1                                          1/1     Running            2 (18m ago)        41m     10.104.15.206   4am-node20   <none>           <none>
 etcd-pod-failure-18570-2                                          1/1     Running            2 (18m ago)        41m     10.104.16.128   4am-node21   <none>           <none>
 etcd-pod-failure-18570-kafka-0                                    2/2     Running            1 (41m ago)        41m     10.104.15.207   4am-node20   <none>           <none>
 etcd-pod-failure-18570-kafka-1                                    2/2     Running            1 (41m ago)        41m     10.104.24.83    4am-node29   <none>           <none>
 etcd-pod-failure-18570-kafka-2                                    2/2     Running            1 (41m ago)        41m     10.104.16.131   4am-node21   <none>           <none>
 etcd-pod-failure-18570-kafka-exporter-65d959849b-f4g8c            1/1     Running            4 (41m ago)        41m     10.104.15.194   4am-node20   <none>           <none>
 etcd-pod-failure-18570-milvus-datanode-7986f669c-659qd            1/1     Running            8 (20m ago)        41m     10.104.13.193   4am-node16   <none>           <none>
 etcd-pod-failure-18570-milvus-datanode-7986f669c-dknzg            1/1     Running            8 (20m ago)        41m     10.104.20.205   4am-node22   <none>           <none>
 etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-pzspb          1/1     Running            8 (20m ago)        41m     10.104.15.195   4am-node20   <none>           <none>
 etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-vl22l          1/1     Running            8 (20m ago)        41m     10.104.30.67    4am-node38   <none>           <none>
 etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-zl2q6          1/1     Running            8 (20m ago)        41m     10.104.24.76    4am-node29   <none>           <none>
 etcd-pod-failure-18570-milvus-mixcoord-666bfbdf6b-mmw59           1/1     Running            8 (20m ago)        41m     10.104.9.69     4am-node14   <none>           <none>
 etcd-pod-failure-18570-milvus-proxy-868d8d5966-mkqcm              1/1     Running            8 (20m ago)        41m     10.104.20.204   4am-node22   <none>           <none>
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-9vv7t           1/1     Running            8 (20m ago)        41m     10.104.30.68    4am-node38   <none>           <none>
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv           1/1     Running            9 (14m ago)        41m     10.104.19.35    4am-node28   <none>           <none>
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-xxxqw           1/1     Running            8 (20m ago)        41m     10.104.9.70     4am-node14   <none>           <none>
 etcd-pod-failure-18570-minio-0                                    1/1     Running            0                  41m     10.104.24.81    4am-node29   <none>           <none>
 etcd-pod-failure-18570-minio-1                                    1/1     Running            0                  41m     10.104.34.243   4am-node37   <none>           <none>
 etcd-pod-failure-18570-minio-2                                    1/1     Running            0                  41m     10.104.15.208   4am-node20   <none>           <none>
 etcd-pod-failure-18570-minio-3                                    1/1     Running            0                  41m     10.104.16.132   4am-node21   <none>           <none>
 etcd-pod-failure-18570-zookeeper-0                                1/1     Running            0                  41m     10.104.15.203   4am-node20   <none>           <none>
 etcd-pod-failure-18570-zookeeper-1                                1/1     Running            0                  41m     10.104.16.124   4am-node21   <none>           <none>
 etcd-pod-failure-18570-zookeeper-2                                1/1     Running            0                  41m     10.104.32.237   4am-node39   <none>           <none>

Two issues exist with the querynode here:

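  1. Why did one querynode (etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv, still in CrashLoopBackOff) not return to normal after the etcd pod failure chaos was eliminated?
  2. Why did that querynode's restart count increase from 8 to 9 after the verification test, and what caused this restart?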
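
The two pod listings above can also be compared mechanically instead of by eye. The following is only a minimal sketch, assuming the `kubectl get pods -o wide` column order shown in this issue (NAME, READY, STATUS, RESTARTS, ...) and an optional Jenkins timestamp prefix on each line. The helper names `parse_pods` and `restart_changes` are hypothetical and are not part of the Milvus test suite; the sample rows are copied from the listings above with the trailing columns trimmed.

```python
import re

# One row of `kubectl get pods -o wide`, optionally prefixed by a Jenkins
# timestamp such as "[2024-11-16T08:49:59.955Z] ".
# Assumed column order: NAME READY STATUS RESTARTS ...
ROW = re.compile(
    r"^\s*(?:\[[^\]]+\]\s+)?"
    r"(?P<name>\S+)\s+(?P<ready>\d+/\d+)\s+(?P<status>\S+)\s+(?P<restarts>\d+)\b"
)


def parse_pods(listing):
    """Map pod name -> (status, restart count); non-pod lines are skipped."""
    pods = {}
    for line in listing.splitlines():
        m = ROW.match(line)
        if m:
            pods[m["name"]] = (m["status"], int(m["restarts"]))
    return pods


def restart_changes(before, after):
    """Return {pod: (restarts_before, restarts_after)} for pods whose count changed."""
    b, a = parse_pods(before), parse_pods(after)
    return {
        name: (b[name][1], a[name][1])
        for name in b
        if name in a and a[name][1] != b[name][1]
    }


BEFORE = """\
[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-9vv7t   1/1   Running            8 (11m ago)     32m
[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv   0/1   CrashLoopBackOff   8 (4m39s ago)   32m
"""

AFTER = """\
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-9vv7t   1/1   Running   8 (20m ago)   41m
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv   1/1   Running   9 (14m ago)   41m
"""

print(restart_changes(BEFORE, AFTER))
# {'etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv': (8, 9)}
```

Run against the full listings, a check like this would flag only the `ltbrv` querynode, whose restart count moved from 8 to 9 between the two snapshots.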

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/18570/pipeline

log:
artifacts-etcd-pod-failure-18570-server-logs.tar.gz

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 18, 2024
@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Nov 18, 2024
@zhuwenxing zhuwenxing added this to the 2.5.0 milestone Nov 18, 2024
@yanliang567
Contributor

@liliu-z please help take a look.
This is the first time we have seen this error: service internal error: target version mismatch

/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 18, 2024
@congqixia
Contributor

This error was introduced in a recent PR:

if lview.TargetVersion <= 0 {
	err := merr.WrapErrServiceInternal(fmt.Sprintf("target version mismatch, collection: %d, channel: %s, current target version: %v, leader version: %v",
		lview.GetCollection(), lview.GetChannel(), currentTargetVersion, lview.TargetVersion))
	view.UnServiceableError = err
	// make dist handler pull next distribution until all delegator is serviceable
	dh.lastUpdateTs = 0
	collectionsToSync.Insert(lview.Collection)
	log.Info("leader is not available due to target version not ready",
		zap.Int64("collectionID", view.CollectionID),
		zap.Int64("nodeID", view.ID),
		zap.String("channel", view.Channel),
		zap.Error(err))
}

I will check with the author offline.

@liliu-z
Member

liliu-z commented Nov 18, 2024

/assign @congqixia

@weiliu1031
Contributor

should be fixed by #37748

@weiliu1031
Contributor

/assign

@weiliu1031
Contributor

/assign @zhuwenxing
