[Bug]: [streaming] The streaming node was oomkilled when inserting into a collection with partition-key and multi shards #40592

Open
ThreadDao opened this issue Mar 12, 2025 · 3 comments

Labels
- area/performance: Performance issues
- feature/streaming node: streaming node feature
- kind/bug: Issues or changes related a bug
- severity/critical: Critical, lead to crash, data missing, wrong result, function totally doesn't work.
- triage/accepted: Indicates an issue or PR is ready to be actively worked on.

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20250311-0698d04f-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server config

    streamingNode:
      replicas: 2
      resources:
        limits:
          cpu: "4" 
          memory: 16Gi
        requests:
          cpu: "2" 
          memory: 8Gi
  config:
    dataCoord:
      enableActiveStandby: true
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    queryCoord:
      enableActiveStandby: true
    rootCoord:
      enableActiveStandby: true
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
      sampleFraction: 1

client

  1. Create a collection with 16 shards, a partition-key field, and 16 partitions:
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'int64_pk_5b', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'varchar_caption', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'varchar_NSFW', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'float64_similarity', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'int64_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'json_1', 'description': '', 'type': <DataType.JSON: 23>}, {'name': 'array_varchar_1', 'description': '', 'type': <DataType.ARRAY: 22>, 'params': {'max_length': 100, 'max_capacity': 10}, 'element_type': <DataType.VARCHAR: 21>}, {'name': 'varchar_md5', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}], 'enable_dynamic_field': True}
  2. Insert sequentially, 50000 entities per batch
  3. Both streaming nodes restarted due to OOM after 6685000 entities had been inserted
[2025-03-11 13:22:12,398 -  INFO - fouram]: [Time] Collection.insert run in 0.5081s (api_request.py:49)
[2025-03-11 13:22:12,401 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_C3dUVVpI): 6061214 (base.py:535)
[2025-03-11 13:22:12,413 -  INFO - fouram]: [Base] Start inserting, ids: 6685000 - 6689999, data size: 10,000,000 (base.py:366)
2025-03-11 13:22:38,590 [ERROR][handler]: RPC error: [batch_insert], <MilvusException: (code=65535, message=code: STREAMING_CODE_TRANSACTION_EXPIRED, cause: not found in manager)>, <Time:{'RPC start': '2025-03-11 13:22:12.495315', 'RPC error': '2025-03-11 13:22:38.590384'}> (decorators.py:140)
[2025-03-11 13:22:38,591 - ERROR - fouram]: (api_response) : [Collection.insert] <MilvusException: (code=65535, message=code: STREAMING_CODE_TRANSACTION_EXPIRED, cause: not found in manager)>, [requestId: d7a53ee2-fe7b-11ef-8c40-4a2e0f5f6cde] (api_request.py:57)
  4. Streaming node (sn) memory usage
    [Image]
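
For anyone trying to reproduce this outside the linked workflow, the client steps above could be sketched roughly as follows. This is not the original fouram test code. The schema is trimmed to three of the reported fields, and the collection name `oom_repro`, the URI, the batch count, and the random partition-key values are assumptions made for illustration.

```python
import random

DIM = 768            # vector dimension from the reported schema
BATCH_SIZE = 50_000  # entities per batch, as stated in the report
SHARDS = 16          # shard count from the report
PARTITIONS = 16      # partition count from the report


def vector_bytes_per_batch(batch_size=BATCH_SIZE, dim=DIM):
    """Raw float32 vector payload of one batch, in bytes."""
    return batch_size * dim * 4


def make_batch(start_id, batch_size=BATCH_SIZE, dim=DIM):
    """Build column-ordered data: ids, vectors, partition-key values."""
    ids = list(range(start_id, start_id + batch_size))
    vectors = [[random.random() for _ in range(dim)] for _ in range(batch_size)]
    # Assumption: the real key distribution is unknown, so use random integers.
    keys = [random.randrange(1_000_000) for _ in range(batch_size)]
    return [ids, vectors, keys]


def run(uri="http://localhost:19530", name="oom_repro", batches=10):
    """Create the collection and insert batches sequentially (needs a live cluster)."""
    # Imported here so the helpers above work without pymilvus installed.
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

    connections.connect(uri=uri)
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=DIM),
        FieldSchema("int64_pk_5b", DataType.INT64, is_partition_key=True),
    ])
    coll = Collection(name, schema, shards_num=SHARDS, num_partitions=PARTITIONS)
    for i in range(batches):
        coll.insert(make_batch(i * BATCH_SIZE))
```

Calling `run()` against a test cluster starts the insert loop. `vector_bytes_per_batch()` gives a rough lower bound on the data each batch pushes through the streaming nodes: about 146 MiB of raw vectors, before the varchar, JSON, and array fields of the full schema are counted.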

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/11ba0047-9bd9-452b-912f-18600b6cfc51?nodeId=zong-sn-stable-1-2497234528

Milvus Log

pods:

zong-sn-op-7-5639-etcd-0                                          1/1     Running     0              13h     10.104.24.247   4am-node29   <none>           <none>
zong-sn-op-7-5639-etcd-1                                          1/1     Running     0              13h     10.104.34.247   4am-node37   <none>           <none>
zong-sn-op-7-5639-etcd-2                                          1/1     Running     0              13h     10.104.27.137   4am-node31   <none>           <none>
zong-sn-op-7-5639-milvus-datanode-565c8c4bc5-dznww                1/1     Running     0              13h     10.104.6.38     4am-node13   <none>           <none>
zong-sn-op-7-5639-milvus-indexnode-78ddbb9df5-h9g5r               1/1     Running     0              13h     10.104.19.218   4am-node28   <none>           <none>
zong-sn-op-7-5639-milvus-indexnode-78ddbb9df5-z6dzn               1/1     Running     0              13h     10.104.30.132   4am-node38   <none>           <none>
zong-sn-op-7-5639-milvus-mixcoord-5c479c8849-8vbm4                1/1     Running     0              13h     10.104.16.183   4am-node21   <none>           <none>
zong-sn-op-7-5639-milvus-proxy-769bbd7c8-lfqj5                    1/1     Running     0              13h     10.104.27.146   4am-node31   <none>           <none>
zong-sn-op-7-5639-milvus-querynode-0-644d4d878c-6864m             1/1     Running     0              13h     10.104.24.6     4am-node29   <none>           <none>
zong-sn-op-7-5639-milvus-querynode-0-644d4d878c-djjqk             1/1     Running     0              13h     10.104.20.52    4am-node22   <none>           <none>
zong-sn-op-7-5639-milvus-querynode-0-644d4d878c-qvxnf             1/1     Running     0              13h     10.104.17.166   4am-node23   <none>           <none>
zong-sn-op-7-5639-milvus-querynode-0-644d4d878c-rmds4             1/1     Running     0              13h     10.104.6.40     4am-node13   <none>           <none>
zong-sn-op-7-5639-milvus-streamingnode-7b96b88748-fgbmr           1/1     Running     1 (13h ago)    13h     10.104.17.165   4am-node23   <none>           <none>
zong-sn-op-7-5639-milvus-streamingnode-7b96b88748-kfvrd           1/1     Running     1 (13h ago)    13h     10.104.24.5     4am-node29   <none>           <none>
zong-sn-op-7-5639-minio-0                                         1/1     Running     0              13h     10.104.24.248   4am-node29   <none>           <none>
zong-sn-op-7-5639-minio-1                                         1/1     Running     0              13h     10.104.23.38    4am-node27   <none>           <none>
zong-sn-op-7-5639-minio-2                                         1/1     Running     0              13h     10.104.34.248   4am-node37   <none>           <none>
zong-sn-op-7-5639-minio-3                                         1/1     Running     0              13h     10.104.27.138   4am-node31   <none>           <none>
zong-sn-op-7-5639-pulsar-bookie-0                                 1/1     Running     0              13h     10.104.19.210   4am-node28   <none>           <none>
zong-sn-op-7-5639-pulsar-bookie-1                                 1/1     Running     0              13h     10.104.24.251   4am-node29   <none>           <none>
zong-sn-op-7-5639-pulsar-bookie-2                                 1/1     Running     0              13h     10.104.16.178   4am-node21   <none>           <none>
zong-sn-op-7-5639-pulsar-bookie-init-cmxdg                        0/1     Completed   0              13h     10.104.21.234   4am-node24   <none>           <none>
zong-sn-op-7-5639-pulsar-broker-0                                 1/1     Running     0              13h     10.104.30.125   4am-node38   <none>           <none>
zong-sn-op-7-5639-pulsar-proxy-0                                  1/1     Running     0              13h     10.104.27.134   4am-node31   <none>           <none>
zong-sn-op-7-5639-pulsar-pulsar-init-hkwqb                        0/1     Completed   0              13h     10.104.34.243   4am-node37   <none>           <none>
zong-sn-op-7-5639-pulsar-recovery-0                               1/1     Running     0              13h     10.104.9.223    4am-node14   <none>           <none>
zong-sn-op-7-5639-pulsar-zookeeper-0                              1/1     Running     0              13h     10.104.34.249   4am-node37   <none>           <none>
zong-sn-op-7-5639-pulsar-zookeeper-1                              1/1     Running     0              13h     10.104.21.243   4am-node24   <none>           <none>
zong-sn-op-7-5639-pulsar-zookeeper-2                              1/1     Running     0              13h     10.104.16.180   4am-node21   <none>           <none>
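
The restarts are easy to miss in a listing of this size. A small sketch for pulling out the restarted pods is shown below. It assumes the usual `kubectl get pods -o wide` column order (NAME, READY, STATUS, RESTARTS, ...), since the header row is not included in the paste above.

```python
def restarted_pods(listing):
    """Return {pod_name: restart_count} for pods with at least one restart."""
    result = {}
    for line in listing.splitlines():
        parts = line.split()
        # Skip blank lines and any header row (its fourth column is not a number).
        if len(parts) < 4 or not parts[3].isdigit():
            continue
        restarts = int(parts[3])
        if restarts > 0:
            result[parts[0]] = restarts
    return result


# Three rows copied from the listing above.
sample = """\
zong-sn-op-7-5639-milvus-proxy-769bbd7c8-lfqj5                    1/1     Running     0              13h     10.104.27.146   4am-node31   <none>           <none>
zong-sn-op-7-5639-milvus-streamingnode-7b96b88748-fgbmr           1/1     Running     1 (13h ago)    13h     10.104.17.165   4am-node23   <none>           <none>
zong-sn-op-7-5639-milvus-streamingnode-7b96b88748-kfvrd           1/1     Running     1 (13h ago)    13h     10.104.24.5     4am-node29   <none>           <none>
"""

print(restarted_pods(sample))
```

On the full listing, only the two streamingnode pods have a non-zero restart count.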

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2025
@ThreadDao ThreadDao added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Mar 12, 2025
@ThreadDao ThreadDao added this to the 2.6.0 milestone Mar 12, 2025
@chyezh
Contributor

chyezh commented Mar 12, 2025

may be fixed by #40555

@dosubot dosubot bot added area/performance Performance issues feature/streaming node streaming node feature labels Mar 12, 2025
@yanliang567 yanliang567 removed their assignment Mar 12, 2025
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2025
@chyezh
Contributor

chyezh commented Mar 12, 2025

After the pchannel-level flusher enhancement (#39275), the sync manager is started at the pchannel level.
As a result, the sync policy cannot see the whole view of the write buffers on the same streaming node.
Should be fixed by #40606.
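
To illustrate why a per-pchannel view can let memory grow unchecked, here is a toy model. It is not Milvus code: `WriteBuffer`, `buffers_to_sync`, the 4 GiB budget, and the buffer sizes are all hypothetical, and the real sync policy may work differently. The only point is that a memory-based policy evaluated per pchannel never sees the node-wide total.

```python
from dataclasses import dataclass

GiB = 1024 ** 3


@dataclass
class WriteBuffer:
    channel: str
    used_bytes: int


def buffers_to_sync(buffers, limit_bytes):
    """Pick the largest buffers to flush until the visible total fits the limit."""
    total = sum(b.used_bytes for b in buffers)
    picked = []
    for buf in sorted(buffers, key=lambda b: b.used_bytes, reverse=True):
        if total <= limit_bytes:
            break
        picked.append(buf.channel)
        total -= buf.used_bytes
    return picked


# Hypothetical node: 8 pchannels, each holding 1 GiB, with a 4 GiB budget.
limit = 4 * GiB
buffers = [WriteBuffer(f"pchannel-{i}", 1 * GiB) for i in range(8)]

# Per-pchannel view: each policy sees a single 1 GiB buffer, so nothing is flushed.
per_channel = [buffers_to_sync([buf], limit) for buf in buffers]

# Node-level view: the policy sees 8 GiB in total and flushes until it fits.
node_level = buffers_to_sync(buffers, limit)

print(per_channel)
print(node_level)
```

In this model the per-pchannel policies flush nothing even though the node holds twice its budget, while the node-level policy selects four buffers to flush.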

sre-ci-robot pushed a commit that referenced this issue Mar 12, 2025
issue: #40592

Signed-off-by: chyezh <chyezh@outlook.com>
@chyezh
Contributor

chyezh commented Mar 13, 2025

/assign @ThreadDao
Please help verify on master with commit 5735c3ef199f76cfcc1f4161840d27fe2e89e4c0.
