
[Bug]: [laion-1b] When there are tens of thousands of small segs, qn are oomkilled during target update #40615

Open
1 task done
ThreadDao opened this issue Mar 12, 2025 · 3 comments
Assignees
weiliu1031
Labels
area/performance (Performance issues)
kind/bug (Issues or changes related a bug)
severity/critical (Critical, lead to crash, data missing, wrong result, function totally doesn't work.)
triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Milestone
2.5.7

Comments

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: cardinal-milvus-io-2.5-32c00dbc1b-20250226
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server

    mixCoord:
      replicas: 1
      resources:
        limits:
          cpu: "4" 
          memory: 16Gi
        requests:
          cpu: "4" 
          memory: 16Gi
    queryNode:
      replicas: 5
      resources:
        limits:
          cpu: "32"
          memory: 128Gi
        requests:
          cpu: "16"
          memory: 64Gi

client

  • concurrent requests: (screenshot omitted)

argo link

  • A large number of flush requests against a collection with a partition key (64 partitions) generated about 100k segments. Over the next two days, background tasks such as stats, index, and compaction were still running. During this period there were no new client requests, yet the querynodes were OOM-killed while the target was being updated (a minimal client-side sketch of this workload pattern is given below).

(metrics screenshots omitted)

metrics link of laion1b-test-2
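
For context, here is a minimal client-side sketch of the workload pattern described above, assuming a pymilvus client, a partition-key collection with 64 partitions, and frequent explicit flushes. The collection name, field names, endpoint, dimension, batch sizes, and loop counts are illustrative placeholders, not the actual fouram test code.

# Hypothetical reproduction sketch: many small insert + flush cycles against
# a 64-partition partition-key collection, which is what pushes the sealed
# segment count into the tens of thousands.
import random

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint

DIM = 768  # assumed embedding dimension for the laion workload

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("key", DataType.INT64, is_partition_key=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
]
# num_partitions is honored when the schema contains a partition-key field.
collection = Collection("laion1b_repro", CollectionSchema(fields), num_partitions=64)

# Each explicit flush seals the current growing segments, so thousands of
# small batches each followed by a flush produce tens of thousands of small
# sealed segments spread across the 64 partition-key groups.
for _ in range(10_000):
    keys = [random.randint(0, 63) for _ in range(100)]
    vectors = [[random.random()] * DIM for _ in range(100)]
    collection.insert([keys, vectors])
    collection.flush()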

Expected Behavior

No response

Steps To Reproduce

Milvus Log

client log: /test/fouram/log/2025_03_10/laion1b-test-cron-1741615200-1_99682
server pods:

laion1b-test-2-etcd-0                                          1/1     Running     0                22d     10.104.25.47    4am-node30   <none>           <none>
laion1b-test-2-etcd-1                                          1/1     Running     0                22d     10.104.30.6     4am-node38   <none>           <none>
laion1b-test-2-etcd-2                                          1/1     Running     0                33h     10.104.34.27    4am-node37   <none>           <none>
laion1b-test-2-milvus-datanode-84b6f8bd58-c6b5d                1/1     Running     0                14d     10.104.32.186   4am-node39   <none>           <none>
laion1b-test-2-milvus-indexnode-5b6bf98b5d-479cp               1/1     Running     0                13d     10.104.15.145   4am-node20   <none>           <none>
laion1b-test-2-milvus-indexnode-5b6bf98b5d-hnlr5               1/1     Running     0                13d     10.104.33.149   4am-node36   <none>           <none>
laion1b-test-2-milvus-indexnode-5b6bf98b5d-l5pxg               1/1     Running     0                13d     10.104.20.12    4am-node22   <none>           <none>
laion1b-test-2-milvus-mixcoord-5d9c9bdb78-8n2qr                1/1     Running     0                14d     10.104.15.49    4am-node20   <none>           <none>
laion1b-test-2-milvus-proxy-68cccf9c7d-n88kv                   1/1     Running     0                14d     10.104.14.79    4am-node18   <none>           <none>
laion1b-test-2-milvus-querynode-1-8665544654-2sbvr             1/1     Running     24 (7h19m ago)   13d     10.104.17.186   4am-node23   <none>           <none>
laion1b-test-2-milvus-querynode-1-8665544654-2zjmv             1/1     Running     22 (33h ago)     13d     10.104.18.65    4am-node25   <none>           <none>
laion1b-test-2-milvus-querynode-1-8665544654-5l89j             1/1     Running     1 (38m ago)      10d     10.104.32.199   4am-node39   <none>           <none>
laion1b-test-2-milvus-querynode-1-8665544654-ts5rb             1/1     Running     19 (8h ago)      13d     10.104.33.198   4am-node36   <none>           <none>
laion1b-test-2-milvus-querynode-1-8665544654-wr6fd             1/1     Running     1 (7h19m ago)    3d12h   10.104.25.142   4am-node30   <none>           <none>
laion1b-test-2-pulsar-bookie-0                                 1/1     Running     0                50d     10.104.26.92    4am-node32   <none>           <none>
laion1b-test-2-pulsar-bookie-1                                 1/1     Running     0                50d     10.104.17.236   4am-node23   <none>           <none>
laion1b-test-2-pulsar-bookie-2                                 1/1     Running     0                50d     10.104.15.222   4am-node20   <none>           <none>
laion1b-test-2-pulsar-broker-0                                 1/1     Running     0                50d     10.104.9.171    4am-node14   <none>           <none>
laion1b-test-2-pulsar-proxy-0                                  1/1     Running     0                50d     10.104.14.92    4am-node18   <none>           <none>
laion1b-test-2-pulsar-zookeeper-0                              1/1     Running     0                50d     10.104.26.88    4am-node32   <none>           <none>
laion1b-test-2-pulsar-zookeeper-1                              1/1     Running     0                50d     10.104.32.220   4am-node39   <none>           <none>
laion1b-test-2-pulsar-zookeeper-2                              1/1     Running     0                50d     10.104.33.145   4am-node36   <none>           <none>

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2025
@ThreadDao ThreadDao added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Mar 12, 2025
@ThreadDao ThreadDao added this to the 2.5.7 milestone Mar 12, 2025
@dosubot dosubot bot added the area/performance Performance issues label Mar 12, 2025

dosubot bot commented Mar 12, 2025

A similar issue was previously reported where query nodes were out-of-memory (OOM) killed due to inaccurate memory prediction based on container limits. However, no solution was found in the comments of that issue [1].

In your case, the OOM events during target updates could be related to the large number of small segments, since every segment adds metadata that the query nodes must track, increasing memory usage. Milvus manages segments through a SegmentManager whose policies decide when growing segments are sealed, and compaction later merges small sealed segments into larger ones to optimize performance and resource utilization [2].

To address the OOM issue, consider the following:

  1. Segment Management: Review and adjust the segment sealing and compaction settings so that small segments are consolidated more effectively (an illustrative configuration sketch follows this list). This can help reduce the overhead and memory usage associated with managing a large number of small segments.

  2. Resource Configuration: Ensure that the memory limits and requests for query nodes are appropriately configured. The default memory limits and requests for query nodes can be adjusted in the configuration file to better match the workload demands [3].

  3. Compaction and Indexing: Monitor and manage background tasks like compaction and indexing, as they can contribute to high memory usage during target updates [4].

These steps may help mitigate the OOM issues you're experiencing during target updates.
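
To make suggestions 1 and 3 concrete, below is an illustrative milvus.yaml fragment with the dataCoord knobs that govern segment size, sealing, and auto-compaction. The key names follow the Milvus configuration layout, but the values shown are placeholders; verify them against the milvus.yaml shipped with your build before changing anything.

dataCoord:
  segment:
    maxSize: 1024            # max sealed segment size in MB; larger values mean fewer, bigger segments
    sealProportion: 0.12     # seal a growing segment once it reaches this fraction of maxSize
    smallProportion: 0.5     # segments below this fraction of maxSize count as "small" for compaction
  enableCompaction: true
  compaction:
    enableAutoCompaction: true   # let background compaction merge small sealed segments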


@ThreadDao
Contributor Author

  • search failed
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:575 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:589 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:105 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:223 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).SearchSegments
/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:193 github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).SearchSegments
/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator.go:325 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).search.func3
/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator.go:701 github.com/milvus-io/milvus/internal/querynodev2/delegator.executeSubTasks[...].func1
/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-amd64/src/runtime/asm_amd64.s:1695 runtime.goexit: rpc error: code = Unknown desc = node not match[expectedNodeID=3831][actualNodeID=3839] (run.go:323:func1)
[2025-03-12 11:24:05.312][   ERROR] - fail to search on QueryNode 3834: stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/v2/tracer.StackTrace

@yanliang567
Contributor

/assign @weiliu1031

/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2025