
Server load-aware testing #96

Open
thameem-abbas opened this issue Feb 17, 2025 · 0 comments


When testing a vLLM server (single instance or a fixed number of instances, no autoscaling), we run increasing concurrency levels until we reach the performance peak. More often than not, that peak coincides with KV cache depletion.

For example, when benchmarking a new GPU, we typically set the concurrency levels to powers of 2:

concurrency: [1,2,4,8,16,32,64,128,256,512]

In a scenario where we are already close to KV cache depletion (say 95% utilization) at concurrency 256, any higher concurrency level will cause preemption and severely degrade the vLLM server's performance. Those runs waste test time and can safely be skipped.

If KV cache depletion happens at an earlier concurrency level, the time savings grow accordingly, since every remaining level in the sweep can be skipped.

While some models will be limited by other parameters first, a smart early backoff that skips these unnecessary tests can save us meaningful test time.
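A minimal sketch of what this backoff could look like, assuming the test harness can reach the vLLM server's Prometheus `/metrics` endpoint and read its `vllm:gpu_cache_usage_perc` gauge. The endpoint URL, the 95% threshold, and the `run_benchmark` helper are illustrative assumptions, not anything that exists in this repo today:

```python
# Sketch: concurrency sweep with early backoff on KV cache depletion.
# Assumes a vLLM server exposing Prometheus metrics at /metrics, including the
# gauge `vllm:gpu_cache_usage_perc` (0.0 - 1.0). Threshold and URL are examples.
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"   # assumed vLLM metrics endpoint
KV_CACHE_BACKOFF_THRESHOLD = 0.95               # stop the sweep past 95% usage


def kv_cache_usage(url: str = METRICS_URL) -> float:
    """Return the GPU KV cache utilization currently reported by the server."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode()
    # Prometheus text format, e.g.: vllm:gpu_cache_usage_perc{...} 0.93
    match = re.search(
        r"^vllm:gpu_cache_usage_perc(?:\{[^}]*\})?\s+(\S+)",
        text,
        flags=re.MULTILINE,
    )
    return float(match.group(1)) if match else 0.0


def run_benchmark(concurrency: int) -> None:
    # Hypothetical stand-in for the tool's existing per-concurrency test run.
    print(f"running benchmark at concurrency {concurrency}")


def run_sweep(concurrencies: list[int]) -> None:
    """Run each concurrency level, stopping once KV cache usage crosses the threshold."""
    for concurrency in concurrencies:
        run_benchmark(concurrency)
        usage = kv_cache_usage()
        if usage >= KV_CACHE_BACKOFF_THRESHOLD:
            print(
                f"KV cache at {usage:.0%} after concurrency {concurrency}; "
                f"skipping the remaining levels."
            )
            break


if __name__ == "__main__":
    run_sweep([1, 2, 4, 8, 16, 32, 64, 128, 256, 512])
```

This samples cache usage after each level finishes; polling during the run (or reading preemption counters) would catch depletion earlier, but the sweep-level break is enough to show the intent.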
