
Server load-aware testing #96

Open
thameem-abbas opened this issue Feb 17, 2025 · 0 comments


When testing a vLLM server (single instance or a fixed number of instances, no autoscaling), we run increasing concurrency levels until we reach the performance peak. More often than not, that peak coincides with KV cache depletion.

For example, when benchmarking a new GPU, we typically set the concurrency levels to powers of 2:

concurrency: [1,2,4,8,16,32,64,128,256,512]

In a scenario where we are already close to KV cache depletion (say 95% utilization) at concurrency 256, any higher concurrency level will cause preemption and severely degrade the vLLM server's performance. Those runs waste test time and can safely be skipped.

If KV cache depletion happens at an earlier concurrency level, the time savings grow accordingly, since every remaining level in the sweep can be skipped.

While some models will be limited by other parameters first, a smart early backoff that skips these unnecessary tests can save us meaningful test time.
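A minimal sketch of what this backoff could look like, assuming the test harness can reach the vLLM server's Prometheus `/metrics` endpoint and read its `vllm:gpu_cache_usage_perc` gauge. The endpoint URL, the 95% threshold, and the `run_benchmark` helper are illustrative assumptions, not anything that exists in this repo today:

```python
# Sketch: concurrency sweep with early backoff on KV cache depletion.
# Assumes a vLLM server exposing Prometheus metrics at /metrics, including the
# gauge `vllm:gpu_cache_usage_perc` (0.0 - 1.0). Threshold and URL are examples.
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"   # assumed vLLM metrics endpoint
KV_CACHE_BACKOFF_THRESHOLD = 0.95               # stop the sweep past 95% usage


def kv_cache_usage(url: str = METRICS_URL) -> float:
    """Return the GPU KV cache utilization currently reported by the server."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode()
    # Prometheus text format, e.g.: vllm:gpu_cache_usage_perc{...} 0.93
    match = re.search(
        r"^vllm:gpu_cache_usage_perc(?:\{[^}]*\})?\s+(\S+)",
        text,
        flags=re.MULTILINE,
    )
    return float(match.group(1)) if match else 0.0


def run_benchmark(concurrency: int) -> None:
    # Hypothetical stand-in for the tool's existing per-concurrency test run.
    print(f"running benchmark at concurrency {concurrency}")


def run_sweep(concurrencies: list[int]) -> None:
    """Run each concurrency level, stopping once KV cache usage crosses the threshold."""
    for concurrency in concurrencies:
        run_benchmark(concurrency)
        usage = kv_cache_usage()
        if usage >= KV_CACHE_BACKOFF_THRESHOLD:
            print(
                f"KV cache at {usage:.0%} after concurrency {concurrency}; "
                f"skipping the remaining levels."
            )
            break


if __name__ == "__main__":
    run_sweep([1, 2, 4, 8, 16, 32, 64, 128, 256, 512])
```

This samples cache usage after each level finishes; polling during the run (or reading preemption counters) would catch depletion earlier, but the sweep-level break is enough to show the intent.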
