When testing a vLLM server (a single instance or a fixed number of instances, i.e. no autoscaling), we test increasing concurrency levels until we reach the performance peak. More often than not, this peak coincides with KV cache depletion.
For example, when benchmarking a new GPU, we typically set the concurrency levels to powers of 2:
concurrency: [1,2,4,8,16,32,64,128,256,512]
In a scenario where we are close to KV cache depletion (e.g. 95% utilization) at a concurrency of 256, any higher concurrency level will cause preemption and severely degrade the vLLM server's performance. Running those levels wastes test time, so they can safely be skipped.
If KV cache depletion happens at an earlier concurrency level, the time savings scale accordingly: every skipped level is a full benchmark run avoided.
While some models are bottlenecked by other parameters, a smart early backoff that skips the unnecessary tests would save meaningful test time. A sketch of the idea follows.
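To make the proposal concrete, here is a minimal sketch of the early-backoff loop. It assumes the vLLM server exposes its standard Prometheus metrics at `/metrics` (metric names follow vLLM's conventions but may vary by version), and `run_benchmark` is a hypothetical stand-in for the actual benchmark harness; the 0.95 cutoff is an assumed, tunable threshold.

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # assumed vLLM server address
KV_USAGE_CUTOFF = 0.95                         # assumed threshold; tune per deployment


def run_benchmark(concurrency: int) -> None:
    """Hypothetical stand-in for one benchmark run at a given concurrency."""
    ...


def scrape_metric(name: str, url: str = METRICS_URL) -> float:
    """Read a single gauge/counter value from the Prometheus text endpoint."""
    body = urllib.request.urlopen(url).read().decode()
    # Match e.g. `vllm:gpu_cache_usage_perc{model_name="..."} 0.95`
    match = re.search(
        rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)$", body, re.MULTILINE
    )
    return float(match.group(1)) if match else 0.0


for concurrency in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    run_benchmark(concurrency)
    kv_usage = scrape_metric("vllm:gpu_cache_usage_perc")      # gauge, 1.0 = 100%
    preemptions = scrape_metric("vllm:num_preemptions_total")  # cumulative counter
    if kv_usage >= KV_USAGE_CUTOFF or preemptions > 0:
        print(f"KV cache saturated at concurrency={concurrency}; skipping higher levels")
        break
```

One caveat: the usage gauge reflects the cache state at scrape time and may miss the in-run peak, which is why the sketch also checks the cumulative preemption counter; any preemption during the run is a reliable sign that higher concurrencies are not worth testing.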