FEAT: Xavier: Share KV cache between VLLM replicas #2732
Conversation
LGTM
A corner case:
While a block is being transferred, it may be evicted or replaced by a new one. It's better to verify a block hash during transfers: if the block has been evicted or the hash doesn't match, we can simply treat it as a cache miss.
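For illustration, a minimal sketch of the hash-checked lookup described above; all names here are hypothetical and not the PR's actual implementation:

```python
import hashlib
from typing import Optional

def block_hash(payload: bytes) -> str:
    # Content hash recorded when the KV block is first cached.
    return hashlib.sha256(payload).hexdigest()

def fetch_remote_block(store: dict[str, bytes], wanted_hash: str) -> Optional[bytes]:
    """Fetch a KV block from a peer replica by its content hash.

    Returns None (a cache miss) when the block was evicted, or when the
    stored bytes no longer match the hash (the slot was reused by a new block).
    """
    payload = store.get(wanted_hash)
    if payload is None:
        return None  # evicted between lookup and transfer -> cache miss
    if block_hash(payload) != wanted_hash:
        return None  # slot replaced by a new block mid-transfer -> cache miss
    return payload
```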
How can we reproduce the corner case?
We can add mock logic to reproduce it. For example, call evict or modify (to simulate block replacement) on the model block while that block is being queried.
OK, how about opening a new issue to track this?
Let me open an issue.
Xavier: Share KV cache between VLLM replicas
Naming
The name is derived from Professor X (Charles Francis Xavier) of the Marvel Comics X-Men series. The project name starts with "X," and, like Professor X, whose powerful mind controls information, the name metaphorically refers to the project managing data scheduling in vllm.
Purpose
When vllm runs with multiple replicas, long prompts incur a lengthy prefill. If another replica has already computed the results, they can be transferred and reused directly instead of being recomputed.
Usage
Simply add the parameter `enable_xavier=True` when starting the vllm model.
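A hedged sketch using the xinference Python client; it assumes extra engine kwargs such as `enable_xavier` are forwarded to the vllm engine, and the model name and endpoint are placeholders:

```python
from xinference.client import Client

# Placeholder endpoint; use the actual IP address (see Limitations below).
client = Client("http://192.168.xx.xx:9997")

model_uid = client.launch_model(
    model_name="qwen2.5-instruct",  # placeholder model
    model_engine="vllm",
    enable_xavier=True,             # the flag this PR adds
)
```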
Test
Use this script to generate a long prompt for the LLM (about 9k+ prompt tokens):
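The original generator script is not reproduced in this excerpt; a minimal stand-in that yields a prompt of comparable length could look like this (the exact token count depends on the tokenizer):

```python
# Repeat a short passage until the prompt is roughly 9k tokens long.
PASSAGE = (
    "The quick brown fox jumps over the lazy dog. "
    "Pack my box with five dozen liquor jugs. "
)
# ~20 tokens per repetition for typical BPE tokenizers, so ~450 repetitions
# lands near 9k prompt tokens.
LONG_PROMPT = PASSAGE * 450

# Two different questions appended to the same long prefix.
q1 = "Question 1: summarize the text above in one sentence."
q2 = "Question 2: how many distinct sentences does the text contain?"
```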
Use `LONG_PROMPT+q1` and `LONG_PROMPT+q2` as prompts to interact with the model, one query at a time, as in the sketch below.
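A sketch of the measurement loop, reusing the placeholder `client`, `model_uid`, `LONG_PROMPT`, `q1`, and `q2` from the sketches above, and assuming an OpenAI-style chat interface; the timings will vary by hardware:

```python
import time

model = client.get_model(model_uid)

for question in (q1, q2):
    start = time.perf_counter()
    model.chat(
        messages=[{"role": "user", "content": LONG_PROMPT + question}],
    )
    # End-to-end latency for this query, including any KV cache transfer.
    print(f"E2E time: {time.perf_counter() - start:.2f} s")
```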
Test Results:
First query (without cache, computed from scratch) E2E time, `LONG_PROMPT+q1`: ~2.96 s
Second query (with KV cache transfer) E2E time, `LONG_PROMPT+q2`: ~1.33 s

Limitations
The feature is built on vllm's `enable_prefix_caching`.
The vllm version needs to be >= 0.6.5.
Xavier cannot use the `0.0.0.0` address, so when starting xinference, you need to use the actual IP address, for example: `xinference-local -H 192.168.xx.xx`.