-
-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
[Core][v1] Unify allocating slots in prefill and decode in KV cache manager #12608
[Core][v1] Unify allocating slots in prefill and decode in KV cache manager #12608
Conversation
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Assume in both prefill and decode, num_tokens should not be zero, previously only prefill assumed this. Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the unify approach, but have concerns about the untouch:
- It mark the untouched blocks as recent used, which is not right.
- It introduces some overheads with long context, because you touch a long sequence of blocks and untouch them.
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Thank you so much for your review.
Strictly yes, but perhaps it does not hurt to mark these blocks because they are existentially most wanted blocks.
Right. There is pros and cons to this speculative touch-and-untouch-if-have-to approach. The positive side is that we don't need to calculate the num_evictable_computed_blocks (short and long) every time, which is arguably the 'normal' case, so the net outcome could still be positive. Please feel free to choose either one, the untouch is reverted in the last commit. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM. It's pretty clean now.
vllm/v1/core/kv_cache_manager.py
Outdated
# Get new blocks from the free block pool considering | ||
# preallocated blocks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the entire change based on allocate_tokens, not as git diff suggested on append_tokens. These comments were actually a few lines above (slightly different), but they should be moved to this location, and I moved them.
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Thank you so much for your time. |
@WoosukKwon PTAL if you got a chance. Will merge the PR once CI is passed (or force merge if needed). |
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com> Signed-off-by: Isotr0py <2037008807@qq.com>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com> Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com> Signed-off-by: Felix Marty <felmarty@amd.com>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com> Signed-off-by: Linkun Chen <github@lkchen.net>
…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com> Signed-off-by: saeediy <saidakbarp@gmail.com>
As mentioned in RFC #12254, this PR achieves the task: combine allocate_slots and append_slots.
There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly.
@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo