[Core][v1] Unify allocating slots in prefill and decode in KV cache manager #12608

ShawnD200 · 2025-01-31T12:48:00Z

As mentioned in RFC #12254, this PR combines allocate_slots and append_slots into a single method.

There should be no functional change, except that decode now also raises an exception when num_tokens is zero (as prefill already does). The unit test case is changed accordingly.
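To make the intended behavior concrete, here is a minimal sketch of a single allocation entry point shared by prefill and decode. It is not the code from this PR: `ToyKVCacheManager`, its fields, and the block-counting logic are hypothetical simplifications, and only the rule that `num_tokens == 0` raises comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyKVCacheManager:
    """Hypothetical, simplified stand-in for a KV cache manager."""
    block_size: int
    num_free_blocks: int
    # request id -> number of blocks currently held by that request
    req_to_num_blocks: Dict[str, int] = field(default_factory=dict)

    def allocate_slots(self, request_id: str, num_computed_tokens: int,
                       num_tokens: int) -> Optional[int]:
        """Reserve slots for num_tokens new tokens, in prefill or decode.

        Returns the number of newly allocated blocks, or None when the
        pool cannot satisfy the request.
        """
        # Same guard for both phases: zero new tokens is a caller error.
        if num_tokens == 0:
            raise ValueError("num_tokens must be greater than 0")

        total_tokens = num_computed_tokens + num_tokens
        num_required = -(-total_tokens // self.block_size)  # ceiling division
        num_held = self.req_to_num_blocks.get(request_id, 0)
        num_new = max(num_required - num_held, 0)
        if num_new > self.num_free_blocks:
            return None
        self.num_free_blocks -= num_new
        self.req_to_num_blocks[request_id] = num_held + num_new
        return num_new
```

In this sketch a prefill call (`num_computed_tokens=0`) and a decode call (`num_tokens=1`) go through the same path; decode simply allocates zero new blocks most of the time.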

@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo

Signed-off-by: Shawn Du <shawnd200@outlook.com>

github-actions · 2025-01-31T12:48:12Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

comaniac

I prefer the unified approach, but have concerns about the untouch:

It marks the untouched blocks as recently used, which is not right.
It introduces some overhead with long contexts, because you touch a long sequence of blocks and then untouch them.
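Both concerns can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not vLLM's block pool: `ToyBlockPool`, `touch`, and `untouch` are hypothetical names, and the free queue is modeled as an `OrderedDict` whose front is evicted first.

```python
from collections import OrderedDict


class ToyBlockPool:
    """Hypothetical pool: cached blocks with ref count 0 wait in an LRU queue."""

    def __init__(self, block_ids):
        # The front of the queue is evicted first (least recently used).
        self.free_queue = OrderedDict((b, None) for b in block_ids)
        self.ref_cnt = {b: 0 for b in block_ids}

    def touch(self, block_ids):
        """Take a reference on each block, pulling it out of the free queue."""
        for b in block_ids:
            if self.ref_cnt[b] == 0:
                del self.free_queue[b]
            self.ref_cnt[b] += 1

    def untouch(self, block_ids):
        """Undo touch(); released blocks re-enter at the tail of the queue."""
        for b in block_ids:
            self.ref_cnt[b] -= 1
            if self.ref_cnt[b] == 0:
                self.free_queue[b] = None


pool = ToyBlockPool([0, 1, 2, 3, 4])
pool.touch([0, 1, 2])    # speculative touch of the computed blocks
pool.untouch([0, 1, 2])  # allocation failed, so roll back
eviction_order = list(pool.free_queue)
```

After the rollback, blocks 0 to 2 sit behind blocks 3 and 4 in the eviction order even though the request never used them, and each of the two passes walks the full list of computed blocks, which grows with context length.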

ShawnD200 · 2025-02-01T04:23:33Z

Thank you so much for your review.

I prefer the unified approach, but have concerns about the untouch:

It marks the untouched blocks as recently used, which is not right.

Strictly speaking, yes, but it probably does no harm to mark these blocks, because they are evidently the blocks most in demand.

It introduces some overhead with long contexts, because you touch a long sequence of blocks and then untouch them.

Right. There are pros and cons to this speculative touch-and-untouch-if-we-have-to approach. The upside is that we don't need to calculate num_evictable_computed_blocks on every call, for short and long sequences alike. Since a successful allocation is arguably the 'normal' case, the net outcome could still be positive.

Please feel free to choose either one. The untouch is reverted in the last commit.

Thanks.
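The alternative discussed in this exchange, checking capacity before touching anything, can be sketched as follows. This is a hedged illustration only: `can_allocate` and its parameters are hypothetical, and the only name taken from the discussion is `num_evictable_computed_blocks`.

```python
def can_allocate(num_new_blocks, num_free_blocks, computed_block_ref_cnts):
    """Check capacity before touching any block.

    computed_block_ref_cnts holds the reference count of each cached
    (already computed) block that the request is about to reuse.
    """
    # Cached blocks with no references still sit in the free pool, so
    # reusing them shrinks the pool without providing any new block.
    num_evictable_computed_blocks = sum(
        1 for ref_cnt in computed_block_ref_cnts if ref_cnt == 0)
    return num_new_blocks <= num_free_blocks - num_evictable_computed_blocks
```

The trade-off is the one described above: this count is paid on every call, whereas the speculative approach pays only when an allocation fails and has to be rolled back.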

comaniac

Otherwise LGTM. It's pretty clean now.

comaniac · 2025-02-01T04:52:09Z

vllm/v1/core/kv_cache_manager.py

-            # Get new blocks from the free block pool considering
-            # preallocated blocks.


Why remove this comment?

I made the entire change starting from allocate_tokens, not from append_tokens as the git diff suggests. These comments originally sat a few lines above (in a slightly different form). They belong at this location, so I moved them here.

ShawnD200 · 2025-02-01T09:22:02Z

Otherwise LGTM. It's pretty clean now.

Thank you so much for your time.

comaniac · 2025-02-02T00:08:05Z

@WoosukKwon PTAL if you get a chance. I will merge the PR once CI passes (or force-merge if needed).

ShawnD200 added 5 commits January 29, 2025 11:50

Add _untouch() to reverse touch() if not enough blocks

cf0cac2

Signed-off-by: Shawn Du <shawnd200@outlook.com>

Combine allocate_slots and append_slots

5cedce5

Signed-off-by: Shawn Du <shawnd200@outlook.com>

Delete append_slots

a482f5d

Signed-off-by: Shawn Du <shawnd200@outlook.com>

Modify test case in prefix caching

23772e9

Assume that num_tokens is never zero in either prefill or decode; previously only prefill assumed this.

Signed-off-by: Shawn Du <shawnd200@outlook.com>

Address static checkers

075f1b5

Signed-off-by: Shawn Du <shawnd200@outlook.com>

ShawnD200 requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners January 31, 2025 12:48

comaniac self-assigned this Jan 31, 2025

comaniac requested changes Jan 31, 2025

ShawnD200 added 2 commits February 1, 2025 12:19

Address reviewer comments

c7ef003

Signed-off-by: Shawn Du <shawnd200@outlook.com>

Remove _untouch

8b2172a

Signed-off-by: Shawn Du <shawnd200@outlook.com>

comaniac approved these changes Feb 1, 2025

mergify bot added the v1 label Feb 1, 2025

Address comments

b425676

Signed-off-by: Shawn Du <shawnd200@outlook.com>

comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 1, 2025

DarkLight1337 merged commit f8ece6e into vllm-project:main Feb 2, 2025
55 of 57 checks passed

ShawnD200 deleted the unify-prefill-and-decode branch February 2, 2025 12:44
