
[Core][v1] Unify allocating slots in prefill and decode in KV cache manager #12608

Merged

8 commits merged into vllm-project:main from unify-prefill-and-decode on Feb 2, 2025

Conversation

ShawnD200
Contributor

@ShawnD200 ShawnD200 commented Jan 31, 2025

As mentioned in RFC #12254, this PR achieves the task: combine allocate_slots and append_slots.

There should be no functional change, except that decode now also raises an exception when num_tokens is zero (as prefill already did), and the unit test case is changed accordingly.

@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo
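For context, the unified entry point can be sketched roughly as below. This is a hypothetical, simplified sketch, not vLLM's actual KVCacheManager: the class layout, Request fields, and free-list handling are illustrative assumptions. The only behavior taken from the PR description is that prefill and decode share one allocation method, which rejects num_tokens == 0.

```python
# Hypothetical sketch of a unified allocate_slots(); names and structure are
# illustrative assumptions, not vLLM's actual KVCacheManager.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_computed_tokens: int = 0  # tokens whose KV cache is already filled


class KVCacheManager:
    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.req_to_blocks: dict[str, list[int]] = {}

    def allocate_slots(self, request: Request, num_tokens: int) -> list[int] | None:
        """One path for prefill and decode; decode simply needs 0+ new blocks."""
        if num_tokens == 0:
            # Unified behavior from this PR: decode now raises too, like prefill.
            raise ValueError("num_tokens must be greater than 0")
        blocks = self.req_to_blocks.setdefault(request.request_id, [])
        total_tokens = request.num_computed_tokens + num_tokens
        # Ceil-divide to get the total block count, then subtract what we hold.
        num_needed = -(-total_tokens // self.block_size) - len(blocks)
        if num_needed > len(self.free_blocks):
            return None  # cannot allocate; the scheduler may preempt or evict
        new_blocks = [self.free_blocks.pop() for _ in range(num_needed)]
        blocks.extend(new_blocks)
        return new_blocks
```

A prefill call typically returns several new blocks, while a decode step often returns an empty list because the last block still has room; both go through the same zero-token check.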

Signed-off-by: Shawn Du <shawnd200@outlook.com>

Assume that num_tokens is nonzero in both prefill and decode; previously only prefill assumed this.

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@comaniac comaniac self-assigned this Jan 31, 2025
Collaborator

@comaniac comaniac left a comment


I prefer the unified approach, but I have concerns about the untouch:

  1. It marks the untouched blocks as recently used, which is not right.
  2. It introduces some overhead with long contexts, because you touch a long sequence of blocks and then untouch them.
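To illustrate the first concern, here is a minimal, hypothetical sketch (not vLLM's actual free-block queue; the class and method names are made up) of how a speculative touch followed by an untouch perturbs LRU order: the untouched block is re-queued at the recently-used end even though it was never actually used.

```python
# Hypothetical LRU free-block queue; illustrative only, not vLLM's code.
from collections import OrderedDict


class LRUFreeQueue:
    def __init__(self, block_ids):
        # Ordered oldest -> newest; eviction pops from the front.
        self.queue = OrderedDict((b, None) for b in block_ids)

    def touch(self, block_id):
        # Speculatively claim a cached block: remove it from the free queue.
        self.queue.pop(block_id)

    def untouch(self, block_id):
        # Return the block. Appending puts it at the "recently used" end,
        # even though it was never actually used.
        self.queue[block_id] = None

    def evict(self):
        return self.queue.popitem(last=False)[0]


q = LRUFreeQueue([1, 2, 3])
q.touch(1)    # block 1 was the oldest
q.untouch(1)  # now block 1 sits at the newest end
# Eviction order is now 2, 3, 1 instead of 1, 2, 3.
```

A recency-preserving untouch would have to reinsert the block at its original queue position, which a plain append cannot do.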

Signed-off-by: Shawn Du <shawnd200@outlook.com>
@ShawnD200
Contributor Author

ShawnD200 commented Feb 1, 2025

Thank you so much for your review.

I prefer the unified approach, but I have concerns about the untouch:

  1. It marks the untouched blocks as recently used, which is not right.

Strictly speaking, yes, but perhaps it does not hurt to mark these blocks, because they are by definition the most wanted blocks.

  2. It introduces some overhead with long contexts, because you touch a long sequence of blocks and then untouch them.

Right, there are pros and cons to this speculative touch-and-untouch-if-you-have-to approach. The upside is that we don't need to calculate num_evictable_computed_blocks (whether short or long) every time; not needing it is arguably the 'normal' case, so the net outcome could still be positive.
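The trade-off can be sketched as follows. This is a hypothetical illustration with made-up names (Block, ref_cnt, a list-based free queue), not vLLM's real implementation: the eager variant pays a counting scan over the computed blocks on every allocation, while the speculative variant folds that work into the touch pass it must do anyway on the success path, paying the untouch rollback only in the rare infeasible case.

```python
# Hypothetical sketch of the two strategies; illustrative only, not vLLM code.


class Block:
    def __init__(self) -> None:
        self.ref_cnt = 0  # 0 means evictable: the block sits in the free queue


def can_allocate_eager(free_queue: list, computed_blocks: list, num_needed: int) -> bool:
    # Count evictable computed blocks up front: claiming them would shrink
    # the free queue, so subtract them before the feasibility check.
    num_evictable = sum(1 for b in computed_blocks if b.ref_cnt == 0)
    return len(free_queue) - num_evictable >= num_needed


def can_allocate_speculative(free_queue: list, computed_blocks: list, num_needed: int) -> bool:
    # Touch: claim every computed block (needed anyway on the success path),
    # pulling the evictable ones out of the free queue as a side effect.
    for b in computed_blocks:
        if b.ref_cnt == 0:
            free_queue.remove(b)
        b.ref_cnt += 1
    if len(free_queue) >= num_needed:
        return True  # common case: no separate counting pass was needed
    # Rare case: not enough blocks after all, so roll back (untouch).
    for b in computed_blocks:
        b.ref_cnt -= 1
        if b.ref_cnt == 0:
            free_queue.append(b)
    return False
```

On success the speculative variant intentionally leaves the computed blocks claimed, which is what the allocation path wants anyway; only the failure path needs the untouch rollback.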

Please feel free to choose either one; the untouch is reverted in the last commit.

Thanks.

Collaborator

@comaniac comaniac left a comment


Otherwise LGTM. It's pretty clean now.

Comment on lines 144 to 145
# Get new blocks from the free block pool considering
# preallocated blocks.
Collaborator


Why remove this comment?

Contributor Author


I made the entire change based on allocate_slots, not on append_slots as the git diff suggests. These comments were actually a few lines above (slightly different), but they belong at this location, so I moved them.

@mergify mergify bot added the v1 label Feb 1, 2025
Signed-off-by: Shawn Du <shawnd200@outlook.com>
@ShawnD200
Contributor Author

Otherwise LGTM. It's pretty clean now.

Thank you so much for your time.

@comaniac comaniac added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 1, 2025
@comaniac
Copy link
Collaborator

comaniac commented Feb 2, 2025

@WoosukKwon PTAL if you get a chance. Will merge the PR once CI passes (or force-merge if needed).

@DarkLight1337 DarkLight1337 merged commit f8ece6e into vllm-project:main Feb 2, 2025
55 of 57 checks passed
@ShawnD200 ShawnD200 deleted the unify-prefill-and-decode branch February 2, 2025 12:44
Isotr0py pushed a commit to Isotr0py/vllm that referenced this pull request Feb 2, 2025
youngkent pushed a commit to youngkent/vllm that referenced this pull request Feb 3, 2025
srikanthsrnvs pushed a commit to srikanthsrnvs/vllm that referenced this pull request Feb 3, 2025
sahelib25 pushed a commit to krai/vllm that referenced this pull request Feb 3, 2025
fxmarty-amd pushed a commit to fxmarty-amd/vllm that referenced this pull request Feb 7, 2025
NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Feb 7, 2025
ShangmingCai pushed a commit to ShangmingCai/vllm that referenced this pull request Feb 10, 2025
GWS0428 pushed a commit to GWS0428/VARserve that referenced this pull request Feb 12, 2025
panf2333 pushed a commit to yottalabsai/vllm that referenced this pull request Feb 18, 2025
kerthcet pushed a commit to kerthcet/vllm that referenced this pull request Feb 21, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Mar 5, 2025
Said-Akbar pushed a commit to Said-Akbar/vllm-rocm that referenced this pull request Mar 7, 2025
Labels
ready ONLY add when PR is ready to merge/full CI is needed v1
3 participants