[GPU] Use sdpa-micro kernel for prefill processing in PagedAttention #29137

Open
wants to merge 2 commits into
base: master
from

Conversation

sshlyapn
Contributor

Details:

  • Use the micro-SDPA kernel for first-token (prefill) calculation in PagedAttention. The micro-SDPA kernel does not yet support mixing prefill and generate stages or partial prefill calculation.
  • Add causal mask support to the micro-SDPA kernel
  • Update kernel_data's internalBufferSizes structure to store information about buffer host accessibility
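
For readers unfamiliar with the term, a causal mask in scaled dot-product attention can be sketched roughly as below. This is a plain CPU illustration and not the micro-SDPA kernel from this PR; `apply_causal_mask` and `softmax_rows` are hypothetical helper names, and the sketch assumes the query and key lengths are equal, which corresponds to a prefill with no previously cached tokens.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Hypothetical helper: set scores[i][j] to -inf for j > i so that
// query position i cannot attend to later key positions.
// Assumes a square score matrix (query length == key length).
inline void apply_causal_mask(Matrix& scores) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < scores.size(); ++i) {
        for (std::size_t j = i + 1; j < scores[i].size(); ++j) {
            scores[i][j] = neg_inf;
        }
    }
}

// Hypothetical helper: numerically stable row-wise softmax.
// Masked (-inf) entries end up with a weight of exactly 0.
inline void softmax_rows(Matrix& scores) {
    for (auto& row : scores) {
        if (row.empty()) continue;
        float row_max = row[0];
        for (float v : row) row_max = std::max(row_max, v);
        float sum = 0.0f;
        for (float& v : row) {
            v = std::exp(v - row_max);
            sum += v;
        }
        for (float& v : row) v /= sum;
    }
}
```

With the mask applied before the softmax, each row of attention weights only covers the current and earlier positions.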

@sshlyapn sshlyapn added the category: GPU OpenVINO GPU plugin label Feb 24, 2025
@sshlyapn sshlyapn added this to the 2025.1 milestone Feb 24, 2025
@sshlyapn sshlyapn requested review from a team as code owners February 24, 2025 14:34
@sshlyapn sshlyapn force-pushed the paged_attention_micro_sdpa_prefill branch 3 times, most recently from e6b2b07 to c19414b Compare February 25, 2025 06:55
@sshlyapn sshlyapn force-pushed the paged_attention_micro_sdpa_prefill branch from c19414b to 8638537 Compare February 25, 2025 07:34
}

std::vector<layout> get_internal_buffer_layouts_impl() const override {
std::vector<kernel_selector::InternalBuffer> get_internal_buffers_desc() const {

I'd say that an extended buffer descriptor could be returned from get_internal_buffer_layouts_impl and used in primitive_inst directly. IMO, the current solution with separate methods to query layouts and lockable buffers doesn't look good. Please consider changing that in the future.
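
One possible shape for the suggestion above, as a minimal sketch only: a single descriptor carries both the size and the host-accessibility flag, so the consumer needs one query instead of two. `BufferDesc`, `AllocKind`, `pick_alloc_kind`, and `plan_allocations` are hypothetical names introduced for illustration; they are not the actual `kernel_selector::InternalBuffer` definition or any OpenVINO API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical descriptor (illustration only): size and host accessibility
// travel together instead of being queried through separate methods.
struct BufferDesc {
    std::size_t byte_count;
    bool lockable;  // true if the host must be able to map the buffer
};

// Hypothetical allocation categories a consumer might choose between.
enum class AllocKind { DeviceOnly, HostAccessible };

// Pick the allocation category for one descriptor.
inline AllocKind pick_alloc_kind(const BufferDesc& desc) {
    return desc.lockable ? AllocKind::HostAccessible : AllocKind::DeviceOnly;
}

// Derive an allocation plan for all internal buffers from a single query result.
inline std::vector<AllocKind> plan_allocations(const std::vector<BufferDesc>& descs) {
    std::vector<AllocKind> plan;
    plan.reserve(descs.size());
    for (const auto& desc : descs) {
        plan.push_back(pick_alloc_kind(desc));
    }
    return plan;
}
```

Keeping both fields in one record also removes the risk of the two query results drifting out of sync in length or order.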


auto& pa_inst = reinterpret_cast<paged_attention_inst&>(inst);
pa_inst.query_block_size = get_query_block_size(PagedAttentionStage::PREFILL);
pa_inst.use_micro_sdpa = use_micro_sdpa;

Can we avoid having implementation-specific attributes in primitive_inst? Maybe the code from paged_attention_inst::on_execute() can be moved to the primitive impl itself?
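
A rough sketch of the alternative being asked about, under heavy assumptions: the kernel-specific flags live on the implementation object, and the instance keeps only implementation-agnostic state. `InstanceSketch`, `ImplSketch`, and `blocks_to_process` are hypothetical stand-ins invented for this illustration; they do not mirror the real `primitive_inst` or primitive impl classes, and the block-count arithmetic is only a placeholder for whatever `on_execute()` actually does.

```cpp
#include <cstddef>

// Hypothetical stand-in for the instance: implementation-agnostic state only.
struct InstanceSketch {
    std::size_t prompt_len = 0;
};

// Hypothetical stand-in for the implementation: owns the kernel-specific
// settings, so the instance never needs to store them.
class ImplSketch {
public:
    ImplSketch(bool use_micro_sdpa, std::size_t query_block_size)
        : use_micro_sdpa_(use_micro_sdpa), query_block_size_(query_block_size) {}

    bool uses_micro_sdpa() const { return use_micro_sdpa_; }

    // Placeholder for execution-time work that depends on the kernel choice:
    // the number of query blocks, rounded up.
    std::size_t blocks_to_process(const InstanceSketch& inst) const {
        if (query_block_size_ == 0) return 0;
        return (inst.prompt_len + query_block_size_ - 1) / query_block_size_;
    }

private:
    bool use_micro_sdpa_;
    std::size_t query_block_size_;
};
```

In this arrangement, swapping the implementation changes no fields on the instance type.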

Labels
category: GPU OpenVINO GPU plugin
3 participants