[GPU] Use sdpa-micro kernel for prefill processing in PagedAttention #29137

sshlyapn · 2025-02-24T14:34:56Z

Details:

Use the sdpa-micro based kernel for the 1st token (prefill) calculation of PagedAttention. The micro-SDPA kernel does not yet support mixing prefill and generate stages or partial prefill calculation.
Added causal mask support to the micro-SDPA kernel
Updated kernel_data's internalBufferSizes structure to store information about each buffer's host accessibility
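
For readers unfamiliar with the causal mask mentioned above, the following is a minimal host-side reference sketch in plain C++. It is not the OpenCL micro-SDPA kernel code from this PR; the function name `apply_causal_mask` and the row-major score layout are assumptions made for illustration. It masks out key positions that lie in the future of each query row before softmax would be applied.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative reference only (hypothetical helper, not part of the plugin).
// `scores` is a row-major [q_len x kv_len] matrix of attention logits.
// Query rows are aligned to the end of the key sequence, so query row i may
// attend to key positions j <= i + (kv_len - q_len). For a full prefill
// q_len == kv_len and this reduces to the usual lower-triangular mask.
void apply_causal_mask(std::vector<float>& scores, std::size_t q_len, std::size_t kv_len) {
    assert(scores.size() == q_len * kv_len);
    assert(kv_len >= q_len);

    const std::size_t offset = kv_len - q_len;
    const float neg_inf = -std::numeric_limits<float>::infinity();

    for (std::size_t i = 0; i < q_len; ++i) {
        // Everything strictly after the last visible key gets -inf, which
        // becomes a zero weight after softmax.
        for (std::size_t j = i + offset + 1; j < kv_len; ++j) {
            scores[i * kv_len + j] = neg_inf;
        }
    }
}
```

Setting masked logits to negative infinity rather than skipping them keeps the softmax step unchanged, which is the usual reason for expressing causality as a mask.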

vladimir-paramuzov · 2025-02-26T05:56:46Z

src/plugins/intel_gpu/src/graph/impls/ocl/paged_attention.cpp

    }

-    std::vector<layout> get_internal_buffer_layouts_impl() const override {
+    std::vector<kernel_selector::InternalBuffer> get_internal_buffers_desc() const {


I'd say that an extended buffer descriptor can be returned from get_internal_buffer_layouts_impl and used in primitive_inst directly. IMO, the current solution with separate methods to query layouts and lockable buffers doesn't look good. Please consider changing that in the future.
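
One way to read this suggestion is sketched below in standalone C++. Everything in the `sketch` namespace is hypothetical: the struct fields, function names, and buffer sizes are assumptions for illustration and do not reflect the actual definition of kernel_selector::InternalBuffer or the primitive_inst API. The point is only that a single query can return both the size and the host-accessibility flag, so the consumer needs no second method.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// Hypothetical extended descriptor: one entry carries both pieces of
// information that would otherwise be queried through separate methods.
struct InternalBufferDesc {
    std::size_t byte_size;  // requested allocation size in bytes
    bool lockable;          // true if the host must be able to map the buffer
};

// Single query point (hypothetical). The sizes are made up for the example.
std::vector<InternalBufferDesc> get_internal_buffer_descs(std::size_t num_blocks) {
    return {
        {num_blocks * sizeof(std::int32_t), true},  // index table filled on the host
        {num_blocks * 64 * sizeof(float), false},   // device-only scratch space
    };
}

// Example consumer: total bytes that must live in host-accessible memory.
std::size_t host_accessible_bytes(const std::vector<InternalBufferDesc>& descs) {
    std::size_t total = 0;
    for (const auto& d : descs) {
        if (d.lockable) {
            total += d.byte_size;
        }
    }
    return total;
}

}  // namespace sketch
```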

vladimir-paramuzov · 2025-02-26T06:00:11Z

src/plugins/intel_gpu/src/graph/impls/ocl/paged_attention.cpp

+
+        auto& pa_inst = reinterpret_cast<paged_attention_inst&>(inst);
+        pa_inst.query_block_size = get_query_block_size(PagedAttentionStage::PREFILL);
+        pa_inst.use_micro_sdpa = use_micro_sdpa;


Can we avoid having implementation-specific attributes in primitive_inst? Maybe the code from paged_attention_inst::on_execute() can be moved to the primitive impl itself?
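
A rough sketch of the direction proposed here, in standalone C++: the implementation object owns `use_micro_sdpa` and `query_block_size` and performs the per-execution preparation itself, so the instance type carries no implementation-specific fields. All types and names below are hypothetical stand-ins, and the specific preparation step (counting query blocks per subsequence) is an assumption chosen for illustration, not a description of what paged_attention_inst::on_execute() actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Stand-in for the primitive instance: only implementation-agnostic data.
struct PagedAttentionInst {
    std::vector<std::size_t> subsequence_lengths;
};

// Stand-in for the primitive implementation: owns its own tuning state.
class PagedAttentionImpl {
public:
    PagedAttentionImpl(bool use_micro_sdpa, std::size_t query_block_size)
        : use_micro_sdpa_(use_micro_sdpa), query_block_size_(query_block_size) {
        assert(query_block_size_ > 0);
    }

    bool uses_micro_sdpa() const { return use_micro_sdpa_; }

    // Hypothetical per-execution preparation done inside the impl:
    // number of query blocks needed to cover every subsequence.
    std::size_t count_query_blocks(const PagedAttentionInst& inst) const {
        std::size_t blocks = 0;
        for (std::size_t len : inst.subsequence_lengths) {
            blocks += (len + query_block_size_ - 1) / query_block_size_;  // ceil division
        }
        return blocks;
    }

private:
    bool use_micro_sdpa_;
    std::size_t query_block_size_;
};

}  // namespace sketch
```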

sshlyapn added the category: GPU OpenVINO GPU plugin label Feb 24, 2025

sshlyapn added this to the 2025.1 milestone Feb 24, 2025

sshlyapn requested review from a team as code owners February 24, 2025 14:34

sshlyapn force-pushed the paged_attention_micro_sdpa_prefill branch 3 times, most recently from e6b2b07 to c19414b, February 25, 2025 06:55

[GPU] Use sdpa-micro kernel for prefill processing in PagedAttention

8638537

sshlyapn force-pushed the paged_attention_micro_sdpa_prefill branch from c19414b to 8638537, February 25, 2025 07:34

vladimir-paramuzov reviewed Feb 26, 2025

Merge branch 'master' into paged_attention_micro_sdpa_prefill

86d3ea6
