[CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor #28304

chenhu-wang · 2025-01-08T06:59:48Z

Details:

Snippets MatMul via block-wise brgemm emitter and executor on aarch64 with TPP
Snippets MatMul via block-wise brgemm emitter and executor on x64 with TPP

Tickets:

CVS-151344

chenhu-wang · 2025-01-15T08:17:21Z

@a-sidorova, could you please review this as well, since you are already reviewing #28229? The snippets MatMul test cases pass on ARM. Thank you!

src/plugins/intel_cpu/src/nodes/subgraph.cpp

src/plugins/intel_cpu/src/transformations/tpp/aarch64/pass/lowered/brgemm_tpp_blocking.cpp

src/plugins/intel_cpu/src/emitters/tpp/aarch64/jit_brgemm_emitter.cpp

src/plugins/intel_cpu/thirdparty/CMakeLists.txt

cmake/features.cmake

src/plugins/intel_cpu/src/emitters/snippets/brgemm_base.cpp

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.cpp

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.hpp

src/plugins/intel_cpu/src/emitters/tpp/aarch64/jit_brgemm_emitter.cpp

src/plugins/intel_cpu/src/emitters/tpp/common/kernel_executors/brgemm.cpp

v-Golubev · 2025-02-14T18:01:59Z

src/plugins/intel_cpu/src/emitters/tpp/common/kernel_executors/brgemm.cpp

+    gemm_p.a.primary = in1;
+    gemm_p.b.primary = in0;


in0 and in1 look mixed up: I'd say that the A input should be in0, not in1. However, the x64 impl has the same situation... Do you have any idea why it is done this way?
@IvanNovoselov or maybe you have some secret TPP knowledge about why we form the runtime args this way? :)

Data is row-major in OV, while MatMul in libxsmm assumes column-major data. Swapping in0 and in1 avoids repacking the data. M/N, lda/ldb and the in0/in1 precisions are also swapped in libxsmm_create_gemm_shape(). @IvanNovoselov, could you confirm this, or correct me if I have misunderstood? Thank you!

Aha, got it, thanks for the explanation! Maybe we could leave an explanatory comment then to avoid potential questions in the future?

comment added!
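For future readers, the swap discussed in this thread can be checked with a minimal, self-contained sketch. It does not use libxsmm: `gemm_col_major` and `matmul_row_major` are hypothetical names introduced only for illustration, and the reference loop stands in for a column-major GEMM kernel. The sketch relies on the fact that a row-major matrix X has the same memory layout as a column-major X^T, and that C^T = in1^T * in0^T, so passing in1 as A and in0 as B with M and N swapped yields row-major C without any repack.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference column-major GEMM: C(m x n) = A(m x k) * B(k x n).
// Element (i, j) of a column-major matrix with leading dimension ld lives at [i + j * ld].
// Hypothetical stand-in for a column-major GEMM kernel, not a libxsmm call.
void gemm_col_major(std::size_t m, std::size_t n, std::size_t k,
                    const float* a, std::size_t lda,
                    const float* b, std::size_t ldb,
                    float* c, std::size_t ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

// Row-major out(M x N) = in0(M x K) * in1(K x N), computed by the column-major
// kernel without repacking: in1 is passed as "A", in0 as "B", and M/N plus the
// leading dimensions are swapped accordingly.
std::vector<float> matmul_row_major(std::size_t M, std::size_t N, std::size_t K,
                                    const std::vector<float>& in0,
                                    const std::vector<float>& in1) {
    assert(in0.size() == M * K && in1.size() == K * N);
    std::vector<float> out(M * N, 0.0f);
    gemm_col_major(/*m=*/N, /*n=*/M, /*k=*/K,
                   /*a=*/in1.data(), /*lda=*/N,
                   /*b=*/in0.data(), /*ldb=*/K,
                   /*c=*/out.data(), /*ldc=*/N);
    return out;
}
```

Calling `matmul_row_major(2, 2, 3, ...)` on a 2x3 and a 3x2 row-major input gives the usual row-major product, which is the behaviour the `gemm_p.a.primary = in1; gemm_p.b.primary = in0;` assignment relies on.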

src/plugins/intel_cpu/src/emitters/tpp/common/kernel_executors/brgemm.hpp

v-Golubev

LGTM 👍

v-Golubev · 2025-02-17T13:33:04Z

src/plugins/intel_cpu/CMakeLists.txt

IvanNovoselov · 2025-02-21T15:33:25Z

src/plugins/intel_cpu/src/emitters/tpp/common/kernel_executors/brgemm.hpp

+class BrgemmKernelExecutor : public CPUKernelExecutor<BrgemmKernelConfig, BrgemmTppCompiledKernel> {
+public:
+    BrgemmKernelExecutor(ov::intel_cpu::MultiCacheWeakPtr kernel_cache, BrgemmKernelConfig config);
+


Please note that the main difference between CPUKernelExecutor and snippets::KernelExecutor is that the former implements caching of compiled kernels.
However, as far as I remember, libxsmm has built-in caching functionality, so every time we compile a TPP it is saved in libxsmm's internal cache. Could you check this, please?
If that's true, then all TPP kernel executors should be derived directly from snippets::KernelExecutor, and update_kernel should not perform any caching.
It is likely that the current implementation caches twice: once inside libxsmm, and once in intel_cpu::MultiCache.

Hi @IvanNovoselov, I checked, and libxsmm does indeed have an internal cache (if I create a libxsmm_gemmfunction multiple times with the same params, it returns kernels with the same address).
As far as I understand, the OV CPU-level cache could still provide some value. For example, at the moment it lets us skip libxsmm_create_gemm_shape(). If we need to fine-tune performance based on input shape in the future, some shapes may need libxsmm_dispatch_gemm while others need libxsmm_dispatch_brgemm. We might also need to set different environment variables for different shapes, or preprocess constant weights in compile_kernel(), and so on. All of this parameter preparation, beyond kernel generation itself, could be skipped on a hit in the OV CPU-level cache.
The OV CPU-level cache also has some useful features, such as a configurable capacity: setting it to zero disables the cache, and a maximum value stops caching once it is exceeded so that memory does not grow without bound. Could it also be used in snippets?
Do you think this is reasonable? Thank you!

Discussed offline. In short, libxsmm_create_gemm_shape() is very lightweight and doesn't need caching. The other use cases mentioned (e.g. using several primitives in the same executor, setting env variables, etc.) are possible in theory, but there are currently no real examples where this functionality is required.
I agree to merge as is, but we'll need to remove this extra caching later if libxsmm brgemm becomes the default solution for ARM.
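The double-caching concern in this thread can be illustrated with a small sketch. Everything here is hypothetical: `GemmConfig`, `Kernel`, `Backend` and `LruKernelCache` are stand-ins invented for this example, not the OpenVINO `MultiCache` or libxsmm types. `Backend` models a JIT library that keeps its own cache, and `LruKernelCache` models an outer bounded cache where a capacity of zero disables caching, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <memory>
#include <tuple>
#include <utility>

// Hypothetical kernel configuration used as a cache key.
struct GemmConfig {
    int m, n, k;
    bool operator<(const GemmConfig& o) const {
        return std::tie(m, n, k) < std::tie(o.m, o.n, o.k);
    }
};

// Hypothetical compiled kernel handle.
struct Kernel {
    GemmConfig cfg;
};

// Models a JIT backend with an internal cache: repeated requests with the
// same config return the same kernel object without recompiling.
class Backend {
public:
    std::shared_ptr<Kernel> dispatch(const GemmConfig& cfg) {
        ++dispatch_calls;
        auto it = cache_.find(cfg);
        if (it != cache_.end()) {
            return it->second;
        }
        ++compilations;
        auto kernel = std::make_shared<Kernel>(Kernel{cfg});
        cache_.emplace(cfg, kernel);
        return kernel;
    }
    int dispatch_calls = 0;
    int compilations = 0;

private:
    std::map<GemmConfig, std::shared_ptr<Kernel>> cache_;
};

// Outer LRU cache with a bounded capacity; capacity 0 disables caching.
class LruKernelCache {
public:
    explicit LruKernelCache(std::size_t capacity) : capacity_(capacity) {}

    std::shared_ptr<Kernel> get_or_create(const GemmConfig& cfg, Backend& backend) {
        if (capacity_ == 0) {
            return backend.dispatch(cfg);  // caching disabled: always ask the backend
        }
        auto it = index_.find(cfg);
        if (it != index_.end()) {
            // Hit: mark the entry as most recently used.
            order_.splice(order_.begin(), order_, it->second);
            return it->second->second;
        }
        auto kernel = backend.dispatch(cfg);
        order_.emplace_front(cfg, kernel);
        index_[cfg] = order_.begin();
        if (order_.size() > capacity_) {
            // Evict the least recently used entry to keep memory bounded.
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        return kernel;
    }

    std::size_t size() const { return order_.size(); }

private:
    using Entry = std::pair<GemmConfig, std::shared_ptr<Kernel>>;
    std::size_t capacity_;
    std::list<Entry> order_;
    std::map<GemmConfig, std::list<Entry>::iterator> index_;
};
```

In this model an outer hit only saves the call into the backend, since the backend would have returned the same kernel anyway. After an outer eviction the backend still holds the kernel, so the kernel is stored in two places while the outer cache's capacity limit does not bound the backend's memory.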


chenhu-wang requested review from a team as code owners January 8, 2025 06:59

github-actions bot added the category: CPU OpenVINO CPU plugin label Jan 8, 2025

chenhu-wang marked this pull request as draft January 8, 2025 07:28

github-actions bot added the category: build OpenVINO cmake script / infra label Jan 9, 2025

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch 3 times, most recently from a5b829d to 6ca4f1b Compare January 9, 2025 08:09

chenhu-wang marked this pull request as ready for review January 13, 2025 06:37

chenhu-wang requested a review from a team as a code owner January 13, 2025 06:37

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch 15 times, most recently from 982e2c2 to 6e05cb1 Compare January 15, 2025 05:33

a-sidorova reviewed Jan 15, 2025


a-sidorova self-assigned this Jan 15, 2025

v-Golubev self-assigned this Jan 16, 2025

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch from 6e05cb1 to 96e274c Compare January 22, 2025 06:36

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch 2 times, most recently from ce991bb to 2a7d14e Compare February 14, 2025 14:41

v-Golubev reviewed Feb 14, 2025


chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch 3 times, most recently from ce3e097 to 5411abb Compare February 17, 2025 06:35

v-Golubev approved these changes Feb 17, 2025


v-Golubev assigned IvanNovoselov and unassigned v-Golubev Feb 17, 2025

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch from 5411abb to 3560cd9 Compare February 18, 2025 09:19

chenhu-wang requested a review from IvanNovoselov February 19, 2025 03:25

IvanNovoselov reviewed Feb 21, 2025


alvoron approved these changes Feb 25, 2025


chenhu-wang added 14 commits February 26, 2025 09:59

brgemm emitter and executor

ddbc8f1

executor cache

7062f3a

update arm passes, test enable

399386d

update cmake

60c2082

refactor tpp on x64 and aarch64

9234c3c

apply Alexandra comments

c90d07a

Matmul with TPP on x64

ce1d87e

update after rebase

d7b7868

apply Alexandra comments-2

62217cb

check android

8b6fa9b

revert Android check

eaf0c8c

apply Vladislav comments

7c04991

apply comments continue

0187b19

Vladislav comments apply-2

Ivan suggestions apply

e1a807d

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch from 3560cd9 to e1a807d Compare February 26, 2025 02:19

IvanNovoselov approved these changes Feb 26, 2025


IvanNovoselov added this pull request to the merge queue Feb 26, 2025


