[CPU][ARM][x64]Snippets MatMul via brgemm emitter and executor #28304
@a-sidorova, could you please review this as well, since you are reviewing #28229? The test cases passed on ARM for snippets MatMul. Thank you!
src/plugins/intel_cpu/src/transformations/tpp/aarch64/pass/lowered/brgemm_tpp_blocking.cpp
src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.cpp
src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.hpp
src/plugins/intel_cpu/src/emitters/tpp/aarch64/jit_brgemm_emitter.cpp
src/plugins/intel_cpu/src/emitters/tpp/common/kernel_executors/brgemm.cpp
gemm_p.a.primary = in1;
gemm_p.b.primary = in0;
in0 and in1 look mixed up: I'd say that the A input should be in0, not in1. However, the x64 impl has the same situation... Do you have any idea why it is done this way?
@IvanNovoselov or maybe you have some secret TPP knowledge about why we form the runtime args this way? :)
The data is row-major in OV, while MatMul in libxsmm assumes the data is column-major. Exchanging in0 and in1 avoids a data repack. The M/N, lda/ldb and in0/in1 precisions are also exchanged in libxsmm_create_gemm_shape(). @IvanNovoselov could you confirm this, or correct me if I misunderstand it? Thank you!
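For future readers, the operand swap described above can be illustrated with a small standalone sketch. It does not use libxsmm: gemm_col_major is a hypothetical naive stand-in for a column-major GEMM, and matmul_row_major is an illustrative helper name. It relies on the identity that a row-major X has the same memory layout as a column-major X^T, so C = in0 * in1 in row-major form can be computed as C^T = in1^T * in0^T by a column-major kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical naive column-major GEMM: C(m x n) = A(m x k) * B(k x n).
// Element (r, c) of a column-major matrix with leading dimension ld
// is stored at buf[c * ld + r].
void gemm_col_major(std::size_t m, std::size_t n, std::size_t k,
                    const float* a, std::size_t lda,
                    const float* b, std::size_t ldb,
                    float* c, std::size_t ldc) {
    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t row = 0; row < m; ++row) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < k; ++i) {
                acc += a[i * lda + row] * b[col * ldb + i];
            }
            c[col * ldc + row] = acc;
        }
    }
}

// Row-major C(M x N) = in0(M x K) * in1(K x N) without repacking.
// The operands and M/N are swapped, so the column-major kernel
// effectively computes C^T = in1^T * in0^T in place.
std::vector<float> matmul_row_major(std::size_t M, std::size_t N, std::size_t K,
                                    const std::vector<float>& in0,
                                    const std::vector<float>& in1) {
    assert(in0.size() == M * K && in1.size() == K * N);
    std::vector<float> out(M * N, 0.0f);
    gemm_col_major(/*m=*/N, /*n=*/M, /*k=*/K,
                   /*a=*/in1.data(), /*lda=*/N,
                   /*b=*/in0.data(), /*ldb=*/K,
                   out.data(), /*ldc=*/N);
    return out;
}
```

Calling matmul_row_major(2, 2, 3, ...) with in0 = {1..6} and in1 = {7..12} yields the usual row-major product, which is the same effect as assigning in1 to the A slot and in0 to the B slot of the kernel arguments.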
Aha, got it, thanks for the explanation! Maybe we could leave an explanatory comment then to avoid potential questions in the future?
comment added!
src/plugins/intel_cpu/src/emitters/tpp/common/kernel_executors/brgemm.hpp
LGTM 👍
class BrgemmKernelExecutor : public CPUKernelExecutor<BrgemmKernelConfig, BrgemmTppCompiledKernel> {
public:
    BrgemmKernelExecutor(ov::intel_cpu::MultiCacheWeakPtr kernel_cache, BrgemmKernelConfig config);
Please note that the main difference between CPUKernelExecutor and snippets::KernelExecutor is that the former implements caching of compiled kernels.
However, as far as I remember, libxsmm has built-in caching functionality (here for example), so every time we compile a TPP it is saved in libxsmm's internal cache. Could you please check this?
If that's true, then all TPP kernel executors should be derived directly from snippets::KernelExecutor, and update_kernel should not perform any caching.
It is likely that we perform double caching in the current implementation: inside libxsmm and in intel_cpu::MultiCache.
Hi @IvanNovoselov, I checked that libxsmm indeed has an internal cache (if I create a libxsmm_gemmfunction multiple times with the same params, it returns kernels with the same address).
As far as I understand, the OV CPU-level cache could still provide some value. For example, the current libxsmm_create_gemm_shape() call could be avoided with this cache. If we need to fine-tune performance based on input shape in the future, some shapes may need libxsmm_dispatch_gemm while others need libxsmm_dispatch_brgemm. Or we may need to set different environment variables for different shapes, or preprocess constant weights in compile_kernel(), etc. All of these parameter-preparation tasks, apart from kernel generation itself, could be saved by using the OV CPU-level cache.
The OV CPU-level cache also has some useful features, such as capacity: it can be set to zero to disable the cache, or to a maximum value beyond which caching stops, so that memory does not grow without limit. Could it also be used in snippets?
Do you think this is reasonable? Thank you!
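To make the capacity behaviour mentioned above concrete, here is a minimal sketch of a bounded cache. It is not OpenVINO's MultiCache: BoundedLruCache and get_or_create are hypothetical names, and least-recently-used eviction is an assumption chosen for illustration (the real cache may instead simply stop caching once the limit is reached). A capacity of zero disables caching entirely, so every lookup rebuilds the value.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical bounded cache, for illustration only.
template <typename Value>
class BoundedLruCache {
public:
    explicit BoundedLruCache(std::size_t capacity) : m_capacity(capacity) {}

    // Returns the cached value for `key`, or builds it with `builder` on a miss.
    Value get_or_create(const std::string& key, const std::function<Value()>& builder) {
        assert(builder && "builder must be callable");
        if (m_capacity == 0) {
            return builder();  // caching disabled: always rebuild
        }
        auto it = m_index.find(key);
        if (it != m_index.end()) {
            // Hit: move the entry to the front (most recently used).
            m_entries.splice(m_entries.begin(), m_entries, it->second);
            return it->second->second;
        }
        // Miss: evict the least recently used entry if the cache is full.
        if (m_entries.size() == m_capacity) {
            m_index.erase(m_entries.back().first);
            m_entries.pop_back();
        }
        m_entries.emplace_front(key, builder());
        m_index[key] = m_entries.begin();
        return m_entries.front().second;
    }

    std::size_t size() const { return m_entries.size(); }

private:
    using Entry = std::pair<std::string, Value>;
    std::size_t m_capacity;
    std::list<Entry> m_entries;  // front = most recently used
    std::unordered_map<std::string, typename std::list<Entry>::iterator> m_index;
};
```

In this sketch the builder stands in for all the preparation work (shape creation, dispatch choice, weight preprocessing) that a hit would skip; whether that saving matters when the underlying library already caches generated kernels is exactly the question raised in this thread.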
Discussed offline. In short, libxsmm_create_gemm_shape() is very lightweight and doesn't need caching. The other use cases mentioned (e.g. using several primitives in the same executor, setting env variables, etc.) are possible in theory, but there are currently no real examples where this functionality is required.
I agree to merge as is, but we'll need to remove this extra caching later if libxsmm brgemm becomes the default solution for ARM.
Vladislav comments apply-2