This repository has been archived by the owner on May 27, 2021. It is now read-only.
Draft: Add matmul API #629
thomasfaingnaert wants to merge 35 commits into JuliaGPU:master from thomasfaingnaert:tf/matmul-kernel
Conversation
Force-pushed from 4b580fa to 899edb5
This ensures that the size of the array in global memory is known statically.
This reverts commit 3f64767.
Force-pushed from 6ea499d to 2acd20b
This reverts commit b33cbce.
Force-pushed from e573a92 to 2298f57
This reverts commit c93cf4d.
This PR contains an initial implementation of (my proposal for) an API to instantiate flexible matrix multiplication kernels.

It is divided into two large parts:
- `src/device/tiling.jl`: the Tiling API
- `src/device/matmul_kernels/*`: the matmul API itself

The matmul API consists of several components, which allow the user to customise the behaviour of the GEMM:
- `config.jl`: This file defines the `Config` type that allows the user to customise the parameters of the matmul. A helper function `get_config` allows creating this `Config` easily, and additionally includes some heuristics to set default values for parameters the user does not specify. Note that the tiling sizes are specified in a "logical" coordinate space, i.e. the precise meaning is up to the user. (A usage sketch follows this list.)
- `layout.jl`: Layouts determine how the logical coordinates are converted to physical offsets in memory. Each matrix (`A`, `B`, `C`, `D`) can have a different layout in both global and shared memory.
- `transform.jl`: Transforms are applied after every load, and before every store. They are essentially functors that are baked into the memory stream from global to shared memory and from shared memory to registers (and vice versa for `D`, obviously). The most obvious use case here is elementwise transforms, be it scaling or activation functions in neural nets.
- `operator.jl`: Operators define the computation performed in the inner loop of the GEMM, and how it is performed.
- `epilogue.jl`: Epilogues define what happens at the last step of the GEMM. At that point, each CTA has a tile of the resultant matrix in shared memory. The default epilogue just stores this tile to global memory, but other epilogues may perform more complex operations, such as reductions across thread blocks.
- `kernel.jl`: The implementation of the matrix multiplication kernel itself. It uses the abstractions described above, together with the Tiling API.

At the moment, only the components needed for a mixed-precision GEMM using WMMA are implemented (about 1 or 2 components per abstraction).
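To make these components concrete, here is the usage sketch referenced in the `config.jl` item above. All matmul-API identifiers (`get_config`, `Operator.WMMAOp`, `Layout.AlignedColMajor`, `matmul`, and the keyword names) are my reading of the component descriptions, not necessarily the exact API in this PR:

```julia
using CUDAnative, CuArrays

# Hypothetical usage sketch; all matmul-API names below are assumptions
# based on the component descriptions, not necessarily the PR's exact API.
M = N = K = 2048
a = CuArray(rand(Float16, M, K))    # A and B in Float16,
b = CuArray(rand(Float16, K, N))    # accumulated into Float32 C and D
c = CuArray(rand(Float32, M, N))
d = similar(c)

# get_config applies heuristics for any parameters left unspecified;
# tile sizes live in the "logical" coordinate space mentioned above.
conf = get_config(
    gemm_shape      = (M = M, N = N, K = K),        # logical problem size
    operator        = Operator.WMMAOp{16, 16, 16},  # WMMA inner-loop operator
    global_a_layout = Layout.AlignedColMajor{Float16},
    global_c_layout = Layout.AlignedColMajor{Float32},
)

# A transform is just a functor baked into a load/store stream, e.g. an
# elementwise scaling (hypothetical; it would be passed via the config):
struct Scale{T}
    alpha::T
end
(s::Scale)(x) = s.alpha * x

matmul(a, b, c, d, conf)    # computes D = A * B + C
```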
For `M = N = K = 2048`, the Julia implementation takes about 536 µs, compared to cuBLAS's 440 µs (`turing_s1688gemm_fp16_128x256_ldg8_nn`), resulting in a performance of about 82% that of cuBLAS.

As a final note, I have mainly been testing this on Julia v1.5.0-DEV-324 (LLVM 9.0.1).
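For reference, the timing loop behind numbers like these could look roughly as follows. This is a sketch: it reuses the hypothetical `matmul`/`conf` from the sketch above, and assumes CuArrays' `mul!` dispatches to cuBLAS for these element types:

```julia
using CuArrays, CUDAdrv
using LinearAlgebra: mul!

# Average GPU time over several runs, excluding compilation overhead.
function time_gpu(f; iters = 10)
    f()                          # warm-up run (triggers compilation)
    CUDAdrv.synchronize()
    t = Base.@elapsed begin
        for _ in 1:iters
            f()
        end
        CUDAdrv.synchronize()    # wait for all kernels to finish
    end
    return t / iters
end

t_julia  = time_gpu(() -> matmul(a, b, c, d, conf))  # this PR's kernel
t_cublas = time_gpu(() -> mul!(d, a, b))             # cuBLAS baseline
println("fraction of cuBLAS: ", round(t_cublas / t_julia, digits = 2))
```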
While the matmul still works in Julia 1.4.1 (LLVM 8.0.1), I've noticed a reduction in performance, which seems to be mainly caused by the `@unroll for` loops not being unrolled.
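For context on that last point: `@unroll` is meant to fully unroll fixed-trip-count loops, such as those over a register tile. A minimal illustration, assuming the macro comes from GPUifyLoops.jl as was common for CUDAnative-era kernels (this PR may define its own):

```julia
using GPUifyLoops: @unroll

# @unroll asks the compiler to replicate the body four times instead of
# emitting a loop. On LLVM 8 (Julia 1.4.1) this unrolling reportedly does
# not happen, which explains the performance drop described above.
function fma4!(acc, a, b)
    @unroll for i in 1:4
        @inbounds acc[i] = fma(a[i], b[i], acc[i])
    end
    return acc
end
```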