
Add matmul API #629

Draft · thomasfaingnaert wants to merge 35 commits into master

Conversation

thomasfaingnaert
Member

This PR contains an initial implementation of (my proposal for) an API to instantiate flexible matrix multiplication kernels.
It is divided into two large parts:

  • A Tiling API that aims to make recursively subdividing matrices (or tensors in general) easier (src/device/tiling.jl); a small sketch of the idea follows this list
  • The API for matrix multiplication itself, which uses the tiling API (src/device/matmul_kernels*)
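
To make the tiling idea concrete, here is a minimal, self-contained sketch of recursive subdivision in logical coordinates. The names Tile and subdivide mirror the spirit of src/device/tiling.jl but are illustrative, not necessarily the exact API:

```julia
# Illustrative sketch of recursive tiling in logical coordinates;
# `Tile` and `subdivide` are hypothetical names, not necessarily the
# exact API in src/device/tiling.jl.
struct Tile
    size::NamedTuple    # logical size, e.g. (M = 128, N = 128)
    offset::NamedTuple  # logical offset of this tile within its parent
end

Tile(; size...) = Tile(values(size), map(_ -> 0, values(size)))

# Partition `parent` into subtiles of logical size `sub`, and return the
# subtile with linear index `idx` (e.g. a warp or thread index, 1-based).
function subdivide(parent::Tile, sub::NamedTuple, idx::Integer)
    count  = map(÷, parent.size, sub)                    # subtiles per dimension
    coord  = Tuple(CartesianIndices(values(count))[idx]) # linear -> N-dimensional
    offset = NamedTuple{keys(sub)}(map((o, s, c) -> o + s * (c - 1),
                                       Tuple(parent.offset), Tuple(sub), coord))
    return Tile(sub, offset)
end

# Example: distribute a 128×128 CTA tile over 8 warps, 32×64 each.
cta_tile  = Tile(M = 128, N = 128)
warp_tile = subdivide(cta_tile, (M = 32, N = 64), 3)   # tile for warp 3
# warp_tile.offset == (M = 64, N = 0)
```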

The matmul API itself consists of several components, which allow the user to customise the behaviour of the GEMM:

  • config.jl: This file defines the Config type that allows the user to customise the parameters of the matmul. A helper function get_config makes creating this Config easy, and additionally includes some heuristics to set default values for parameters the user does not specify. Note that the tile sizes are specified in a "logical" coordinate space, i.e. their precise meaning is up to the user. A usage sketch follows this list.
  • layout.jl: Layouts determine how logical coordinates are converted to physical offsets in memory. Each matrix (A, B, C, D) can have a different layout in both global and shared memory (a minimal layout sketch also follows this list).
  • transform.jl: Transforms are applied after every load and before every store. They are essentially functors that are baked into the memory stream from global to shared memory and from shared memory to registers (and vice versa for D, obviously). The most obvious use case here is elementwise transforms, be it scaling or activation functions in neural nets (see the transform sketch below).
  • operator.jl: Operators define the computation performed in the inner loop of the GEMM, and how it is performed.
  • epilogue.jl: Epilogues define what happens at the last step of the GEMM. At that point, each CTA has a tile of the resultant matrix in shared memory. The default epilogue just stores this tile to global memory, but other epilogues may perform more complex operations, such as reductions across thread blocks.
  • kernel.jl: The implementation of the matrix multiplication kernel itself. It uses the abstractions described above, and the Tiling API.
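
As a rough end-to-end sketch of what instantiating a mixed-precision WMMA GEMM through this API could look like. The module paths and keyword names (MatMul, Operator.WMMAOp, Layout.AlignedColMajor, gemm_shape, ...) are illustrative assumptions, not necessarily the identifiers in this PR; consult config.jl and friends for the real ones:

```julia
# Hypothetical usage sketch; assumes the matmul API's modules (here
# called MatMul, Layout and Operator) are in scope.
using CuArrays, CUDAnative

M = N = K = 2048
a = CuArray(rand(Float16, M, K))
b = CuArray(rand(Float16, K, N))
c = CuArray(zeros(Float32, M, N))
d = similar(c)

# get_config fills in any parameters the user does not specify (tile
# sizes, number of warps, ...) using built-in heuristics.
conf = MatMul.get_config(
    gemm_shape = (M = M, N = N, K = K),
    operator   = Operator.WMMAOp{16, 16, 16},          # 16×16×16 WMMA fragments
    global_a_layout = Layout.AlignedColMajor{Float16},
    global_c_layout = Layout.AlignedColMajor{Float32},
)

MatMul.matmul(a, b, c, d, conf)                        # computes D = A * B + C
```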
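For layouts, the essential contract is a mapping from logical coordinates to physical offsets. A minimal, hypothetical example (these types are illustrations, not the ones in layout.jl):

```julia
# Hypothetical layout: map a logical (i, j) coordinate to a linear
# offset for a column-major matrix with leading dimension `ld`.
struct ColMajorLayout
    ld::Int
end

linear_offset(l::ColMajorLayout, i, j) = (j - 1) * l.ld + i

# A padded variant, as one might use in shared memory to avoid bank
# conflicts: identical logical coordinates, different physical offsets.
struct PaddedColMajorLayout
    ld::Int
    pad::Int
end

linear_offset(l::PaddedColMajorLayout, i, j) = (j - 1) * (l.ld + l.pad) + i
```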
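Similarly, since transforms are plain functors baked into the loads and stores, an elementwise scaling transform could look like the following (the struct and the keyword name in the comment are hypothetical):

```julia
# Hypothetical elementwise transform: scales every element that flows
# through it, e.g. on the way from shared memory to registers.
struct ElementwiseScale{T}
    alpha::T
end

# Broadcasting over `x` so the transform works on scalars as well as on
# small per-thread fragments.
(t::ElementwiseScale)(x) = t.alpha .* x

# e.g. passed via the config (keyword name illustrative):
#   get_config(...; transform_shared_to_regs_a = ElementwiseScale(Float16(2)))
```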

At the moment, only the components needed for a mixed-precision GEMM using WMMA are implemented (about one or two components per abstraction).
For M = N = K = 2048, the Julia implementation takes about 536 µs, compared to cuBLAS's 440 µs (turing_s1688gemm_fp16_128x256_ldg8_nn), i.e. about 82% of cuBLAS's performance.

As a final note, I have mainly been testing this on Julia v1.5.0-DEV-324 (LLVM 9.0.1).
While the matmul still works on Julia 1.4.1 (LLVM 8.0.1), I've noticed a performance regression, which seems to be mainly caused by @unroll-annotated for loops not actually being unrolled.

@thomasfaingnaert marked this pull request as draft on April 25, 2020, 17:20