Very slow 4x4 convolutions on gfx803 #134

Closed
huanzhang12 opened this issue Mar 30, 2020 · 4 comments

@huanzhang12

Since the ASM kernels were disabled on gfx803 in commit ce51a4c, 4x4 convolutions on gfx803 fall back to the very slow GEMM algorithm:

./bin/MIOpenDriver conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver: conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 0, Solution: 33/gemm
GPU Kernel Time Forward Conv. Elapsed: 28.185415 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 15, 1, 28.185415
Forward Convolution Verifies on CPU and GPU (7.39356e-08)

Before ASM kernels were disabled, it was much faster:

MIOpenDriver: conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 3, Solution: 15/ConvBinWinogradRxS
GPU Kernel Time Forward Conv. Elapsed: 0.592643 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 694, 27, 0.592643
Forward Convolution Verifies on CPU and GPU (4.98272e-08)

Performance drops from 694 GFLOPs to 15 GFLOPs (0.59 ms versus 28.19 ms per forward pass, roughly 47x slower).
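
For readers who want to sanity-check the `stats:` columns, here is a minimal Python sketch that reproduces them from the driver flags. The formulas are inferred from the numbers in the logs in this thread (they match exactly for this problem size), not taken from the MIOpen source, and the function names are hypothetical. It assumes 4-byte (fp32) elements and no bias.

```python
def conv_out_size(in_size, pad, kernel, stride, dilation=1):
    """Standard convolution output size along one spatial dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (in_size + 2 * pad - effective_kernel) // stride + 1


def conv_stats(n, c, H, W, k, y, x, p, q, u, v, l=1, j=1, elem_bytes=4):
    """Estimate the stats columns for a forward convolution.

    Argument names mirror the MIOpenDriver flags used in this thread.
    Assumption: fp32 data (elem_bytes=4) and no bias term.
    """
    ho = conv_out_size(H, p, y, u, l)
    wo = conv_out_size(W, q, x, v, j)
    # One multiply and one add per (output element, input channel, filter tap).
    flop_cnt = 2 * n * c * ho * wo * y * x * k
    # Input tensor plus filter weights are read; the output tensor is written.
    bytes_read = (n * c * H * W + k * c * y * x) * elem_bytes
    bytes_written = n * k * ho * wo * elem_bytes
    return {
        "ho": ho,
        "wo": wo,
        "flopCnt": flop_cnt,
        "bytesRead": bytes_read,
        "bytesWritten": bytes_written,
    }


def gflops(flop_cnt, time_ms):
    """Throughput in GFLOP/s for a kernel time given in milliseconds."""
    return flop_cnt / (time_ms * 1e6)


stats = conv_stats(n=256, c=16, H=28, W=28, k=16, y=4, x=4, p=1, q=1, u=2, v=2)
print(stats)
print(round(gflops(stats["flopCnt"], 28.185415)))  # GEMM timing from the log
print(round(gflops(stats["flopCnt"], 0.592643)))   # Winograd timing from the log
```

Because the FLOP count depends only on the problem shape, the GFLOPs column is just a rescaled view of the elapsed time, so the two runs can be compared directly by their `timeMs` values.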

Why were all ASM kernels disabled for gfx803, rather than only the individual problematic ones?

Also, even without an ASM implementation, can we use a general OpenCL implementation in this case rather than relying on the extremely slow GEMM? (It seems conv_ocl_dir2Dfwd.cpp is not enabled for most 4x4 convolutions.)

@atamazov
Contributor

Which version of MIOpen are you using?

@huanzhang12
Author

@atamazov Thanks for asking! I am using the current master of this repository (commit 2fe2ae9).
I also tried the pre-compiled MIOpen shipped with ROCm 3.1.
To obtain the performance of the ASM kernels, I manually reverted commit ce51a4c on the current master.

@atamazov
Contributor

@huanzhang12 AFAICS you are using 2.2.0. Version 2.3.0 has just been released. It includes c58488b, which should restore gfx8 performance. Please close this if the issue is resolved.

@huanzhang12
Author

@atamazov I tried the just-released version 2.3.0 and it is amazing! It is great news that ASM kernels are re-enabled on gfx803. The same 4x4 convolution now runs at 1684 GFLOPs:

./bin/MIOpenDriver conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 26/ConvHipImplicitGemmV4R1Fwd
GPU Kernel Time Forward Conv. Elapsed: 0.244142 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 1684, 66, 0.244142
Forward Convolution Verifies on CPU and GPU (7.70828e-08)

My workload involving some 4x4 convolutions runs 10 times faster on v2.3.0. Thank you so much for the hard work and I am closing this issue.
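
To compare runs like the ones above without reading the numbers by eye, the two `stats:` lines can be parsed into a record. The following is a minimal Python sketch; it assumes the comma-separated layout shown in the logs in this thread, and `parse_stats` is a hypothetical helper, not part of MIOpen.

```python
PREFIX = "stats:"

# Lines copied from the logs in this thread.
HEADER = "stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs"
GEMM_LINE = "stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 15, 1, 28.185415"
V4R1_LINE = "stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 1684, 66, 0.244142"


def _convert(text):
    """Convert a field to int or float when possible, else keep the string."""
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            pass
    return text


def parse_stats(header_line, values_line):
    """Pair the column names from the header line with the values line."""
    if not (header_line.startswith(PREFIX) and values_line.startswith(PREFIX)):
        raise ValueError("both lines must start with 'stats:'")
    names = [field.strip() for field in header_line[len(PREFIX):].split(",")]
    values = [field.strip() for field in values_line[len(PREFIX):].split(",")]
    if len(names) != len(values):
        raise ValueError("header and values have different column counts")
    return {name: _convert(value) for name, value in zip(names, values)}


gemm = parse_stats(HEADER, GEMM_LINE)
v4r1 = parse_stats(HEADER, V4R1_LINE)
speedup = gemm["timeMs"] / v4r1["timeMs"]
print(f"{v4r1['name']}: {speedup:.1f}x faster than the GEMM fallback")
```

For this single kernel the ratio of elapsed times is roughly 115x; the 10x figure quoted above refers to a whole workload, where only some layers are 4x4 convolutions.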


2 participants