
No cuBLAS performance gain for F16 #1249


Closed
ggerganov opened this issue Apr 30, 2023 · 3 comments
Labels
question Further information is requested

Comments

@ggerganov
Member

I noticed that using cuBLAS with the F16 model does not give any benefit compared to non-BLAS CPU-only mode:

# with cuBLAS
$ make clean && LLAMA_CUBLAS=1 make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
....
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
14.33 seconds per pass - ETA 2 hours 36 minutes
[1]4.2336,^C^C

# without BLAS
$ make clean && make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
...
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
13.75 seconds per pass - ETA 2 hours 30 minutes
[1]4.2335,^C

System:

  • GeForce GTX 1660
  • AMD Ryzen 9 5950X

In contrast, when using a quantized model, the cuBLAS run is significantly faster.

Is this expected?
I was hoping to have some performance improvement for F16 as well.
Could it be that the data transfer for F16 is so slow that it defeats the purpose of offloading to the GPU?
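
A rough back-of-envelope check of that suspicion (every figure below is an assumption, not a measurement): a 7B F16 model is about 14 GB of weights, so if each evaluation has to ship the weight matrices of the big mat muls over PCIe, the copies alone cost on the order of a second per pass:

/* back-of-envelope only: assumed parameter count and PCIe bandwidth */
#include <stdio.h>

int main(void) {
    const double params    = 7e9;   /* assumed 7B parameters                */
    const double f16_bytes = 2.0;   /* bytes per F16 weight                 */
    const double pcie_gbs  = 12.0;  /* assumed effective PCIe 3.0 x16, GB/s */

    double weight_gb = params * f16_bytes / 1e9;   /* ~14 GB                */
    double copy_s    = weight_gb / pcie_gbs;       /* ~1.2 s per full copy  */

    printf("weights: %.1f GB, one host->device copy: %.2f s\n", weight_gb, copy_s);
    return 0;
}

That would not account for the whole 14 seconds per pass, but it suggests the host-to-device copies are far from free.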

I noticed this after porting the latest ggml to whisper.cpp, where we use F16 precision, and was surprised that cuBLAS does not bring any improvement.

For example, some time ago I tried using NVBLAS in whisper.cpp and it did bring some decent improvements: ggml-org/whisper.cpp#220 (comment)

The NVBLAS code change was very trivial: ggml-org/whisper.cpp#239
What could NVBLAS be doing better in this case?
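
(For context, the NVBLAS drop-in itself just intercepts the standard Level-3 BLAS calls and is configured with a small file - roughly like the generic sketch below, which is not the exact change in ggml-org/whisper.cpp#239; the library names, values and binary name are placeholders:)

# nvblas.conf (generic sketch; values are placeholders, not tuned)
NVBLAS_LOGFILE       nvblas.log
NVBLAS_CPU_BLAS_LIB  libopenblas.so     # CPU BLAS used as fallback
NVBLAS_GPU_LIST      ALL
NVBLAS_TILE_DIM      2048

# run without relinking:
$ NVBLAS_CONFIG_FILE=./nvblas.conf LD_PRELOAD=libnvblas.so ./main ...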

ggerganov added the question label Apr 30, 2023
@slaren
Member

slaren commented Apr 30, 2023

With an RTX 3080:

F16 used to be the fastest before dequantization on the GPU was implemented: #1044

With the current master, it is still faster than it was originally, so I don't think that there has been a regression:
3.50 seconds per pass - ETA 38 minutes

I don't know why this isn't the case with your GTX 1660. From what I could find, it is a Turing chip that can do FP16.

@slaren
Member

slaren commented Apr 30, 2023

I have been experimenting with doing the f16xf32 mat muls in f32 (instead of f16 as it is currently done) in https://github.com/slaren/llama.cpp/commits/cuda-f16f32

For me this is faster with quantized models but slower with F16 models; maybe the results are different on your GPU.
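
Roughly, the difference is between these two cuBLAS calls (a minimal sketch on device-resident buffers, not the actual ggml-cuda code; the F16<->F32 conversions, allocation and error handling are omitted):

#include <cublas_v2.h>
#include <cuda_fp16.h>

/* current path: F32 activations converted to F16, GEMM runs with F16 inputs */
void mul_mat_f16(cublasHandle_t h, const half *dA, const half *dB,
                 float *dC, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(h, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                 &alpha, dA, CUDA_R_16F, k,
                         dB, CUDA_R_16F, k,
                 &beta,  dC, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}

/* cuda-f16f32 experiment: F16 weights converted to F32, then a plain SGEMM */
void mul_mat_f32(cublasHandle_t h, const float *dA, const float *dB,
                 float *dC, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                &alpha, dA, k, dB, k, &beta, dC, m);
}

The only real difference is the input type of the GEMM; which one comes out faster depends on the GPU's FP16 throughput.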

@ggerganov
Member Author

Thanks - it seems the problem is somehow specific to the GeForce GTX 1660.
I ran the same test on a GeForce RTX 4080 and there is a significant improvement.
Also, whisper.cpp is much faster with cuBLAS.

I think the NVBLAS test that I did before was on a GeForce RTX 2060.
