
ggml-cuda : perform cublas fp16 matrix multiplication as fp16 #3370

Merged
merged 3 commits into master from cublas-f16 on Sep 28, 2023

Conversation

@slaren (Member) commented Sep 27, 2023

Improves prompt processing performance with fp16 models.

3090 Ti/WSL2:

| model | backend | ngl | test | master t/s | PR t/s | speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| llama 7B mostly F16 | CUDA | 99 | pp 512 | 1661.19 ± 3.09 | 3984.28 ± 25.45 | 2.40 |
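For context, the gist of the change is to keep the cuBLAS GEMM entirely in half precision (fp16 operands, fp16 output, fp16 compute type) instead of converting the fp16 activations to fp32 and calling the fp32 GEMM. The snippet below is only a hedged sketch of what such a call looks like; the function name, argument layout, and leading dimensions are illustrative assumptions, not the actual diff in ggml-cuda.cu.

```cpp
// Illustrative sketch (assumed names/layout): run the matrix multiplication
// with fp16 inputs, fp16 output, and an fp16 compute type via cublasGemmEx.
#include <cublas_v2.h>
#include <cuda_fp16.h>

static void mul_mat_f16_sketch(cublasHandle_t handle, cudaStream_t stream,
                               const half * src0, const half * src1, half * dst,
                               int m, int n, int k) {
    // With CUBLAS_COMPUTE_16F, alpha and beta are passed as half values.
    const half alpha = 1.0f;
    const half beta  = 0.0f;

    cublasSetStream(handle, stream);
    // src0 is passed as the transposed operand to match ggml's row-major layout.
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                 m, n, k,
                 &alpha, src0, CUDA_R_16F, k,
                         src1, CUDA_R_16F, k,
                 &beta,  dst,  CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

Skipping the fp32 conversion and letting the GEMM accumulate in fp16 is what drives the pp 512 speedup shown above for F16 models.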

@slaren (Member, Author) commented Sep 27, 2023

Is this actually correct? I believe compute capability 7.0 is volta, not turing.

https://github.com/ggerganov/llama.cpp/blob/7d5674dd2d045584993a46102eafa48a31388bdb/ggml-cuda.cu#L82

@bobqianic (Contributor)

> Is this actually correct? I believe compute capability 7.0 is volta, not turing.
>
> https://github.com/ggerganov/llama.cpp/blob/7d5674dd2d045584993a46102eafa48a31388bdb/ggml-cuda.cu#L82

The compute capability of Turing is 7.5, while that of Volta is 7.0. However, Volta also supports FP16.

[attached image: compute capability table]
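For reference, the last of the three commits on this PR ("restrict fp16 mat mul to volta and up") amounts to gating the new path on the device's compute capability. A minimal sketch of such a gate is below; the helper name and the CC_VOLTA constant are illustrative assumptions, not necessarily the identifiers used in ggml-cuda.cu.

```cpp
// Illustrative sketch: enable the fp16 cuBLAS path only on devices with
// compute capability >= 7.0 (Volta and newer); Turing is 7.5, Pascal is 6.x.
#include <cuda_runtime.h>

#define CC_VOLTA 700  // compute capability encoded as 100*major + 10*minor

static bool fp16_mat_mul_supported(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return 100 * prop.major + 10 * prop.minor >= CC_VOLTA;
}
```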

@ggerganov merged commit da04003 into master on Sep 28, 2023
@ggerganov (Member) commented Sep 28, 2023

Hm, this change might actually degrade the TG performance:

Before:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 3373.39 ± 3.54 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2041.33 ± 0.32 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2084.74 ± 0.08 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2015.38 ± 0.69 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2042.62 ± 0.34 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1887.36 ± 0.40 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2041.73 ± 0.42 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1745.14 ± 3.15 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1674.69 ± 4.07 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1816.83 ± 4.45 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1759.79 ± 3.11 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1503.72 ± 1.09 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1314.71 ± 0.08 |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 72.73 ± 0.02 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 102.59 ± 0.03 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 143.87 ± 0.05 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 142.90 ± 0.03 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 124.20 ± 0.03 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 125.25 ± 0.03 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 103.77 ± 0.01 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 119.19 ± 0.03 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 122.88 ± 0.04 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 128.46 ± 0.03 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 133.71 ± 0.02 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 112.58 ± 0.04 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 100.40 ± 0.07 |

build: 99115f3 (1273)

After:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 5223.21 ± 11.60 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2027.73 ± 1.49 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2074.45 ± 1.03 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2004.63 ± 0.62 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2032.11 ± 0.54 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1878.42 ± 0.38 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 2029.70 ± 1.46 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1741.54 ± 2.36 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1684.32 ± 5.19 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1799.69 ± 3.52 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1749.17 ± 2.88 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1497.79 ± 2.87 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | pp 512 | 1310.20 ± 0.14 |
| LLaMA v2 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 69.43 ± 0.01 |
| LLaMA v2 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 91.49 ± 0.02 |
| LLaMA v2 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 137.75 ± 0.02 |
| LLaMA v2 7B mostly Q4_1 | 3.95 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 136.58 ± 0.03 |
| LLaMA v2 7B mostly Q5_0 | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 112.14 ± 0.01 |
| LLaMA v2 7B mostly Q5_1 | 4.72 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 112.90 ± 0.01 |
| LLaMA v2 7B mostly Q6_K | 5.15 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 92.56 ± 0.03 |
| LLaMA v2 7B mostly Q5_K - Medium | 4.45 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 102.45 ± 0.01 |
| LLaMA v2 7B mostly Q5_K - Small | 4.33 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 104.18 ± 0.01 |
| LLaMA v2 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 111.96 ± 0.02 |
| LLaMA v2 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 116.10 ± 0.01 |
| LLaMA v2 7B mostly Q3_K - Medium | 3.07 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 103.30 ± 0.04 |
| LLaMA v2 7B mostly Q3_K - Small | 2.75 GiB | 6.74 B | CUDA | 999 | 1 | tg 128 | 94.84 ± 0.00 |

build: da04003 (1280)

Still testing to verify

@ggerganov (Member)
False alarm - forgot to build with LLAMA_CUDA_MMV_Y=2
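For context: LLAMA_CUDA_MMV_Y sets the GGML_CUDA_MMV_Y preprocessor define (the y block size of the CUDA mul-mat-vec kernels used for token generation), so tg numbers depend on how the binary was built, not only on this PR. A hedged sketch of how such a build-time knob typically appears in the source (the default value shown is the usual fallback, not a guarantee for this exact revision):

```cpp
// Sketch only: a compile-time block-size knob with a default that the build
// system can override, e.g. by passing -DGGML_CUDA_MMV_Y=2 to the compiler.
#ifndef GGML_CUDA_MMV_Y
#define GGML_CUDA_MMV_Y 1
#endif
```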

@slaren slaren deleted the cublas-f16 branch September 28, 2023 10:46
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 2, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggml-org#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggml-org#3401)
  train : fix KQ_pos allocation (ggml-org#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggml-org#3206)
  readme : update hot topics + model links (ggml-org#3399)
  readme : add link to grammars app (ggml-org#3388)
  swift : fix build on xcode 15 (ggml-org#3387)
  build : enable more non-default compiler warnings (ggml-org#3200)
  ggml_tensor: update the structure comments. (ggml-org#3283)
  ggml : release the requested thread pool resource (ggml-org#3292)
  llama.cpp : split llama_context_params into model and context params (ggml-org#3301)
  ci : multithreaded builds (ggml-org#3311)
  train : finetune LORA (ggml-org#2632)
  gguf : basic type checking in gguf_get_* (ggml-org#3346)
  gguf : make token scores and types optional (ggml-org#3347)
  ci : disable freeBSD builds due to lack of VMs (ggml-org#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggml-org#3228)
  docs : mark code as Bash (ggml-org#3375)
  readme : add Mistral AI release 0.1 (ggml-org#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggml-org#3370)
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
…rg#3370)

* ggml-cuda : perform cublas fp16 matrix multiplication as fp16

* try to fix rocm build

* restrict fp16 mat mul to volta and up
@whoreson (Contributor)
This commit broke llama.cpp on CUDA 10.

identifier "CUBLAS_COMPUTE_16F" is undefined

@whoreson (Contributor)
Let's fix this, OK? I can provide SSH access if needed.

@cebtenzzre (Collaborator)

> This commit broke llama.cpp on CUDA 10.
>
> identifier "CUBLAS_COMPUTE_16F" is undefined

Old CUDA versions seem to be a low priority, but you could open a new issue to track this and maybe someone will fix it eventually.

@ByerRA commented Nov 1, 2023

> This commit broke llama.cpp on CUDA 10.
>
> identifier "CUBLAS_COMPUTE_16F" is undefined

I am also seeing this, along with a "CUBLAS_TF32_TENSOR_OP_MATH" is undefined error, when trying to compile with CUDA 10. It would be nice to get this fixed, or at least to have a workaround so we can get something working now.
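One possible direction for a workaround, as an untested sketch only: before CUDA 11, cublasGemmEx takes a cudaDataType_t as its compute type, so the missing CUDA 11 names could be mapped back to their older equivalents. Whether this actually compiles and behaves correctly against CUDA 10 would still need to be verified.

```cpp
// Untested compatibility sketch for pre-CUDA-11 toolkits: cublasComputeType_t
// and the CUBLAS_COMPUTE_* enumerators were introduced with cuBLAS 11, and
// CUBLAS_TF32_TENSOR_OP_MATH has no direct CUDA 10 counterpart; the older
// CUBLAS_TENSOR_OP_MATH math mode is substituted here as an approximation.
#include <cuda_runtime.h>
#include <cublas_v2.h>

#if CUDART_VERSION < 11000
#define cublasComputeType_t        cudaDataType_t
#define CUBLAS_COMPUTE_16F         CUDA_R_16F
#define CUBLAS_COMPUTE_32F         CUDA_R_32F
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#endif
```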
