ggml-cuda : perform cublas fp16 matrix multiplication as fp16 #3370
Conversation
Is this actually correct? I believe compute capability 7.0 is Volta, not Turing.
The compute capability of Turing is 7.5, while that of Volta is 7.0. However, Volta also supports FP16.
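For reference, a minimal sketch of how the fp16 path can be gated on compute capability, taking Volta (7.0) as the floor per this thread; the `CC_VOLTA` constant and helper name are illustrative, not necessarily what `ggml-cuda.cu` uses:

```cpp
#include <cuda_runtime.h>

// Illustrative constant: compute capability encoded as major*100 + minor*10,
// so Volta (7.0) -> 700, Turing (7.5) -> 750, Ampere (8.0) -> 800.
#define CC_VOLTA 700

// Hypothetical helper: returns true if `device` can take the fp16
// cuBLAS mat mul path (Volta and newer have fp16 tensor cores).
static bool fp16_mat_mul_supported(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    const int cc = prop.major * 100 + prop.minor * 10;
    return cc >= CC_VOLTA;
}
```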
Hm, this change might actually degrade the TG (token generation) performance:

Before: build: 99115f3 (1273)
After: build: da04003 (1280)

Still testing to verify.
False alarm - forgot to build with
…example

* 'master' of github.com:ggerganov/llama.cpp:
  - ggml-cuda : perform cublas mat mul of quantized types as f16 (ggml-org#3412)
  - llama.cpp : add documentation about rope_freq_base and scale values (ggml-org#3401)
  - train : fix KQ_pos allocation (ggml-org#3392)
  - llama : quantize up to 31% faster on Linux and Windows with mmap (ggml-org#3206)
  - readme : update hot topics + model links (ggml-org#3399)
  - readme : add link to grammars app (ggml-org#3388)
  - swift : fix build on xcode 15 (ggml-org#3387)
  - build : enable more non-default compiler warnings (ggml-org#3200)
  - ggml_tensor: update the structure comments. (ggml-org#3283)
  - ggml : release the requested thread pool resource (ggml-org#3292)
  - llama.cpp : split llama_context_params into model and context params (ggml-org#3301)
  - ci : multithreaded builds (ggml-org#3311)
  - train : finetune LORA (ggml-org#2632)
  - gguf : basic type checking in gguf_get_* (ggml-org#3346)
  - gguf : make token scores and types optional (ggml-org#3347)
  - ci : disable freeBSD builds due to lack of VMs (ggml-org#3381)
  - llama : custom attention mask + parallel decoding + no context swaps (ggml-org#3228)
  - docs : mark code as Bash (ggml-org#3375)
  - readme : add Mistral AI release 0.1 (ggml-org#3362)
  - ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggml-org#3370)
…rg#3370)

* ggml-cuda : perform cublas fp16 matrix multiplication as fp16
* try to fix rocm build
* restrict fp16 mat mul to volta and up
This commit broke llama.cpp on CUDA 10 with the compiler error: `identifier "CUBLAS_COMPUTE_16F" is undefined`.
Let's fix this, OK? I can provide SSH access if needed.
Old CUDA versions seem to be a low priority, but you could open a new issue to track this and maybe someone will fix it eventually.
I am also seeing this, along with a `"CUBLAS_TF32_TENSOR_OP_MATH" is undefined` error, when trying to compile with CUDA 10. It would be nice to get this fixed, or at least to have a workaround so we can get something working now.
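One possible workaround, assuming both symbols first appeared with the CUDA 11 cuBLAS API (`cublasComputeType_t` and the TF32 math mode): alias them back to the CUDA 10 equivalents at compile time. A sketch, not a confirmed fix:

```cpp
#include <cuda_runtime_api.h>   // defines CUDART_VERSION
#include <cublas_v2.h>

#if defined(CUDART_VERSION) && CUDART_VERSION < 11000
// Before CUDA 11 there is no cublasComputeType_t; cublasGemmEx takes a
// cudaDataType_t for its computeType parameter, so alias the newer names
// to the older enum values.
#define CUBLAS_COMPUTE_16F CUDA_R_16F
#define CUBLAS_COMPUTE_32F CUDA_R_32F
// TF32 only exists on Ampere and the TF32 math mode was added in CUDA 11;
// the closest CUDA 10 setting is the plain tensor-op math mode.
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#endif
```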
Improves prompt processing performance with fp16 models.
3090 Ti/WSL2:
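For context, a minimal sketch of the kind of fp16 GEMM call this change enables, assuming CUDA 11+ and column-major matrices; the function name, shapes, and leading dimensions are illustrative, not the exact llama.cpp code:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = A * B with A: m x k, B: k x n, C: m x n, all fp16, column-major.
// CUBLAS_COMPUTE_16F keeps inputs, outputs, and accumulation in half
// precision, letting cuBLAS pick tensor-core kernels on Volta and newer,
// instead of converting to fp32 for the multiply.
static void gemm_f16(cublasHandle_t handle,
                     const half * A, const half * B, half * C,
                     int m, int n, int k) {
    // With a half compute type, alpha and beta must also be half.
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);

    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

On pre-Volta hardware the fp32 path would remain in effect, which is why the compute-capability gating discussed above matters.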