ClBlast - no GPU load, no performance difference. #1217
I assume you have run a large enough prompt that BLAS was actually getting used? I'm not sure how it could go wrong then, it has picked the correct device and is obviously loaded. |
Yes, for example I tried the classic DAN prompt:
|
I don't know what the issue could be. I haven't observed any case of correct initialization but no GPU load. |
I was able to make it work on Windows with w64devkit, but I had to build the libraries from source. (CLBlast has libraries available on the releases page, but do they work with w64devkit?) |
@0cc4m This is strange, because everything works in koboldcpp. I think there is something wrong with my build process. Could you please write down how you built all this under Windows? I think it would be useful to add this to the README as well. |
@Folko-Ven I'm sorry, I don't use Windows. |
@SlyEcho I'm sorry to bother you again with this question, but could you please describe the whole process step by step? |
I actually used CMake GUI for a lot of it, but I guess if you don't know how these things work it is still hard. I'll try to come up with something when I'm back on Windows. |
OK, @Folko-Ven. First, try to follow the instructions in the README to build with OpenBLAS on Windows using w64devkit. If that is working, then let's continue.
At this point it should be possible to use make: make -B LLAMA_CLBLAST=1 |
SlyEcho, |
I did more testing as well. They all perform about the same, getting around 60 ms per token (plugged in, after a fresh reboot). Windows Task Manager does not show all of the GPU load by default; I had to change one of the panels to show "Compute 1", where the llama.cpp compute could be seen. The machine is a ThinkPad P14s with a Ryzen 7 PRO 5850U with Radeon Pro Graphics and 48 GB of RAM. Actually, @Folko-Ven, now that I look at your first post, the instructions I gave are pretty much identical. I will try Linux next and see if there is a difference. |
@SlyEcho You shouldn't waste so much time; the performance of OpenBLAS is not bad either, and besides, I don't use long prompts that often. P.S. How did you do that? |
There is just a little… OK, Linux testing:
non-CL:
|
That is expected: all of the BLAS implementations, including CLBlast, only accelerate the initial prompt processing, not the token generation. |
@0cc4m I see. Although the initial prompt processing can be long, it seems to be a fixed amount of time, whereas the token generation for long prompts can take far longer. I wonder if there'd be any benefit to offloading the token generation to the GPU as well. |
I now have some builds on my fork's releases page. Currently there is a version there with OpenBLAS 0.3.23. |
BLAS is only used when batch processing is viable AND it's more than |
@Green-Sky Yes, I see it in ggml_compute_forward_mul_mat_use_blas. However, it looks like the matrices are being individually copied and executed on the GPU rather than being properly batched, unless I'm misunderstanding. ggml_cl_sgemm_wrapper handles the GPU malloc, and it is being called inside an inner for loop, which causes multiple calls to ggml_cl_malloc. Ideally we'd buffer as many matrices as we could before execution, but this seems to use a copy->execute per-matrix execution model, which is expensive. |
You can experiment with the limits there. Also, I think the CL version cannot use non-contiguous tensors like the CUDA version can. |
@SlyEcho I believe the reason it's slower is that the overhead is increased: we are doing a copy per execute, instead of copy as many as fit -> execute -> copy the rest -> execute. Excessive calls to ggml_cl_malloc would explain the slowdown, but this needs experimentation to confirm. |
There will always be some part to copy, because not all of the computation is happening on the GPU, and also, all the weights might not fit into GPU memory depending on the device or the model. The GPU memory management could be much smarter, yes. But that would mean ggml needs to be heavily GPU-oriented, which is not something that is wanted. The memory management could also be done at a higher level in llama.cpp, similarly to other methods like the KV cache and the scratch buffers. For CUDA and ROCm (#1087) there are more advanced memory management features, and it helps a little bit to make the copying faster, but I don't know how easy it is to extend that to OpenCL. |
@SlyEcho I did some experiments with non-contiguous transfer and FP16 kernels, you can take a look if you want. However, the result was slower than the current implementation in my tests. Not sure if I screwed up anywhere. FP16 only works on AMD and Intel because Nvidia refuses to implement that feature for OpenCL. |
A bit of a side note, but if anybody wants to give it a try: I recently implemented an F16C-vectorized version. Vectorizing fp16 to fp32 should also be possible with |
Tip for Windows people, @Folko-Ven: install and configure MSYS2. To get CLBlast, install the packages using the MSYS console:
Then build it. Similarly for OpenBLAS:
Replace this line with |
Thanks! It worked! I don't understand why compiling with w64devkit was causing me problems. |
It should work fine with MSYS2, but it is a little limited, because you have to use the MSYS2 console to run the program. It is possible to build it better, but I recommended w64devkit because it should give you an .exe that just works. |
Additionally, you don't have to open the MSYS console at all if you add the MSYS environment to $PATH. This way you can have compilers, libraries, and POSIX commands available globally. My $PATH includes these:
This makes Windows feel like Unix. |
How I build:
When loading I got this:
But there is no GPU load and no performance difference. Btw, when I use koboldcpp I get ~40-60% GPU load.
What could have gone wrong? And how do I build CLBlast with static libraries?
P.S. I use a Ryzen 5700U without a dGPU. |