GPU utilization rate is very low with WHISPER_CUBLAS=1
#1179
Comments
Could you paste the full text of the log output, startup to shutdown?

It's at least an order of magnitude slower than my M2 Air with CoreML. I'm trying with CUDA on a g4dn.xlarge.
Yes, the current cuBLAS support is quite rudimentary because we constantly move data back and forth between the CPU and the GPU. This leaves a lot of room for optimizing the whole process and should hopefully help speed it up.
whisper_print_timings: load time = 1177.86 ms
llama_print_timings:   load time = 2547.77 ms
Fixed in #1472
It seems that the CPU is working most of the time while the GPU is resting. There's still a lot of room for optimization.
A 27.8-minute audio takes 62.7 minutes to transcribe...
model: ggml-model-largev2.bin
parameters: -bs 5 -bo 5
audio: diffusion2023-07-03.wav (27.8 min)
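For scale, the numbers reported above work out to a real-time factor of roughly 2.3x: every minute of audio takes about 2.3 minutes to transcribe. A quick sanity check of that figure (the variable names here are just for illustration):

```python
# Reported in this issue: a 27.8-minute recording took 62.7 minutes to transcribe.
audio_minutes = 27.8
transcribe_minutes = 62.7

# Real-time factor: processing time divided by audio duration.
rtf = transcribe_minutes / audio_minutes
print(f"RTF: {rtf:.2f}x slower than real time")  # → RTF: 2.26x slower than real time
```

For comparison, a well-utilized GPU typically transcribes faster than real time (RTF below 1.0), which is why the low GPU utilization stands out here.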