Advice for GPU/CUDA in an Electron app #173
-
The latest beta includes both CUDA and Vulkan prebuilt binaries for Windows and Linux, so you can utilize the GPU without compiling anything on the user's machine. In the next few betas I plan to improve the `getLlama` method to automatically detect whether CUDA or Vulkan is supported on the user's machine and enable them automatically, so the GPU is always utilized by default without any configuration necessary.

At the moment, to fully utilize the GPU on non-Mac machines, you'd have to enable CUDA or Vulkan and pass the number of layers to offload to the GPU when loading a model. I also plan to make it automatically offload as many layers as can fit into the GPU's VRAM by default in one of the next few betas.
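As a concrete illustration, here is a minimal sketch of enabling a GPU backend and manually offloading layers with the v3 beta API described above. The model path and layer count are placeholder values, and exact option names may differ between beta versions:

```typescript
import {getLlama} from "node-llama-cpp";

// Explicitly request a GPU backend; per the reply above, automatic
// detection is planned for a future beta.
const llama = await getLlama({gpu: "cuda"}); // or "vulkan"

const model = await llama.loadModel({
    modelPath: "models/model.gguf", // placeholder path
    // Number of layers to offload to the GPU; placeholder value.
    // Automatic VRAM-based offloading is planned per the reply above.
    gpuLayers: 33
});
```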
-
Thanks for your response. Just a couple of clarifying questions:
-
Any update on this @giladgd?
-
Hi there, thanks for this great project! I use it here: https://github.com/reorproject/reor
I was wondering if you had any advice on the best approach to using the GPU (across OSes) in the context of an Electron app. So far, all I've implemented is sending 100 layers to the GPU on ARM Macs (sketched below), but users are asking for GPU support on Windows and Linux, so I was interested in how you'd approach this, particularly whether I should build node-llama-cpp with CUDA or not.
The issues are these if you're curious: reorproject/reor#31 & reorproject/reor#79
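For reference, a minimal sketch of the setup described in this question, extended along the lines of the maintainer's reply above. The platform detection, backend selection, and layer counts are assumptions for illustration, not reor's actual code:

```typescript
import {getLlama} from "node-llama-cpp";

// Per the question above, full offload currently happens on Apple
// Silicon only; on other platforms this sketch requests Vulkan, as
// suggested in the reply. The detection logic here is an assumption.
const isAppleSilicon =
    process.platform === "darwin" && process.arch === "arm64";

const llama = await getLlama(
    isAppleSilicon
        ? {} // Metal is used by default on Apple Silicon
        : {gpu: "vulkan"} // or "cuda"; illustrative choice
);

const model = await llama.loadModel({
    modelPath: "models/model.gguf", // placeholder path
    gpuLayers: isAppleSilicon ? 100 : 33 // 100 from the question; 33 illustrative
});
```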