cuBLAS: keep the weights in VRAM when possible #1269
Currently, the weights are uploaded to VRAM every time they are used. This usually represents ~30% of the time spent in most mat muls with cuBLAS.

This could be improved by keeping the weights in VRAM. However, there are two issues blocking this:

1. The CUDA code needs a way to identify which tensors are the weights, i.e. constant tensors that are not the result of any operation.
2. The CUDA code needs a way to know when the weights have been modified, so that the copy in VRAM can be refreshed.

Issue 1 can be solved in a not very clean way with #1268 by looking at the matrices that have names ending in ".weight". We could also add some kind of `is_weight` flag to `ggml_tensor`, but that would be more intrusive.

Issue 2 is more complicated. We would need either to add some kind of `is_dirty` flag to `ggml_tensor` that would be set automatically by the operations that modify a tensor (such as `ggml_cpy` and the `_inplace` ops), or to add a global flag to `ggml-cuda` that would trigger a full weight re-upload and that would need to be set by the application whenever the weights change.

The `is_dirty` flag would be more intrusive to ggml, since we would essentially be adding GPU-specific details to ggml, but it would be automatic for downstream users. The global CUDA-only flag would be less intrusive to ggml, but would force users to deal with this manually.

Any thoughts about this? What would be the best way to add this to ggml without interfering with the general goal of not adding GPU-specific details to ggml?
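As a rough sketch of how the caching could look on the `ggml-cuda` side (the struct and function names below are hypothetical, and `is_dirty` is the proposed flag rather than an existing `ggml_tensor` field):

```c
#include <stdbool.h>
#include <stddef.h>
#include <cuda_runtime.h>

// Simplified stand-in for ggml_tensor, with the proposed is_dirty flag added.
struct tensor_stub {
    void * data;      // weight data in host memory
    size_t nbytes;    // total size in bytes
    bool   is_dirty;  // proposed flag: host data changed since the last upload
};

// One cached VRAM copy per weight tensor (hypothetical bookkeeping).
struct weight_cache_entry {
    void * d_data;    // device (VRAM) copy, NULL until the first upload
};

// Upload the weights only when there is no cached copy yet or the host data
// was modified since the last upload; otherwise reuse the copy already in VRAM.
static void * cuda_get_weights(struct weight_cache_entry * e, struct tensor_stub * t, cudaStream_t stream) {
    bool needs_upload = t->is_dirty;
    if (e->d_data == NULL) {
        cudaMalloc(&e->d_data, t->nbytes);
        needs_upload = true;
    }
    if (needs_upload) {
        cudaMemcpyAsync(e->d_data, t->data, t->nbytes, cudaMemcpyHostToDevice, stream);
        t->is_dirty = false;  // the cached copy is now up to date
    }
    return e->d_data;
}
```

With something along these lines, the host-to-device copy happens only on the first use of a weight or after it has been marked dirty, instead of on every mat mul.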
Comments
Issue 1 should be solved by checking if the tensor's `op` is `GGML_OP_NONE`. This indicates that the tensor is not the result of some operator. In theory, it can be an optimization parameter (this is not relevant for inference, but we should check anyway); in that case, the `is_param` flag will be set. Therefore, the CUDA code should check `op == GGML_OP_NONE && !is_param`. Will think more about issue 2.
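A sketch of that check, assuming the existing `op` and `is_param` fields of `ggml_tensor` (the helper name is made up):

```c
#include "ggml.h"

// A leaf tensor that is not a trainable parameter is treated as a constant
// weight that could be cached in VRAM (hypothetical helper, not ggml API).
static bool tensor_is_weight(const struct ggml_tensor * t) {
    return t->op == GGML_OP_NONE && !t->is_param;
}
```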
I think this would work for llama.cpp, but it may fail with tensors that are outside the graph yet not constant, for example the k/v cache. In our case it should still work because the k/v tensors go through some view/reshape/permute operations, so by the time they reach the mat mul they are no longer interpreted as constants. But I think it wouldn't be reliable in every case.
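For illustration, a small standalone sketch (sizes are arbitrary) of why a view into the k/v cache is not picked up by the `op == GGML_OP_NONE` check:

```c
#include <assert.h>
#include "ggml.h"

int main(void) {
    // small scratch context just for building the tensors
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * k_cache = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_tensor * k_view  = ggml_view_1d(ctx, k_cache, 256, 0);

    // k_cache is a leaf (op == GGML_OP_NONE), so it would look like a constant weight;
    // k_view is the result of an op (GGML_OP_VIEW), so the check skips it.
    assert(k_cache->op == GGML_OP_NONE);
    assert(k_view->op  != GGML_OP_NONE);

    ggml_free(ctx);
    return 0;
}
```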
This sounds exactly like what we need, but from what I see, …
Good point. We could add an enum value for this. An alternative solution is adding some way to determine if the tensor is constant, and updating the ops that modify tensors accordingly. Regarding issue 2: I am thinking about adding the `is_dirty` flag.
What if we set …
Could work, but you will still need the flag. Also, it is not obvious when you will clear it.
The flag would be cleared in the GPU code after uploading the weights to VRAM. The logic would be something like this:

    // re-upload only when there is no cached copy yet, or the tensor
    // (or the tensor that owns its data) has been modified
    if (!cached || t->is_dirty || (t->owner && t->owner->is_dirty)) {
        upload_tensor(t);
        t->is_dirty = false;
    }
Let's start with manually setting the flag.
Could it be done at a higher level, from llama.cpp? Just like it manages the scratch buffers or the KV cache. It also knows exactly in what order the weights are used, so it could start loading the next layer's weights ahead of time, or even do that at the end: load the first layers' weights again for the next evaluation.
llama.cpp builds a graph which is then executed by ggml; it is not synchronous, so these instructions to "prefetch" the next layer's weights would have to be in the graph, which would require adding GPU-specific ops to ggml. Eventually, I think the fastest approach would be hybrid GPU/CPU inference, in which some layers are processed entirely on the GPU (as many as the available VRAM allows) and the rest on the CPU. But currently that would have to be in a fork.
Prefetching data could be useful in other environments as well: on low-RAM devices, fetch from disk; with low disk space, fetch from the network or a database. It could also be used to dequantize formats where the whole tensor has to be processed instead of one row at a time. It's just a thought I had.
I think it's a good idea, I am just not sure if it would fit in ggml as long as the goal is to keep GPU stuff separated.