Description
Currently, the weights are uploaded to VRAM every time they are used. This usually represents ~30% of the time spent in most mat muls with cuBLAS.
This could be improved by keeping the weights in VRAM; however, there are two issues blocking this:
- We need to be able to identify with certainty which tensors are constant / weights
- We need to be able to identify when the weights change. With llama.cpp, this can happen after applying a LoRA
Issue 1 can be solved, if not very cleanly, with #1268 by looking at the matrices whose names end in `.weight`. We could also add some kind of `is_weight` flag to `ggml_tensor`, but that would be more intrusive.
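The name-based heuristic for issue 1 could be sketched as follows. This is illustrative only: `tensor_name_is_weight` is a hypothetical helper, and the `.weight` suffix convention is an assumption about how llama.cpp names its weight tensors.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: decide whether a tensor holds constant weights
 * by checking its name for the ".weight" suffix, as suggested for #1268. */
static bool tensor_name_is_weight(const char *name) {
    const char *suffix = ".weight";
    size_t n = strlen(name);
    size_t s = strlen(suffix);
    return n >= s && strcmp(name + n - s, suffix) == 0;
}
```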
Issue 2 is more complicated. We would need either to add some kind of `is_dirty` flag to `ggml_tensor` that would be set automatically by the operations that modify a tensor (such as `ggml_cpy` and the `_inplace` ops), or to add a global flag to `ggml-cuda` that triggers a full weight re-upload, which the application would need to set whenever the weights change.
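The global-flag alternative could look roughly like this. A generation counter avoids needing a per-tensor clear pass: the application bumps it after modifying weights (e.g. after applying a LoRA), and each cached copy compares its own generation against the global one to decide whether to re-upload. None of these names exist in ggml; they are purely illustrative.

```c
/* Hypothetical sketch of the global CUDA-only flag, implemented as a
 * generation counter rather than a plain bool so that each cached
 * weight can lazily detect staleness without a global clear step. */
static int g_cuda_weights_generation = 0;

/* The application would call this after weights change. */
void ggml_cuda_invalidate_weights(void) {
    g_cuda_weights_generation++;
}

struct cached_weight {
    int uploaded_generation; /* generation at last upload, -1 = never uploaded */
};

/* The backend checks this before each mat mul that uses a cached weight. */
int cached_weight_needs_upload(const struct cached_weight *w) {
    return w->uploaded_generation != g_cuda_weights_generation;
}
```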
The `is_dirty` flag would be more intrusive to ggml, since essentially we would be adding GPU-specific details to ggml, but it would be automatic for downstream users. The global CUDA-only flag would be less intrusive to ggml, but would force users to deal with this manually.
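For comparison, the `is_dirty` alternative would look roughly like the toy sketch below: ops that write to a tensor set the flag, and the CUDA backend re-uploads and clears it. The `toy_*` struct and functions do not exist in ggml; they only illustrate the shape of the change.

```c
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for ggml_tensor with the hypothetical is_dirty field. */
struct toy_tensor {
    float data[4];
    bool  is_dirty;
};

/* Ops that modify a tensor in place (ggml_cpy, _inplace ops) would
 * mark it dirty automatically, invisible to downstream users. */
void toy_cpy(struct toy_tensor *dst, const float *src) {
    memcpy(dst->data, src, sizeof dst->data);
    dst->is_dirty = true;
}

/* The CUDA backend re-uploads only dirty tensors, then clears the flag.
 * Returns true if an upload happened. */
bool toy_cuda_maybe_upload(struct toy_tensor *t) {
    if (!t->is_dirty) return false; /* cached VRAM copy still valid */
    /* ... the cudaMemcpy to VRAM would happen here ... */
    t->is_dirty = false;
    return true;
}
```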
Any thoughts about this? What would be the best way to add this to ggml, while not interfering with the general goal of not adding GPU-specific details to ggml?