cuBLAS: keep the weights in VRAM when possible #1269
Currently, the weights are uploaded to VRAM every time they are used. This usually represents ~30% of the time spent in most mat muls with cuBLAS.

This could be improved by keeping the weights in VRAM. However, there are two issues blocking this:

1. The CUDA code needs a way to identify which tensors are the weights, i.e. constant tensors that are not the result of any operation.
2. The CUDA code needs a way to know when the weights have been modified, so that the copy in VRAM can be refreshed.

Issue 1 can be solved in a not very clean way with #1268 by looking at the matrices that have names ending in ".weight". We could also add some kind of `is_weight` flag to `ggml_tensor`, but that would be more intrusive.

Issue 2 is more complicated. We would need either to add some kind of `is_dirty` flag to `ggml_tensor` that would be set automatically by the operations that modify a tensor (such as `ggml_cpy` and the `_inplace` ops), or to add a global flag to `ggml-cuda` that would trigger a full weight re-upload and that would need to be set by the application whenever the weights change.

The `is_dirty` flag would be more intrusive to ggml, since we would essentially be adding GPU-specific details to ggml, but it would be automatic for downstream users. The global CUDA-only flag would be less intrusive to ggml, but would force users to deal with this manually.

Any thoughts about this? What would be the best way to add this to ggml without interfering with the general goal of not adding GPU-specific details to ggml?
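As a rough sketch of how the caching could look on the `ggml-cuda` side (the struct and function names below are hypothetical, and `is_dirty` is the proposed flag rather than an existing `ggml_tensor` field):

```c
#include <stdbool.h>
#include <stddef.h>
#include <cuda_runtime.h>

// Simplified stand-in for ggml_tensor, with the proposed is_dirty flag added.
struct tensor_stub {
    void * data;      // weight data in host memory
    size_t nbytes;    // total size in bytes
    bool   is_dirty;  // proposed flag: host data changed since the last upload
};

// One cached VRAM copy per weight tensor (hypothetical bookkeeping).
struct weight_cache_entry {
    void * d_data;    // device (VRAM) copy, NULL until the first upload
};

// Upload the weights only when there is no cached copy yet or the host data
// was modified since the last upload; otherwise reuse the copy already in VRAM.
static void * cuda_get_weights(struct weight_cache_entry * e, struct tensor_stub * t, cudaStream_t stream) {
    bool needs_upload = t->is_dirty;
    if (e->d_data == NULL) {
        cudaMalloc(&e->d_data, t->nbytes);
        needs_upload = true;
    }
    if (needs_upload) {
        cudaMemcpyAsync(e->d_data, t->data, t->nbytes, cudaMemcpyHostToDevice, stream);
        t->is_dirty = false;  // the cached copy is now up to date
    }
    return e->d_data;
}
```

With something along these lines, the host-to-device copy happens only on the first use of a weight or after it has been marked dirty, instead of on every mat mul.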
Comments
Issue 1 should be solved by checking if the tensor's `op` is `GGML_OP_NONE`. This indicates that the tensor is not the result of some operator. In theory, it can be an optimization parameter (this is not relevant for inference, but we should check anyway); in that case, the `is_param` flag will be set. Therefore, the CUDA code should check `op == GGML_OP_NONE && !is_param`. Will think more about issue 2.
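A sketch of that check, assuming the existing `op` and `is_param` fields of `ggml_tensor` (the helper name is made up):

```c
#include "ggml.h"

// A leaf tensor that is not a trainable parameter is treated as a constant
// weight that could be cached in VRAM (hypothetical helper, not ggml API).
static bool tensor_is_weight(const struct ggml_tensor * t) {
    return t->op == GGML_OP_NONE && !t->is_param;
}
```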
I think this would work for llama.cpp, but it may fail with tensors that are outside the graph yet not constant, for example the k/v cache. In our case it should still work because the k/v tensors go through some view/reshape/permute operations, so by the time they reach the mat mul they are no longer interpreted as constants. But I think it wouldn't be reliable in every case.
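For illustration, a small standalone sketch (sizes are arbitrary) of why a view into the k/v cache is not picked up by the `op == GGML_OP_NONE` check:

```c
#include <assert.h>
#include "ggml.h"

int main(void) {
    // small scratch context just for building the tensors
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * k_cache = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_tensor * k_view  = ggml_view_1d(ctx, k_cache, 256, 0);

    // k_cache is a leaf (op == GGML_OP_NONE), so it would look like a constant weight;
    // k_view is the result of an op (GGML_OP_VIEW), so the check skips it.
    assert(k_cache->op == GGML_OP_NONE);
    assert(k_view->op  != GGML_OP_NONE);

    ggml_free(ctx);
    return 0;
}
```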
This sounds exactly like what we need, but from what I see, …
Good point. We could add an enum value for this. An alternative solution is adding some way to determine if the tensor is constant, and updating the ops that modify tensors accordingly. Regarding issue 2: I am thinking about adding the `is_dirty` flag.
What if we set …
Could work, but you will still need the flag. Also, it is not obvious when you will clear it.
The flag would be cleared in the GPU code after uploading the weights to VRAM. The logic would be something like this:

    // re-upload only when there is no cached copy yet, or the tensor
    // (or the tensor that owns its data) has been modified
    if (!cached || t->is_dirty || (t->owner && t->owner->is_dirty)) {
        upload_tensor(t);
        t->is_dirty = false;
    }
Let's start with manually setting the flag.
Could it be done at a higher level, from llama.cpp? Just like it manages the scratch buffers or the KV cache. It also knows exactly in what order the weights are used, so it could start loading the next layer's weights ahead of time, or even do that at the end: load the first layers' weights again for the next evaluation.
llama.cpp builds a graph which is then executed by ggml; it is not synchronous, so these instructions to "prefetch" the next layer's weights would have to be in the graph, which would require adding GPU-specific ops to ggml. Eventually, I think the fastest approach would be hybrid GPU/CPU inference, in which some layers are processed entirely on the GPU (as many as the available VRAM allows) and the rest on the CPU. But currently that would have to be in a fork.
Prefetching data could be useful in other environments as well: on low-RAM devices, fetch from disk; with low disk space, fetch from the network or a database. It could also be used to dequantize formats where the whole tensor has to be processed instead of one row at a time. It's just a thought I had.
I think it's a good idea, I am just not sure if it would fit in ggml as long as the goal is to keep GPU stuff separated.