
cuBLAS: keep the weights in VRAM when possible #1269

Closed

@slaren

Currently, the weights are uploaded to VRAM every time they are used. This usually represents ~30% of the time spent in most mat muls with cuBLAS.

This could be improved by keeping the weights in VRAM, but there are two issues blocking this:

  1. We need to be able to identify with certainty which tensors are constant (i.e. weights).
  2. We need to be able to identify when the weights change. With llama.cpp, this can happen after applying a LoRA.
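
For concreteness, the kind of device-side caching this would enable might look roughly like the sketch below. None of these names exist in ggml-cuda; the `weight_cache` helpers are hypothetical and only illustrate uploading a constant tensor's data once and reusing the device pointer afterwards.

```c
// Hypothetical device-side weight cache (not existing ggml-cuda API):
// upload a constant tensor's data once, reuse the device pointer afterwards.
#include <cuda_runtime.h>
#include <stddef.h>

#define WEIGHT_CACHE_SIZE 1024

struct weight_cache_entry {
    const void * host;   // host data pointer, used as the cache key
    void       * device; // cached copy of the data in VRAM
    size_t       size;
};

static struct weight_cache_entry g_weight_cache[WEIGHT_CACHE_SIZE];
static int g_weight_cache_n = 0;

// Return a device pointer holding `size` bytes of `host`, uploading only on
// the first call for a given host pointer.
static void * weight_cache_get(const void * host, size_t size) {
    for (int i = 0; i < g_weight_cache_n; ++i) {
        if (g_weight_cache[i].host == host) {
            return g_weight_cache[i].device;
        }
    }
    void * device = NULL;
    cudaMalloc(&device, size);
    cudaMemcpy(device, host, size, cudaMemcpyHostToDevice);
    if (g_weight_cache_n < WEIGHT_CACHE_SIZE) {
        struct weight_cache_entry e = { host, device, size };
        g_weight_cache[g_weight_cache_n++] = e;
    }
    return device;
}
```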

Issue 1 can be solved, if not very cleanly, with #1268 by looking at the matrices whose names end in `.weight`. We could also add some kind of `is_weight` flag to `ggml_tensor`, but that would be more intrusive.
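
As a sketch of that workaround, checking the name suffix could be as simple as the following (assuming the tensor name from #1268 is available as a NUL-terminated string; `tensor_name_is_weight` is a hypothetical helper):

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: treat a tensor as a constant weight if its name
// (as introduced by #1268) ends in ".weight".
static bool tensor_name_is_weight(const char * name) {
    const char * suffix = ".weight";
    size_t name_len   = strlen(name);
    size_t suffix_len = strlen(suffix);
    return name_len >= suffix_len &&
           strcmp(name + name_len - suffix_len, suffix) == 0;
}
```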

Issue 2 is more complicated. We would need either to add some kind of `is_dirty` flag to `ggml_tensor` that would be set automatically by the operations that modify a tensor (such as `ggml_cpy` and the `_inplace` ops), or to add a global flag to ggml-cuda that triggers a full weight re-upload and that the application would have to set whenever the weights change.
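
To make the two options concrete, here is a rough sketch of both; none of these fields or functions exist in ggml today, they are purely illustrative:

```c
#include <stdbool.h>

// Option A: a per-tensor dirty flag. Ops that modify a tensor's data
// (e.g. ggml_cpy and the *_inplace ops) would set it; the CUDA backend
// would clear it after re-uploading the tensor.
struct ggml_tensor_dirty_ext {
    // ... existing ggml_tensor fields would go here ...
    bool is_dirty; // set when the host-side data has changed
};

// Option B: a single global flag in ggml-cuda, set by the application.
static bool g_cuda_weights_dirty = false;

// Hypothetical entry point the application would call after changing the
// weights (e.g. llama.cpp after applying a LoRA), forcing a full re-upload.
void ggml_cuda_invalidate_weights(void) {
    g_cuda_weights_dirty = true;
}
```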

The `is_dirty` flag would be more intrusive to ggml, since we would essentially be adding GPU-specific details to ggml, but it would be automatic for downstream users. The global CUDA-only flag would be less intrusive to ggml, but would force users to deal with this manually.
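
Continuing the sketches above, either scheme would plug into the cache lookup in roughly the same place (again, all names are hypothetical):

```c
// Re-upload a cached weight if it changed on the host, otherwise reuse it.
// The caller would clear g_cuda_weights_dirty once after refreshing all
// weights; it is left set here so other cached tensors also get refreshed.
static void * weight_cache_get_checked(const void * host, size_t size, bool is_dirty) {
    for (int i = 0; i < g_weight_cache_n; ++i) {
        if (g_weight_cache[i].host == host) {
            if (is_dirty || g_cuda_weights_dirty) {
                cudaMemcpy(g_weight_cache[i].device, host, size, cudaMemcpyHostToDevice);
            }
            return g_weight_cache[i].device;
        }
    }
    return weight_cache_get(host, size); // not cached yet: upload as usual
}
```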

Any thoughts about this? What would be the best way to add this to ggml, while not interfering with the general goal of not adding GPU-specific details to ggml?
