CUDA: always create events for split buffers #10185
base: master
Conversation
Force-pushed bde4116 to 38d11f5
Qwen2.5-0.5B does not work with this change alone; it still crashes in the memcpy later.

It would also be possible to prevent using a split buffer entirely if the matrix is too small by returning `false`:
Force-pushed 38d11f5 to e151321
LCPP PR by Johannes Gaessler.
ggml/src/ggml-cuda.cu (Outdated)
```cpp
// only use row split if the weight matrix is large enough for every GPU to get data (this solves some edge cases)
// also for small matrices the overhead is very large anyways so splitting is slow
if (a->buffer && ggml_backend_buft_is_cuda_split(a->buffer->buft)) {
    ggml_backend_cuda_split_buffer_type_context * buft_ctx = (ggml_backend_cuda_split_buffer_type_context *) a->buffer->buft->context;
    int64_t active_devices = 0;
    for (int id = 0; id < ggml_backend_cuda_get_device_count(); ++id) {
        int64_t row_low;
        int64_t row_high;
        get_row_split(&row_low, &row_high, a, buft_ctx->tensor_split, id);
        active_devices += row_low == row_high;
    }
    const int64_t rounding = get_row_rounding(buft_ctx->tensor_split);
    if (rounding*active_devices < a->ne[1]) {
        return false;
    }
}
```
This seems too expensive to do in this function, since this is called many times during inference by `ggml_backend_sched`. I think it should be possible to compute the minimum tensor size in `ggml_backend_cuda_split_buffer_type`, store it in `ggml_backend_cuda_split_buffer_type_context`, and then this function would only need to compare the tensor size to this value.
Force-pushed e151321 to 84bcad6
Fixes #10176 .
I think the correct way to fix it is to just create the events unconditionally. Regardless of how the data is split, you always need the events on the currently active device for the other devices to wait on. You could maybe reduce the number of events by only initializing those that are actually needed, but I don't think that would be worthwhile: for the vast majority of use cases all events are already being created and used anyway.
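A minimal sketch of the synchronization pattern this relies on (not the actual llama.cpp code; the stream and device setup here is simplified): every device gets an event up front, the active device records its event on its stream, and the other devices' streams wait on that event before proceeding. If an event for some device is never created, any split that later needs it crashes, which is why creating them unconditionally is the safe choice.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

#define CUDA_CHECK(call) do { cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
        return 1; } } while (0)

int main() {
    int device_count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));

    // Create one event and one stream per device unconditionally, mirroring
    // the fix: even a device that receives no rows may need its event to
    // exist so that other devices can wait on it.
    std::vector<cudaEvent_t>  events(device_count);
    std::vector<cudaStream_t> streams(device_count);
    for (int id = 0; id < device_count; ++id) {
        CUDA_CHECK(cudaSetDevice(id));
        CUDA_CHECK(cudaEventCreateWithFlags(&events[id], cudaEventDisableTiming));
        CUDA_CHECK(cudaStreamCreate(&streams[id]));
    }

    // The currently active device (0 here) records its event; the streams of
    // all other devices wait on it before touching any shared result.
    CUDA_CHECK(cudaSetDevice(0));
    CUDA_CHECK(cudaEventRecord(events[0], streams[0]));
    for (int id = 1; id < device_count; ++id) {
        CUDA_CHECK(cudaSetDevice(id));
        CUDA_CHECK(cudaStreamWaitEvent(streams[id], events[0], 0));
    }

    // Drain and clean up.
    for (int id = 0; id < device_count; ++id) {
        CUDA_CHECK(cudaSetDevice(id));
        CUDA_CHECK(cudaStreamSynchronize(streams[id]));
        CUDA_CHECK(cudaStreamDestroy(streams[id]));
        CUDA_CHECK(cudaEventDestroy(events[id]));
    }
    printf("ok\n");
    return 0;
}
```

Note that `cudaStreamWaitEvent` is explicitly allowed to wait on an event recorded on a different device, which is what makes this cross-GPU ordering work without host-side synchronization.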