ggml: fix gradient allocation logic #966

Conversation
I forgot: there is a similar issue when replacing the original gradient tensors during backwards graph construction when not using gradient accumulation. The original gradient tensors with …
Would it be simpler to add a flag to …
Do you mean skip their creation or skip their allocation?
The creation. It would be a flag such as …
That would work for eliminating the need for a tensor flag, but it would still require a change in the function for each GGML op, and I personally think it would be preferable not to add state to …
Generally I would agree that it is preferable to have pure functions that have no state, but this is a fairly simple state. I have some issues with this approach: …
I think that adding a …
How about this: remove the gradient logic from the forward pass construction completely and instead replace it with a pass over the forward graph in …
That sounds good to me. I am assuming that not too many operations would require specific handling (to exclude some of their parameters, I imagine), but either way that could be refactored in the future into objects (or types) that have all the details of an operation.
If we remove the gradient logic from the forward graph construction I think we should make removing …
Yes, storing …
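As a rough illustration of the idea discussed above (a single pass over the forward graph instead of per-op gradient logic), a minimal sketch might look like the following; the function name, the use of `ggml_dup_tensor`, and the placement as internal `ggml.c`-style code are assumptions for illustration, not the actual implementation in this PR:

```c
#include "ggml.h"

// Illustrative sketch only (not this PR's code): decide which forward-graph
// tensors need gradients with a single pass over gf->nodes, e.g. inside
// ggml_build_backward_expand. gf->nodes is in topological order, so
// src[j]->grad has already been set by the time node i is visited.
static void build_grads_from_forward_graph(struct ggml_context * ctx, struct ggml_cgraph * gf) {
    for (int i = 0; i < gf->n_nodes; ++i) {
        struct ggml_tensor * node = gf->nodes[i];

        // parameters always receive gradients
        bool needs_grad = node->flags & GGML_TENSOR_FLAG_PARAM;

        // otherwise: gradients are needed if any source already has gradients
        // (a fuller version would also skip sources whose gradients cannot
        //  affect the output, cf. ignore_src in the diff below)
        for (int j = 0; j < GGML_MAX_SRC && !needs_grad; ++j) {
            if (node->src[j] != NULL && node->src[j]->grad != NULL) {
                needs_grad = true;
            }
        }

        if (needs_grad && node->grad == NULL) {
            node->grad = ggml_dup_tensor(ctx, node);
        }
    }
}
```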
```diff
diff --git a/src/ggml.c b/src/ggml.c
index 5307791..b9f84e9 100644
--- a/src/ggml.c
+++ b/src/ggml.c
@@ -7749,18 +7749,6 @@ struct ggml_tensor * ggml_opt_step_adamw(
return result;
}
-////////////////////////////////////////////////////////////////////////////////
-
-void ggml_set_param(struct ggml_context * ctx, struct ggml_tensor * tensor) {
- tensor->flags |= GGML_TENSOR_FLAG_PARAM;
-}
-
-void ggml_set_loss(struct ggml_tensor * tensor) {
- GGML_ASSERT(ggml_is_scalar(tensor));
- GGML_ASSERT(tensor->type == GGML_TYPE_F32);
- tensor->flags |= GGML_TENSOR_FLAG_LOSS;
-}
-
// ggml_compute_forward_dup
static void ggml_compute_forward_dup_same_cont(
@@ -18575,19 +18563,18 @@ void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph *
struct ggml_tensor * node = gf->nodes[i];
bool needs_grad = node->flags & GGML_TENSOR_FLAG_PARAM;
- bool ignore_src0 = false;
- bool ignore_src1 = false;
+ bool ignore_src[GGML_MAX_SRC] = {false};
switch (node->op) {
// gradients in node->src[0] for one reason or another have no effect on output gradients
case GGML_OP_IM2COL: // only used for its shape
case GGML_OP_IM2COL_BACK: // same as IM2COL
- ignore_src0 = true;
+ ignore_src[0] = true;
break;
case GGML_OP_UNARY: {
const enum ggml_unary_op uop = ggml_get_unary_op(node);
// SGN and STEP unary ops are piecewise constant
if (uop == GGML_UNARY_OP_SGN || uop == GGML_UNARY_OP_STEP) {
- ignore_src0 = true;
+ ignore_src[0] = true;
}
} break;
@@ -18596,20 +18583,14 @@ void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph *
case GGML_OP_GET_ROWS: // row indices not differentiable
case GGML_OP_GET_ROWS_BACK: // same as for GET_ROWS
case GGML_OP_ROPE: // positions not differentiable
- ignore_src1 = true;
+ ignore_src[1] = true;
break;
default:
break;
}
for (int j = 0; j < GGML_MAX_SRC; ++j) {
- if (j == 0 && ignore_src0) {
- continue;
- }
- if (j == 1 && ignore_src1) {
- continue;
- }
- if (!node->src[j] || !node->src[j]->grad) {
+ if (!node->src[j] || !node->src[j]->grad || ignore_src[j]) {
continue;
}
GGML_ASSERT(node->src[j]->type == GGML_TYPE_F32 || node->src[j]->type == GGML_TYPE_F16);
@@ -21582,6 +21563,17 @@ void ggml_set_output(struct ggml_tensor * tensor) {
tensor->flags |= GGML_TENSOR_FLAG_OUTPUT;
}
+void ggml_set_param(struct ggml_context * ctx, struct ggml_tensor * tensor) {
+ GGML_UNUSED(ctx); // TODO: remove this parameter
+ tensor->flags |= GGML_TENSOR_FLAG_PARAM;
+}
+
+void ggml_set_loss(struct ggml_tensor * tensor) {
+ GGML_ASSERT(ggml_is_scalar(tensor));
+ GGML_ASSERT(tensor->type == GGML_TYPE_F32);
+ tensor->flags |= GGML_TENSOR_FLAG_LOSS;
+}
+
////////////////////////////////////////////////////////////////////////////////
void ggml_quantize_init(enum ggml_type type) {
```
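For context, a minimal usage sketch of the two relocated helpers (`ggml_set_param`, `ggml_set_loss`); the tensor shapes, the stand-in loss, and the graph setup below are illustrative assumptions, not code from this repository:

```c
#include "ggml.h"

// Hypothetical training-graph setup: mark the trainable weights as parameters
// and the final scalar as the loss before building the backward graph.
// Assumes ctx was created with enough memory for the tensors and graph.
static void build_example_graphs(struct ggml_context * ctx) {
    struct ggml_tensor * weights = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * inputs  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 16);

    ggml_set_param(ctx, weights); // weights receive gradients

    struct ggml_tensor * logits = ggml_mul_mat(ctx, weights, inputs);
    struct ggml_tensor * loss   = ggml_sum(ctx, logits); // stand-in for a real loss

    ggml_set_loss(loss); // must be a scalar F32 tensor, see the asserts above

    struct ggml_cgraph * gf = ggml_new_graph_custom(ctx, GGML_DEFAULT_GRAPH_SIZE, /*grads =*/ true);
    ggml_build_forward_expand(gf, loss);
    // ... followed by ggml_build_backward_expand to construct the backward graph
}
```

Which tensors actually end up with gradient tensors attached is exactly the logic this PR changes.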
The argument …
I think the PR is currently at a good point for reviewing/merging.
Actually, after looking at the code, I think it will still work.
The `make -j && ./bin/test-backend-ops -o OPT_STEP_ADAMW` test segfaults:
```
Backend 2/2 (CUDA0)
  Backend name: CUDA0
  OPT_STEP_ADAMW(type=f32,ne=[10,5,4,3],alpha=1.000000,beta1=0.001000,beta2=0.900000,eps=0.999000,wd=0.000000): Segmentation fault (core dumped)
```
Investigating.
It's a relatively simple fix, see #974.
On master the general logic for determining whether a tensor should receive gradients is as follows: parameters are given gradients, and tensors where at least one source has gradients are also given gradients. This works correctly for the forward pass, but for the backward pass this logic is unfortunately incorrect, because for many operations the backward pass uses the same operations as the forward pass and also uses tensors from the forward pass as sources. As a consequence, gradients are determined to also need gradients, and this propagates through the rest of the backward pass. The result is that with the code on master a lot of extra tensors are created and allocated that are not actually needed for anything. With code making use of `ggml_backend_sched` there is no excessive memory allocation, because only the tensors in a specific graph are allocated, but the correctly allocated tensors have pointers to unallocated tensors, which then causes problems with `ggml_graph_reset`.

I think the correct way to fix these problems is to change the logic for determining whether or not a tensor should receive gradients upon creation. First, explicitly mark gradients as such with a tensor flag. During the backwards pass at least one of the source tensors will be the gradients of another tensor, so in those cases gradients for the newly created tensor are never added. Otherwise, use the same logic as on master where gradients are added if at least one source has gradients.

Unfortunately the logic on master is currently duplicated for each GGML op, so the above change requires changing a large number of lines in `ggml.c`. I wrote a small utility function `ggml_set_grad` that can be applied after tensor creation to add gradients, since the logic should be the same regardless of the specific GGML op. This function also asserts that the operation is not in-place, since this is currently not being handled correctly (on master the combination of in-place operations and gradients sometimes causes a failed assert and sometimes just discards the gradients). Note that even without the changes in this PR, a function like `ggml_set_grad` will likely become necessary in the future anyway for specifying different data types for gradients and weights.

While going through the code I also fixed the formatting as best as I could.
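A rough sketch of what a `ggml_set_grad`-style helper could look like, based on the description above; the flag `GGML_TENSOR_FLAG_GRAD_SKETCH` is a made-up name for illustration, the in-place assertion is omitted, and none of this is necessarily the code in this PR:

```c
#include "ggml.h"

// Hypothetical flag marking tensors that are themselves gradients; ggml.h does
// not define this, it exists only for the purposes of this sketch.
#define GGML_TENSOR_FLAG_GRAD_SKETCH (1 << 30)

// Sketch of a ggml_set_grad-style helper, applied once after tensor creation
// instead of duplicating the gradient logic in every GGML op.
// (The actual helper described above additionally asserts that the operation
//  is not in-place; that check is omitted here.)
static void ggml_set_grad_sketch(struct ggml_context * ctx, struct ggml_tensor * tensor) {
    bool any_src_has_grad = false;

    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        const struct ggml_tensor * src = tensor->src[i];
        if (src == NULL) {
            continue;
        }
        // a source that is itself a gradient means we are constructing the
        // backward pass: never add gradients to gradients
        if (src->flags & GGML_TENSOR_FLAG_GRAD_SKETCH) {
            return;
        }
        any_src_has_grad = any_src_has_grad || (src->grad != NULL);
    }

    // same rule as on master otherwise: parameters and tensors with at least
    // one gradient-carrying source receive gradients
    if ((tensor->flags & GGML_TENSOR_FLAG_PARAM) || any_src_has_grad) {
        tensor->grad = ggml_dup_tensor(ctx, tensor);
        tensor->grad->flags |= GGML_TENSOR_FLAG_GRAD_SKETCH;
    }
}
```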