ggml: fix gradient allocation logic #966

Conversation
I forgot: there is a similar issue when replacing the original gradient tensors during backwards graph construction when not using gradient accumulation. The original gradient tensors with …
Would it be simpler to add a flag to …
Do you mean skip their creation or skip their allocation?
The creation. It would be a flag such as …
That would work for eliminating the need for a tensor flag, but it would still require a change in the function for each GGML op, and I personally think it would be preferable not to add state to …
Generally I would agree that it is preferable to have pure functions that have no state, but this is a fairly simple state. I have some issues with this approach: …
I think that adding a …
How about this: remove the gradient logic from the forward pass construction completely and instead replace it with a pass over the forward graph in …
That sounds good to me. I am assuming that not too many operations would require specific handling (to exclude some of their parameters, I imagine), but either way that could be refactored in the future into objects (or types) that have all the details of an operation.
If we remove the gradient logic from the forward graph construction I think we should make removing …
Yes, storing …
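As a rough illustration of the idea discussed above (a single pass over the forward graph instead of per-op gradient logic), a minimal sketch might look like the following; the function name, the use of `ggml_dup_tensor`, and the placement as internal `ggml.c`-style code are assumptions for illustration, not the actual implementation in this PR:

```c
#include "ggml.h"

// Illustrative sketch only (not this PR's code): decide which forward-graph
// tensors need gradients with a single pass over gf->nodes, e.g. inside
// ggml_build_backward_expand. gf->nodes is in topological order, so
// src[j]->grad has already been set by the time node i is visited.
static void build_grads_from_forward_graph(struct ggml_context * ctx, struct ggml_cgraph * gf) {
    for (int i = 0; i < gf->n_nodes; ++i) {
        struct ggml_tensor * node = gf->nodes[i];

        // parameters always receive gradients
        bool needs_grad = node->flags & GGML_TENSOR_FLAG_PARAM;

        // otherwise: gradients are needed if any source already has gradients
        // (a fuller version would also skip sources whose gradients cannot
        //  affect the output, cf. ignore_src in the diff below)
        for (int j = 0; j < GGML_MAX_SRC && !needs_grad; ++j) {
            if (node->src[j] != NULL && node->src[j]->grad != NULL) {
                needs_grad = true;
            }
        }

        if (needs_grad && node->grad == NULL) {
            node->grad = ggml_dup_tensor(ctx, node);
        }
    }
}
```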
```diff
diff --git a/src/ggml.c b/src/ggml.c
index 5307791..b9f84e9 100644
--- a/src/ggml.c
+++ b/src/ggml.c
@@ -7749,18 +7749,6 @@ struct ggml_tensor * ggml_opt_step_adamw(
return result;
}
-////////////////////////////////////////////////////////////////////////////////
-
-void ggml_set_param(struct ggml_context * ctx, struct ggml_tensor * tensor) {
- tensor->flags |= GGML_TENSOR_FLAG_PARAM;
-}
-
-void ggml_set_loss(struct ggml_tensor * tensor) {
- GGML_ASSERT(ggml_is_scalar(tensor));
- GGML_ASSERT(tensor->type == GGML_TYPE_F32);
- tensor->flags |= GGML_TENSOR_FLAG_LOSS;
-}
-
// ggml_compute_forward_dup
static void ggml_compute_forward_dup_same_cont(
@@ -18575,19 +18563,18 @@ void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph *
struct ggml_tensor * node = gf->nodes[i];
bool needs_grad = node->flags & GGML_TENSOR_FLAG_PARAM;
- bool ignore_src0 = false;
- bool ignore_src1 = false;
+ bool ignore_src[GGML_MAX_SRC] = {false};
switch (node->op) {
// gradients in node->src[0] for one reason or another have no effect on output gradients
case GGML_OP_IM2COL: // only used for its shape
case GGML_OP_IM2COL_BACK: // same as IM2COL
- ignore_src0 = true;
+ ignore_src[0] = true;
break;
case GGML_OP_UNARY: {
const enum ggml_unary_op uop = ggml_get_unary_op(node);
// SGN and STEP unary ops are piecewise constant
if (uop == GGML_UNARY_OP_SGN || uop == GGML_UNARY_OP_STEP) {
- ignore_src0 = true;
+ ignore_src[0] = true;
}
} break;
@@ -18596,20 +18583,14 @@ void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph *
case GGML_OP_GET_ROWS: // row indices not differentiable
case GGML_OP_GET_ROWS_BACK: // same as for GET_ROWS
case GGML_OP_ROPE: // positions not differentiable
- ignore_src1 = true;
+ ignore_src[1] = true;
break;
default:
break;
}
for (int j = 0; j < GGML_MAX_SRC; ++j) {
- if (j == 0 && ignore_src0) {
- continue;
- }
- if (j == 1 && ignore_src1) {
- continue;
- }
- if (!node->src[j] || !node->src[j]->grad) {
+ if (!node->src[j] || !node->src[j]->grad || ignore_src[j]) {
continue;
}
GGML_ASSERT(node->src[j]->type == GGML_TYPE_F32 || node->src[j]->type == GGML_TYPE_F16);
@@ -21582,6 +21563,17 @@ void ggml_set_output(struct ggml_tensor * tensor) {
tensor->flags |= GGML_TENSOR_FLAG_OUTPUT;
}
+void ggml_set_param(struct ggml_context * ctx, struct ggml_tensor * tensor) {
+ GGML_UNUSED(ctx); // TODO: remove this parameter
+ tensor->flags |= GGML_TENSOR_FLAG_PARAM;
+}
+
+void ggml_set_loss(struct ggml_tensor * tensor) {
+ GGML_ASSERT(ggml_is_scalar(tensor));
+ GGML_ASSERT(tensor->type == GGML_TYPE_F32);
+ tensor->flags |= GGML_TENSOR_FLAG_LOSS;
+}
+
////////////////////////////////////////////////////////////////////////////////
void ggml_quantize_init(enum ggml_type type) {
```
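For context, a minimal usage sketch of the two relocated helpers (`ggml_set_param`, `ggml_set_loss`); the tensor shapes, the stand-in loss, and the graph setup below are illustrative assumptions, not code from this repository:

```c
#include "ggml.h"

// Hypothetical training-graph setup: mark the trainable weights as parameters
// and the final scalar as the loss before building the backward graph.
// Assumes ctx was created with enough memory for the tensors and graph.
static void build_example_graphs(struct ggml_context * ctx) {
    struct ggml_tensor * weights = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * inputs  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 16);

    ggml_set_param(ctx, weights); // weights receive gradients

    struct ggml_tensor * logits = ggml_mul_mat(ctx, weights, inputs);
    struct ggml_tensor * loss   = ggml_sum(ctx, logits); // stand-in for a real loss

    ggml_set_loss(loss); // must be a scalar F32 tensor, see the asserts above

    struct ggml_cgraph * gf = ggml_new_graph_custom(ctx, GGML_DEFAULT_GRAPH_SIZE, /*grads =*/ true);
    ggml_build_forward_expand(gf, loss);
    // ... followed by ggml_build_backward_expand to construct the backward graph
}
```

Which tensors actually end up with gradient tensors attached is exactly the logic this PR changes.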
The argument …
I think the PR is currently at a good point for reviewing/merging.
Actually, after looking at the code, I think it will still work.
The `make -j && ./bin/test-backend-ops -o OPT_STEP_ADAMW` test segfaults:
```
Backend 2/2 (CUDA0)
  Backend name: CUDA0
  OPT_STEP_ADAMW(type=f32,ne=[10,5,4,3],alpha=1.000000,beta1=0.001000,beta2=0.900000,eps=0.999000,wd=0.000000): Segmentation fault (core dumped)
```
Investigating.
It's a relatively simple fix, see #974.
On master the general logic for determining whether a tensor should receive gradients is as follows: parameters are given gradients, and tensors where at least one source has gradients are also given gradients. This works correctly for the forward pass, but for the backward pass this logic is unfortunately incorrect, because for many operations the backward pass uses the same operations as the forward pass and also uses tensors from the forward pass as sources. As a consequence, gradients are determined to also need gradients, and this propagates through the rest of the backward pass. The result is that with the code on master a lot of extra tensors are created and allocated that are not actually needed for anything. With code making use of `ggml_backend_sched` there is no excessive memory allocation, because only the tensors in a specific graph are allocated, but the correctly allocated tensors have pointers to unallocated tensors, which then causes problems with `ggml_graph_reset`.

I think the correct way to fix these problems is to change the logic for determining whether or not a tensor should receive gradients upon creation. First, explicitly mark gradients as such with a tensor flag. During the backwards pass at least one of the source tensors will be the gradients of another tensor, so in those cases gradients for the newly created tensor are never added. Otherwise, use the same logic as on master where gradients are added if at least one source has gradients.

Unfortunately the logic on master is currently duplicated for each GGML op, so the above change requires changing a large number of lines in `ggml.c`. I wrote a small utility function `ggml_set_grad` that can be applied after tensor creation to add gradients, since the logic should be the same regardless of the specific GGML op. This function also asserts that the operation is not in-place, since this is currently not being handled correctly (on master the combination of in-place operations and gradients sometimes causes a failed assert and sometimes just discards the gradients). Note that even without the changes in this PR, a function like `ggml_set_grad` will likely become necessary in the future anyway for specifying different data types for gradients and weights.

While going through the code I also fixed the formatting as best as I could.
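A rough sketch of what a `ggml_set_grad`-style helper could look like, based on the description above; the flag `GGML_TENSOR_FLAG_GRAD_SKETCH` is a made-up name for illustration, the in-place assertion is omitted, and none of this is necessarily the code in this PR:

```c
#include "ggml.h"

// Hypothetical flag marking tensors that are themselves gradients; ggml.h does
// not define this, it exists only for the purposes of this sketch.
#define GGML_TENSOR_FLAG_GRAD_SKETCH (1 << 30)

// Sketch of a ggml_set_grad-style helper, applied once after tensor creation
// instead of duplicating the gradient logic in every GGML op.
// (The actual helper described above additionally asserts that the operation
//  is not in-place; that check is omitted here.)
static void ggml_set_grad_sketch(struct ggml_context * ctx, struct ggml_tensor * tensor) {
    bool any_src_has_grad = false;

    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        const struct ggml_tensor * src = tensor->src[i];
        if (src == NULL) {
            continue;
        }
        // a source that is itself a gradient means we are constructing the
        // backward pass: never add gradients to gradients
        if (src->flags & GGML_TENSOR_FLAG_GRAD_SKETCH) {
            return;
        }
        any_src_has_grad = any_src_has_grad || (src->grad != NULL);
    }

    // same rule as on master otherwise: parameters and tensors with at least
    // one gradient-carrying source receive gradients
    if ((tensor->flags & GGML_TENSOR_FLAG_PARAM) || any_src_has_grad) {
        tensor->grad = ggml_dup_tensor(ctx, tensor);
        tensor->grad->flags |= GGML_TENSOR_FLAG_GRAD_SKETCH;
    }
}
```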