First rough draft of recoverable errors feature. #1325

KerfuffleV2 · 2023-05-04T20:53:00Z

Unfortunately (like always) this turned out to be more complicated than I initially expected.

At first I was planning to make it possible to acknowledge and clear an error, but there are some cases where recovery seems impossible. My map functions are one example: they have to create an extra tensor, and because there's no way to free a tensor, clearing the error would still leave that extra tensor allocated.

I probably need to scrap that plan and say "Once an error occurs, you need to free that ggml_context and start over". I think that's fine. (Right now those specific functions will abort if tensor creation fails after the first step, but if the context is simply considered dead, returning NULL will be okay.)

Also, the way I implemented this, if you ignore an error and just keep trying to use the context, it will abort.

Right now, this just adds the basic scaffolding. I also had to add NULL propagation to basically all functions that make a tensor. This could be simplified with a #define instead of a bunch of if (blah == NULL) return NULL; statements, but for this sketch I'm keeping it simple.
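
As a rough illustration of the NULL propagation described above, and of the kind of macro that could replace the repeated checks, here is a minimal self-contained sketch. The names toy_tensor, toy_add and TOY_RETURN_NULL_IF_NULL are hypothetical and not part of ggml; the real functions take a ggml_context and allocate from it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a tensor; not ggml's real struct. */
struct toy_tensor { int value; };

/* Hypothetical helper: leave the enclosing function with NULL as soon
 * as the given pointer is NULL. */
#define TOY_RETURN_NULL_IF_NULL(p) \
    do { if ((p) == NULL) return NULL; } while (0)

/* An op with two inputs. 'out' is caller-provided storage so the sketch
 * needs no allocator. */
static struct toy_tensor * toy_add(struct toy_tensor * a,
                                   struct toy_tensor * b,
                                   struct toy_tensor * out) {
    TOY_RETURN_NULL_IF_NULL(a);
    TOY_RETURN_NULL_IF_NULL(b);
    TOY_RETURN_NULL_IF_NULL(out);
    out->value = a->value + b->value;
    return out;
}

/* Usage: a failed inner op (NULL) makes the outer op return NULL too. */
void toy_null_propagation_demo(void) {
    struct toy_tensor a = {1}, b = {2}, t1 = {0}, t2 = {0};
    assert(toy_add(&a, &b, &t1) == &t1 && t1.value == 3);
    assert(toy_add(toy_add(NULL, &b, &t1), &b, &t2) == NULL);
}
```

With this shape a caller only has to check the final result of a chain of ops, since any earlier failure surfaces as NULL at the end.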

This converts the asserts in ggml_add_impl and ggml_new_tensor_impl to be recoverable. There's also code cleanup, etc. that would need to be done before this is actually ready to become a real pull request.

Right now GGML_RECOVERABLE_ERRORS is just #defined for testing. I have verified that it compiles (on Linux at least) and can run a model. I haven't tested triggering the asserts yet.

Before I continue with the rest, I want to make sure that I'm generally on the right track. What do you think?

KerfuffleV2 · 2023-05-04T21:06:04Z

There are two possible alternative approaches instead of just always marking the context as dead if an error occurs:

1. Have a special permanent flag that can't be cleared, and allow most errors to be recovered from. You could still just get unlucky and have something like ggml_map_unary_f32 succeed in grabbing the addr tensor and then fail on the map tensor itself.
2. There are only a couple of places where this is a problem. Maybe there's a way to pre-check all the conditions up front, so that the operation fails before it does anything that can't be reversed.

Those are technically possible (and I'd be willing to work on it if you have strong feelings), but personally I think it's probably not worth making things excessively complicated. I don't like the first option because it's basically random whether you get screwed over, and there needs to be special handling to deal with that case.

j-f1 · 2023-05-05T00:41:38Z

ggml.h

+    do { \
+        if (!(x)) { \
+            printf(__VA_ARGS__); \
+            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \


Could these be defined as, e.g.,

Suggested change

-            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
+            fprintf(stderr, "GGML_ASSERT: %s:%d: %s: %s\n", __FILE__, __LINE__, __func__, #x); \

? That way, asserts wouldn't have to handle the function name boilerplate.
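
To make the effect of the suggestion concrete, here is a small sketch that uses the same format string with __func__ (a standard C99 predefined identifier). TOY_CHECK and toy_check_dims are hypothetical names; unlike the real macro, this version writes the message into a caller-supplied buffer instead of printing and aborting, purely so the result can be inspected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical check macro: on failure, format the message exactly as in
 * the suggested fprintf, but into 'buf' rather than stderr. */
#define TOY_CHECK(buf, size, x) \
    do { \
        if (!(x)) { \
            snprintf((buf), (size), "GGML_ASSERT: %s:%d: %s: %s\n", \
                     __FILE__, __LINE__, __func__, #x); \
        } \
    } while (0)

/* The call site passes only the condition; the function name is filled
 * in by __func__ inside the macro expansion. */
static void toy_check_dims(int n_dims, char * buf, size_t size) {
    buf[0] = '\0';
    TOY_CHECK(buf, size, n_dims >= 1);
}

/* Usage: the failing check names its enclosing function automatically. */
void toy_func_name_demo(void) {
    char buf[256];
    toy_check_dims(0, buf, sizeof buf);
    assert(strstr(buf, "toy_check_dims: n_dims >= 1") != NULL);
}
```

Because __func__ expands where the macro is used, each call site reports its own function without any extra argument.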

KerfuffleV2 · 2023-05-05T08:34:13Z

I just had an idea for an approach that's probably better (and also likely useful for other stuff too). Add a function to do the size calculation part without actually adding a tensor. ggml_new_tensor_impl could also just use that without having to duplicate code. I think there's also a clever way to make it possible to check if multiple tensors fit before committing to anything.

Something like:

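bool ggml_ensure_tensor_memory(
        struct ggml_context * ctx,
        enum   ggml_type type,
        int    n_dims,
        const int64_t * ne,
        size_t * ctx_required,
        size_t * scratch_required) {
    // etc: work out calculated_ctx_required and calculated_scratch_required
    // for this tensor without allocating anything

    // accumulate into the caller's running totals (set to 0 initially)
    *ctx_required     += calculated_ctx_required;
    *scratch_required += calculated_scratch_required;

    // true only if everything requested so far still fits
    return *ctx_required <= available_ctx && *scratch_required <= available_scratch;
}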

That way it would be possible to call the function repeatedly as long as it returned true and it would just update the required memory arguments (which would be set to 0 initially).

I actually like this idea, and I think it would make it pretty easy to turn the few functions that are currently hard to deal with into safely retryable ones.
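
A minimal sketch of how the accumulate-and-check pattern could be used to pre-check two tensors before creating either. Everything here is a toy under stated assumptions: toy_ctx, toy_ensure_tensor_memory, toy_map_would_fit and the 64-byte header size are made up for illustration, and ggml's real context and scratch-buffer accounting are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy context: only tracks capacity and bytes already used. */
struct toy_ctx {
    size_t ctx_size,     ctx_used;
    size_t scratch_size, scratch_used;
};

/* Adds the cost of one prospective tensor to the running totals and
 * reports whether everything requested so far would still fit.
 * Allocates nothing. Assumption: the header lives in context memory
 * and the data lives in scratch memory. */
static bool toy_ensure_tensor_memory(const struct toy_ctx * ctx,
                                     size_t header_bytes,
                                     size_t data_bytes,
                                     size_t * ctx_required,
                                     size_t * scratch_required) {
    assert(ctx->ctx_used <= ctx->ctx_size);
    assert(ctx->scratch_used <= ctx->scratch_size);
    *ctx_required     += header_bytes;
    *scratch_required += data_bytes;
    return *ctx_required     <= ctx->ctx_size     - ctx->ctx_used
        && *scratch_required <= ctx->scratch_size - ctx->scratch_used;
}

/* Pre-check both tensors a map-style op needs before committing to any. */
bool toy_map_would_fit(const struct toy_ctx * ctx,
                       size_t addr_bytes, size_t result_bytes) {
    const size_t header = 64; /* made-up per-tensor header size */
    size_t ctx_req = 0, scratch_req = 0;
    return toy_ensure_tensor_memory(ctx, header, addr_bytes,
                                    &ctx_req, &scratch_req)
        && toy_ensure_tensor_memory(ctx, header, result_bytes,
                                    &ctx_req, &scratch_req);
}
```

If the combined check fails, nothing has been allocated yet, so the caller can report the error and the context is left exactly as it was.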

ggerganov · 2023-05-06T06:29:51Z

ggml.c

@@ -11741,6 +11816,9 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
 }

 void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph) {
+#ifdef GGML_RECOVERABLE_ERRORS
+    GGML_ASSERT(ctx->last_error_code == GGML_ERRCODE_SUCCESS);
+#endif


Don't think this is needed

KerfuffleV2 · 2023-05-06

The way it currently works is that an error status is set in the context when an error occurs, and the operation that hit the error did not complete successfully.

Imagine this scenario:

1. Create tensor 1 (succeeds)
2. Create tensor 2 (fails)
3. Run the graph with ggml_graph_compute

At step 3, we didn't actually build the graph successfully. We can't compute it, because tensor 2 is missing or in an invalid state. Right?

Now, if it had been tensor 1 that failed, trying to create tensor 2 would have run into the assert and crashed the process: step 3 would never be reached so that's okay.
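
The scenario can be sketched with a toy context that records a sticky error code. This is only an illustration under assumptions: the toy_ names are hypothetical, and where the diff above uses GGML_ASSERT in ggml_graph_compute (which would abort), the compute check here is a predicate so the outcome can be observed without crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_errcode { TOY_ERRCODE_SUCCESS = 0, TOY_ERRCODE_NO_MEMORY };

struct toy_tensor { size_t size; };

#define TOY_MAX_TENSORS 4

struct toy_ctx {
    size_t capacity, used;            /* bytes */
    enum toy_errcode last_error_code; /* sticky once set */
    struct toy_tensor slots[TOY_MAX_TENSORS];
    int n_slots;
};

/* True while no error has been recorded in the context. */
static bool toy_ctx_ok(const struct toy_ctx * ctx) {
    return ctx->last_error_code == TOY_ERRCODE_SUCCESS;
}

/* Returns NULL and records an error if the tensor does not fit. */
static struct toy_tensor * toy_new_tensor(struct toy_ctx * ctx, size_t size) {
    /* Ignoring an earlier error and carrying on aborts here. */
    assert(toy_ctx_ok(ctx));
    if (ctx->n_slots == TOY_MAX_TENSORS || size > ctx->capacity - ctx->used) {
        ctx->last_error_code = TOY_ERRCODE_NO_MEMORY;
        return NULL;
    }
    ctx->used += size;
    ctx->slots[ctx->n_slots].size = size;
    return &ctx->slots[ctx->n_slots++];
}

/* Stand-in for the check at the top of graph compute: a graph built on
 * a failed context is incomplete and must not be run. */
static bool toy_graph_can_compute(const struct toy_ctx * ctx) {
    return toy_ctx_ok(ctx);
}
```

Creating a 60-byte tensor twice in a 100-byte context reproduces steps 1 and 2: the first call succeeds, the second returns NULL and sets the error code, and toy_graph_can_compute then reports false for step 3.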

I will make the other changes you suggested so the API stays the same. Can I take this response as indicating you don't have a problem with the general approach I'm using and I should continue to develop it?

edit: Also, do you have an opinion on clearing error conditions and which approach to take with that?

KerfuffleV2 · 2023-05-25T06:13:47Z

Closing this, at least for now as it doesn't seem like there's interest.

j-f1 reviewed May 5, 2023

ggerganov reviewed May 6, 2023

KerfuffleV2 added 2 commits May 6, 2023 15:50

First rough draft of recoverable errors feature.

1a6987a

Keep API and context fields the same whether or not GGML_RECOVERABLE_ERRORS is defined.

7523107

KerfuffleV2 force-pushed the feat-recoverable-errors branch from dc5aed7 to 7523107 Compare May 6, 2023 21:51

There were still struct fields and defines conditionally enabled.

30b2b3d

Change return from ggml_last_error_msg to be const. Cast returning error msg buffer to avoid a compiler warning.

KerfuffleV2 mentioned this pull request May 7, 2023

Allow prechecking how much memory tensor creation will require #1349

Closed

KerfuffleV2 closed this May 25, 2023

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open
