-
Notifications
You must be signed in to change notification settings - Fork 11.4k
First rough draft of recoverable errors feature. #1325
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
There are two possible alternative approaches instead of just always marking the context as dead if an error occurs:
Those are technically possible (and I'd be willing to work on it if you have strong feelings) but personally it's probably not worth making things excessively complicated. I don't like the first option because it's basically random if you get screwed over and there needs to be special handling to deal with that case. |
do { \ | ||
if (!(x)) { \ | ||
printf(__VA_ARGS__); \ | ||
fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could these be defined as e.g.
fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \ | |
fprintf(stderr, "GGML_ASSERT: %s:%d: %s: %s\n", __FILE__, __LINE__, __func__, #x); \ |
? That way, asserts wouldn’t have to handle the function name boilerplate.
I just had an idea for an approach that's probably better (and also likely useful for other stuff too). Add a function to do the size calculation part without actually adding a tensor. Something like: bool ggml_tensor * ggml_ensure_tensor_memory(
struct ggml_context * ctx,
enum ggml_type type,
int n_dims,
const int64_t* ne,
int *ctx_required,
int *scratch_required) {
// etc
*ctx_required += calculated_ctx_required;
*scratch_required += calculated_scratch_required;
return *ctx_required <= available_ctx && *scratch_required <= available_scratch;
} That way it would be possible to call the function repeatedly as long as it returned true and it would just update the required memory arguments (which would be set to 0 initially). I actually like this idea and I think it would be pretty easy to make those few functions that are currently hard to deal with safely retryable. |
@@ -11741,6 +11816,9 @@ static thread_ret_t ggml_graph_compute_thread(void * data) { | |||
} | |||
|
|||
void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph) { | |||
#ifdef GGML_RECOVERABLE_ERRORS | |||
GGML_ASSERT(ctx->last_error_code == GGML_ERRCODE_SUCCESS); | |||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think this is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way it works currently is setting an error status in the context when an error occurs. When the error occurs, that operation didn't complete successfully.
Imagine this scenario:
- Create tensor 1 (succeeds)
- Create tensor 2 (fails)
- Run the graph with
ggml_graph_compute
At step 3, we didn't actually build the graph successfully. We can't compute it, because tensor 2 is missing or in an invalid state. Right?
Now, if it had been tensor 1 that failed, trying to create tensor 2 would have run into the assert and crashed the process: step 3 would never be reached so that's okay.
I will make the other changes you suggested so the API stays the same. Can I take this response as indicating you don't have a problem with the general approach I'm using and I should continue to develop it?
edit: Also, do you have an opinion on the clearing error conditions stuff and approach to take with that?
dc5aed7
to
7523107
Compare
Change return from ggml_last_error_msg to be const. Cast returning error msg buffer to avoid a compiler warning.
Closing this, at least for now as it doesn't seem like there's interest. |
@ggerganov
Context: ggml-org/ggml#123
Unfortunately (like always) this turned out to be more complicated than I initially expected.
At first I was planning to make it possible to acknowledge and clear an error, but there are some cases where recovery seems impossible like my map functions for example: Because they have to create another tensor and there's no way to free a tensor clearing the error would still leave that tensor allocated.
Probably need to scrap that plan and say "Once an error occurs, you need to free that
ggml_context
and start over". I think that's fine. (Right now those specific functions will abort if tensor creation fails after the first step, but assuming the context is just considered dead then returningNULL
will be okay.)Also, the way I implemented this is if you ignore an error and just keep trying to use the context, then it will abort.
Right now, this just adds the basic scaffolding. I also had to add
NULL
propagation to basically all functions that make a tensor. This could be simplified with a define instead of having to have a bunch ofif (blah == NULL) return NULL;
statements but for this sketch I'm keeping it simple.This converts the asserts in
ggml_add_impl
andggml_new_tensor_impl
to be recoverable. There's also code cleanup, etc that would need to be done before this is actually ready to become a real pull.Right now
GGML_RECOVERABLE_ERRORS
is just#define
ed for testing. I have verified that it compiles (on Linux at least) and can run a model. I haven't tested triggering the asserts yet.Before I continue with the rest, I want to make sure that I'm generally on the right track. What do you think?