Improve reporting errors up through the stack #500

Closed
kbenzie opened this issue May 9, 2023 · 6 comments · Fixed by #589
Labels: enhancement, pi, specification
Milestone: 0.7

@kbenzie (Contributor) commented May 9, 2023

Currently UR adapters make a best effort to map driver-specific errors to ur_result_t enumerations and return those to the parallel language runtime on top of UR. This approach is problematic because it hides from the user the details of how an adapter is using a driver, details which are very often necessary to determine how to resolve a given error condition.
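
For illustration, a minimal sketch of the pattern being described, for a CUDA-backed adapter; the function name and the mapping table are hypothetical, not actual adapter code:

    #include <cuda.h>
    #include <ur_api.h>

    // Hypothetical sketch: the native CUresult is collapsed into a generic
    // ur_result_t, so the original driver error is gone by the time the
    // result reaches the language runtime.
    static ur_result_t mapCudaError(CUresult cu_res) {
      switch (cu_res) {
      case CUDA_SUCCESS:
        return UR_RESULT_SUCCESS;
      case CUDA_ERROR_OUT_OF_MEMORY:
        return UR_RESULT_ERROR_OUT_OF_RESOURCES;
      default:
        // Many distinct driver errors end up as the same opaque value here,
        // which is exactly the loss of detail this issue is about.
        return UR_RESULT_ERROR_UNKNOWN;
      }
    }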

The purpose of this issue is to track proposed solutions and the decision-making process, and to determine the next steps to enable improved error reporting.

This is blocking progress on the design of #68 and is related to #471.

@kbenzie added the enhancement, needs-discussion, pi, and specification labels on May 9, 2023
@kbenzie changed the title from "Improve error reporting errors up through the stack" to "Improve reporting errors up through the stack" on May 9, 2023
@alycm (Contributor) commented May 17, 2023

  • We already have a UR_BACKEND_SPECIFIC_ERROR enum to indicate when a string can be returned.
  • We still need a way to get native backend error codes back from UR, ideally not by parsing the error string (the existing string-only entry point is sketched below).
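
For reference, the existing string-only query these points refer to has roughly this shape (the handle type is an assumption here; the parameter list matches the diff later in this thread):

    // Returns only a human-readable message for the last adapter error;
    // the native integer error code is not exposed.
    ur_result_t urGetLastResult(
        ur_platform_handle_t hPlatform,
        const char **ppMessage);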

@kbenzie removed the needs-discussion label on May 17, 2023
@JackAKirk (Contributor) commented May 19, 2023

L0 already uses UR_BACKEND_SPECIFIC_ERROR consistently. The pi_cuda and pi_hip plugins mostly don't use it. Once PI is ported to UR, it might be nice if UR did something like this in the cuda/hip cases: https://github.com/intel/llvm/pull/8303/files#diff-7525901710934f7bdb2ad36238c4b67163f112d3bd233db7af0b0078b5b01e80R5920

So one suggestion is to go through and find all the places where, e.g., CUDA driver functions are called, record what they return in cu_res, and insert

    if (cu_res != CUDA_SUCCESS) {
      setPluginSpecificMessage(cu_res);
      return PI_ERROR_PLUGIN_SPECIFIC_ERROR;
    }

after all such calls, if that makes sense. Then we could clean up the PI by removing all the other error-reporting machinery like map_error and check_error. I'm not sure whether thinking about this now would simplify the porting work, but I'll leave it up to you to decide what is best!

The format for setPluginSpecificMessage is open to suggestions. We could make it a bit more verbose, like this: https://github.com/intel/llvm/blob/484cf252246a958b089a8e94e35b14bd791a213c/sycl/plugins/cuda/pi_cuda.cpp#L181
But there is a 256-character limit to be aware of.
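
As a rough sketch of what that could look like, assuming a fixed 256-character thread-local buffer; only the setPluginSpecificMessage name and the 256 limit come from this thread, the rest is illustrative:

    #include <cstdio>
    #include <cuda.h>

    static constexpr size_t MaxMessageSize = 256;
    thread_local char ErrorMessage[MaxMessageSize];

    // Illustrative sketch: render the native CUresult into a message that a
    // later "get last error" query can hand back to the user.
    static void setPluginSpecificMessage(CUresult cu_res) {
      const char *name = nullptr;
      const char *desc = nullptr;
      cuGetErrorName(cu_res, &name);    // e.g. "CUDA_ERROR_OUT_OF_MEMORY"
      cuGetErrorString(cu_res, &desc);
      // snprintf truncates rather than overflowing the fixed-size buffer.
      std::snprintf(ErrorMessage, MaxMessageSize, "%s: %s",
                    name ? name : "unknown", desc ? desc : "unknown");
    }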

@kbenzie (Contributor, Author) commented May 25, 2023

Since I've not managed to get round to creating a full PR yet, here it is in short form:

-ur_result_t urGetLastResult(
+ur_result_t urPlatformGetLastError(
     hPlatform,
     const char** ppMessage,
+    const int32_t *pError
);
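
A rough sketch of how a runtime sitting on top of UR might consume this, assuming pError is an out parameter and that adapters signal the condition with the adapter-specific result code; the helper name is illustrative:

    #include <cstdint>
    #include <cstdio>
    #include <ur_api.h>

    // Illustrative helper: when a UR entry-point reports an adapter-specific
    // error, fetch the message and the raw backend error code from the adapter.
    void reportAdapterError(ur_platform_handle_t hPlatform, ur_result_t res) {
      if (res != UR_RESULT_ERROR_ADAPTER_SPECIFIC)
        return;
      const char *message = nullptr;
      int32_t nativeError = 0;
      urPlatformGetLastError(hPlatform, &message, &nativeError);
      // nativeError holds the raw driver code (e.g. a CUresult or ze_result_t).
      std::fprintf(stderr, "backend error %d: %s\n", nativeError, message);
    }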

@igchor (Member) commented Jun 1, 2023

@pbalcer @vinser52 I think we might also need something similar for UMA (a way to return the native error code). When using a UMA memory provider on top of L0 we lose the L0-specific error codes, and the L0 adapter might want to know exactly why an allocation failed.

@vinser52 (Contributor) commented Jun 2, 2023

> @pbalcer @vinser52 I think we might also need something similar for UMA (a way to return the native error code). When using a UMA memory provider on top of L0 we lose the L0-specific error codes, and the L0 adapter might want to know exactly why an allocation failed.

As I remember, each memory provider has a get_last_error interface. Isn't that what is needed?

@igchor (Member) commented Jun 2, 2023

> @pbalcer @vinser52 I think we might also need something similar for UMA (a way to return the native error code). When using a UMA memory provider on top of L0 we lose the L0-specific error codes, and the L0 adapter might want to know exactly why an allocation failed.
>
> As I remember, each memory provider has a get_last_error interface. Isn't that what is needed?

Yes, but get_last_error only returns a string describing the error, and in some cases (like the L0 adapter, where we know that we are using an L0 memory provider) we might need to get the actual error status (as an int).
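
Purely as an illustration of the distinction, one hypothetical shape for such an interface (these names are not the actual UMA provider API):

    #include <stdint.h>

    // Hypothetical provider ops, for illustration only.
    struct memory_provider_error_ops_sketch {
        // Existing style of query: human-readable description of the last error.
        void (*get_last_error)(void *provider, const char **ppMessage);
        // What the L0 adapter would additionally want: the raw native status
        // (for an L0-backed provider this would be a ze_result_t value).
        void (*get_last_native_error)(void *provider, int32_t *pNativeError);
    };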

kbenzie added a commit to kbenzie/unified-runtime that referenced this issue Jun 9, 2023
Remove the `urGetLastResult()` entry-point and replace it with
`urPlatformGetLastError()`. The primary difference is the addition of
the `pError` out parameter, which returns the error code emitted by a
failed driver entry-point that resulted in a Unified Runtime
entry-point returning `UR_RESULT_ERROR_ADAPTER_SPECIFIC`.

Fixes oneapi-src#500.
kbenzie added a commit to kbenzie/unified-runtime that referenced this issue Jun 12, 2023
@kbenzie added this to the 0.7 milestone on Aug 3, 2023