Improve reporting errors up through the stack #500

Closed
kbenzie opened this issue May 9, 2023 · 6 comments · Fixed by #589
Labels: enhancement, pi, specification
Milestone: 0.7

@kbenzie (Contributor) commented May 9, 2023

Currently UR adapters make a best effort to map driver-specific errors to ur_result_t enumerations and return those to the parallel language runtime on top of UR. This approach is problematic because it hides from the user the details of how an adapter is using a driver, details which are very often necessary to determine how to resolve a given error condition.
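
For illustration, a minimal sketch of the pattern being described, for a CUDA-backed adapter; the function name and the mapping table are hypothetical, not actual adapter code:

    #include <cuda.h>
    #include <ur_api.h>

    // Hypothetical sketch: the native CUresult is collapsed into a generic
    // ur_result_t, so the original driver error is gone by the time the
    // result reaches the language runtime.
    static ur_result_t mapCudaError(CUresult cu_res) {
      switch (cu_res) {
      case CUDA_SUCCESS:
        return UR_RESULT_SUCCESS;
      case CUDA_ERROR_OUT_OF_MEMORY:
        return UR_RESULT_ERROR_OUT_OF_RESOURCES;
      default:
        // Many distinct driver errors end up as the same opaque value here,
        // which is exactly the loss of detail this issue is about.
        return UR_RESULT_ERROR_UNKNOWN;
      }
    }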

The purpose of this issue is to track proposed solutions and the decision-making process, and to determine the next steps to enable improved error reporting.

This is blocking progress on the design of #68 and is related to #471.

@kbenzie added the enhancement, needs-discussion, pi, and specification labels on May 9, 2023
@kbenzie changed the title from "Improve error reporting errors up through the stack" to "Improve reporting errors up through the stack" on May 9, 2023
@alycm (Contributor) commented May 17, 2023

  • We already have a UR_BACKEND_SPECIFIC_ERROR enum to indicate when a string can be returned.
  • We still need a way to get native backend error codes back from UR, ideally not by parsing the error string (the existing string-only entry point is sketched below).
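
For reference, the existing string-only query these points refer to has roughly this shape (the handle type is an assumption here; the parameter list matches the diff later in this thread):

    // Returns only a human-readable message for the last adapter error;
    // the native integer error code is not exposed.
    ur_result_t urGetLastResult(
        ur_platform_handle_t hPlatform,
        const char **ppMessage);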

@kbenzie removed the needs-discussion label on May 17, 2023
@JackAKirk (Contributor) commented May 19, 2023

L0 already uses UR_BACKEND_SPECIFIC_ERROR consistently. The pi_cuda and pi_hip plugins mostly don't use it. Once PI is ported to UR, it might be nice if UR did something like this in the cuda/hip cases: https://github.com/intel/llvm/pull/8303/files#diff-7525901710934f7bdb2ad36238c4b67163f112d3bd233db7af0b0078b5b01e80R5920

So one suggestion is to go through and find all the places where, e.g., CUDA driver functions are called, record what they return in cu_res, and insert

    if (cu_res != CUDA_SUCCESS) {
      setPluginSpecificMessage(cu_res);
      return PI_ERROR_PLUGIN_SPECIFIC_ERROR;
    }

after all such calls, if that makes sense. Then we could clean up the PI by removing all the other error-reporting machinery like map_error and check_error. I'm not sure whether thinking about this now would simplify the porting work, but I'll leave it up to you to decide what is best!

The format for setPluginSpecificMessage is open to suggestions. We could make it a bit more verbose, like this: https://github.com/intel/llvm/blob/484cf252246a958b089a8e94e35b14bd791a213c/sycl/plugins/cuda/pi_cuda.cpp#L181
But there is a 256-character limit to be aware of.
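
As a rough sketch of what that could look like, assuming a fixed 256-character thread-local buffer; only the setPluginSpecificMessage name and the 256 limit come from this thread, the rest is illustrative:

    #include <cstdio>
    #include <cuda.h>

    static constexpr size_t MaxMessageSize = 256;
    thread_local char ErrorMessage[MaxMessageSize];

    // Illustrative sketch: render the native CUresult into a message that a
    // later "get last error" query can hand back to the user.
    static void setPluginSpecificMessage(CUresult cu_res) {
      const char *name = nullptr;
      const char *desc = nullptr;
      cuGetErrorName(cu_res, &name);    // e.g. "CUDA_ERROR_OUT_OF_MEMORY"
      cuGetErrorString(cu_res, &desc);
      // snprintf truncates rather than overflowing the fixed-size buffer.
      std::snprintf(ErrorMessage, MaxMessageSize, "%s: %s",
                    name ? name : "unknown", desc ? desc : "unknown");
    }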

@kbenzie (Contributor, Author) commented May 25, 2023

Since I've not managed to get round to creating a full PR yet, here it is in short form:

-ur_result_t urGetLastResult(
+ur_result_t urPlatformGetLastError(
     hPlatform,
     const char** ppMessage,
+    const int32_t *pError
);
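
A rough sketch of how a runtime sitting on top of UR might consume this, assuming pError is an out parameter and that adapters signal the condition with the adapter-specific result code; the helper name is illustrative:

    #include <cstdint>
    #include <cstdio>
    #include <ur_api.h>

    // Illustrative helper: when a UR entry-point reports an adapter-specific
    // error, fetch the message and the raw backend error code from the adapter.
    void reportAdapterError(ur_platform_handle_t hPlatform, ur_result_t res) {
      if (res != UR_RESULT_ERROR_ADAPTER_SPECIFIC)
        return;
      const char *message = nullptr;
      int32_t nativeError = 0;
      urPlatformGetLastError(hPlatform, &message, &nativeError);
      // nativeError holds the raw driver code (e.g. a CUresult or ze_result_t).
      std::fprintf(stderr, "backend error %d: %s\n", nativeError, message);
    }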

@igchor (Member) commented Jun 1, 2023

@pbalcer @vinser52 I think we might also need something similar for UMA (a way to return the native error code). When using a UMA memory provider on top of L0 we lose the L0-specific error codes, and the L0 adapter might want to know exactly why an allocation failed.

@vinser52 (Contributor) commented Jun 2, 2023

> @pbalcer @vinser52 I think we might also need something similar for UMA (a way to return the native error code). When using a UMA memory provider on top of L0 we lose the L0-specific error codes, and the L0 adapter might want to know exactly why an allocation failed.

As I remember, each memory provider has a get_last_error interface. Isn't that what is needed?

@igchor (Member) commented Jun 2, 2023

> @pbalcer @vinser52 I think we might also need something similar for UMA (a way to return the native error code). When using a UMA memory provider on top of L0 we lose the L0-specific error codes, and the L0 adapter might want to know exactly why an allocation failed.
>
> As I remember, each memory provider has a get_last_error interface. Isn't that what is needed?

Yes, but get_last_error only returns a string describing the error, and in some cases (like the L0 adapter, where we know that we are using an L0 memory provider) we might need to get the actual error status (as an int).
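
Purely as an illustration of the distinction, one hypothetical shape for such an interface (these names are not the actual UMA provider API):

    #include <stdint.h>

    // Hypothetical provider ops, for illustration only.
    struct memory_provider_error_ops_sketch {
        // Existing style of query: human-readable description of the last error.
        void (*get_last_error)(void *provider, const char **ppMessage);
        // What the L0 adapter would additionally want: the raw native status
        // (for an L0-backed provider this would be a ze_result_t value).
        void (*get_last_native_error)(void *provider, int32_t *pNativeError);
    };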

kbenzie added a commit to kbenzie/unified-runtime that referenced this issue Jun 9, 2023
Remove the `urGetLastResult()` entry-point and replace it with
`urPlatformGetLastError()`. The primary difference is the addition of
the `pError` out parameter, which returns the error code emitted by a
failed driver entry-point that resulted in a Unified Runtime
entry-point returning `UR_RESULT_ERROR_ADAPTER_SPECIFIC`.

Fixes oneapi-src#500.
kbenzie added a commit to kbenzie/unified-runtime that referenced this issue Jun 12, 2023
@kbenzie added this to the 0.7 milestone on Aug 3, 2023