Consider adding node `label`s for more diagnosable error messages for async errors. #585

philloooo · 2024-02-26T18:52:22Z

As a follow-up to #572, we propose that platform-specific validations be done during the async build step.

This poses a challenge for developers: when they submit a complex graph and one step within it fails a platform-specific check, it is hard to trace the error back to the specific operand in the graph.

I propose following WebGPU’s practice and defining an MLObjectBase with a label field for MLOperand to extend.
Usage would look like:

const builder = new MLGraphBuilder(context);

const A = builder.input('0', operandType);
const B = builder.input('1', operandType);

const C = builder.matmul(A, B); 
C.label = "step1:matmul";
const D = builder.add(A, C);
D.label = "step2:add";
// ... keep building a complex graph
...
const finalOperand = builder.add(E, F);

// Build the graph.
const graph = await builder.build({'output': finalOperand});


> Uncaught DOMException: Model graph build error: [Operand "step1:matmul"] input dimensions XXX exceed supported limit.

The MLObjectBase could also be extended by:

* MLBuffer, to help with debugging async buffer-related errors.
* MLGraph, to help with debugging async errors from chained inference.

zolkis · 2024-02-27T13:57:31Z

Could the build process auto-add the labels according to an auto-instrumentation algorithm to be specified?

philloooo · 2024-02-27T17:34:04Z

@zolkis that's an interesting idea, can you elaborate more with some examples?

zolkis · 2024-02-28T20:35:27Z

Well, IIUC the example above has the use case to annotate ops with labels, to help developers figure out what went wrong when exceptions are thrown.

If that is the main use case, boilerplate code is needed for manually adding a label at every single level. When the label is not specified, there is no information in the exception.

Instead of manually setting explicit labels, annotations (implicit labels) could be added automatically by the build algorithm, which knows where in the compute graph it currently is. When an exception occurs, the internal label like "step1:matmul" (standard names to be defined) could be passed.

The example changes only in that the labels are not developer-injected but internally generated (as would be specified by a future algorithm).

const builder = new MLGraphBuilder(context);

const A = builder.input('0', operandType);
const B = builder.input('1', operandType);

const C = builder.matmul(A, B); 
const D = builder.add(A, C);
// ... keep building a complex graph
...
const finalOperand = builder.add(E, F);

// Build the graph.
const graph = await builder.build({'output': finalOperand});

> Uncaught DOMException: Model graph build error: [Operand "step1:matmul"] input dimensions XXX exceed supported limit.

IOW, the labels could be owned and attached by the implementation.

The advantage is that this covers all graphs at all levels automatically. The disadvantage is the lack of developer-given labels (and no possibility of piggy-backing other instrumentation). However, the whole process is under the control of the implementation, so there are no worries about sanitizing or checking developer-injected labels.

I am not even sure we must standardize the name space for such labels, as this could be owned by the implementations (when it's only meant for human eyes) -- unless programmatic handling of that information is needed.

On the other hand, if there is experience and positive developer feedback on the label feature in WebGPU (citations needed), I have no objection to using it in WebNN as well.

inexorabletash · 2024-02-29T01:14:19Z

> annotations (implicit labels) could be added automatically by the build algorithm, which knows where in the compute graph it currently is

Thanks for explaining - I like this idea. Basically in build() it would copy the label from the MLOperand if set, if not would use an algorithm to generate a label. This gets stuffed in what I'm calling the "platform operator" (or is it the "platform operand"?), and used if build() or compute() fails.

A few notes:

* Since build() is async, we need to deal with labels changing during the build process. Right now MLOperand is immutable; this proposal would change that. Various options are possible, including freezing or snapshotting, but it needs to be thought through. This is also relevant if a builder can be reused (Can an MLGraphBuilder be reused? #567), I think.
* Re: "I am not even sure we must standardize the name space for such labels": someone will end up parsing them, even if we say "please don't!", so we will probably have to specify them, but we can look for precedent.
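
A minimal sketch of the snapshotting option, under stated assumptions: the operator records and the `snapshotLabels` helper are hypothetical stand-ins rather than WebNN API, and the `stepN:type` fallback format is borrowed from the example earlier in this thread, since no naming scheme has been specified.

```javascript
// Hypothetical helper: copy each label once when build() starts, so later
// mutations by the page cannot affect error messages produced mid-build.
function snapshotLabels(operators) {
  return operators.map((op, index) => Object.freeze({
    type: op.type,
    // Use the developer's label if set; otherwise generate one (assumed format).
    label: op.label ? String(op.label) : `step${index + 1}:${op.type}`,
  }));
}

// Plain objects standing in for the operators recorded by a builder.
const ops = [{ type: 'matmul', label: 'decoder.attn.0.q_proj' }, { type: 'add' }];
const frozen = snapshotLabels(ops);

// A label changed after the snapshot does not leak into the build.
ops[0].label = 'changed-after-build-started';
console.log(frozen.map((op) => op.label));
// [ 'decoder.attn.0.q_proj', 'step2:add' ]
```

Freezing the copies rather than the operands keeps the developer-facing objects untouched; whether the real objects should stay immutable is the open question above.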

zolkis · 2024-03-07T15:57:03Z

> Basically in build() it would copy the label from the MLOperand if set, if not would use an algorithm to generate a label.

I support this, looks like the best way to go.

philloooo · 2024-03-07T17:00:36Z

Thanks @zolkis! Auto-instrumentation would indeed be less work for developers. My concern is whether the system can add meaningful enough labels. The more complex a model gets, the less useful something like a step number (as in your example) becomes.

Imagine a transformer model: developers would probably namespace the labels as something like decoder.attn.0.cross_attn.conv1 to give structural context.

Allowing developers to specify labels, and falling back to a system label, seems more plausible.

As for WebGPU label usage, I don't have exact stats to point to, but I did consult our WebGPU team and they mentioned developers find the labels extremely useful.

zolkis · 2024-03-07T18:43:04Z

Right, developers can add labels - but when they don't, does it mean they don't want any other information, i.e. should we ditch the auto-generation? Or set an option for that?

philloooo · 2024-03-07T19:42:01Z

For auto-generated information, we have a couple of options:

1. Use it as the default label only when developers don't specify one.
2. Always add this information to the error message, after the user-specified label. We then don't need to specify it in the spec; it's an implementation detail. The spec only specifies that if developers provide labels, they will be included in the error message.

Option 2 would look like:

[Operand "user_specified_label"] op:mul step:2 - input dimensions XXX exceed supported limit.

It seems that if the user agent can add useful annotations to the error message, it can just always add them. So I'd prefer option 2?
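
As a rough illustration of option 2, here is a sketch of how a user agent might compose such a message. The `formatBuildError` function and its parameter names are hypothetical; the output format simply mirrors the sample message above and is not specified anywhere.

```javascript
// Hypothetical formatter: the user-specified label (if any) comes first,
// followed by implementation-generated annotations, then the failure detail.
function formatBuildError({ userLabel, opType, step, detail }) {
  const prefix = userLabel ? `[Operand "${userLabel}"] ` : '';
  return `${prefix}op:${opType} step:${step} - ${detail}`;
}

const detail = 'input dimensions XXX exceed supported limit.';
const withLabel = formatBuildError({
  userLabel: 'user_specified_label', opType: 'mul', step: 2, detail });
const withoutLabel = formatBuildError({ opType: 'mul', step: 2, detail });

console.log(withLabel);
// [Operand "user_specified_label"] op:mul step:2 - input dimensions XXX exceed supported limit.
console.log(withoutLabel);
// op:mul step:2 - input dimensions XXX exceed supported limit.
```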

fdwr · 2024-03-18T22:36:11Z

Related, we're currently trying to diagnose some Yolo-V9 slice issues, and the lack of diagnostic info is impeding investigation ("Failed to execute 'slice' on 'MLGraphBuilder': For dimension (0): the starting index to slice must be less than input size (2)." - cool, but which slice node?...)

inexorabletash · 2024-03-18T23:31:10Z

I may have said this in a WG telecon, but this feels like something where a prototype implementation would help inform the spec. So if any of the Chromium contributors who are feeling the pain want to hack something in, don't wait on spec discussions and it doesn't need to be perfect! Let's iterate and learn.

lisa0314 · 2024-04-26T02:34:38Z

This CL 5492314: WebNN: inital implementation of Add label for mloperand | https://chromium-review.googlesource.com/c/chromium/src/+/5492314 attempts to add a label to MLOperand to report more detailed error messages during the async build. As a POC, I only added the label for the slice operator.

const builder = new MLGraphBuilder(context);

const A = builder.input('0', operandType);
const B = builder.input('1', operandType);

const C = builder.matmul(A, B); 
C.label = "step1:matmul";
const D = builder.add(A, C);
D.label = "step2:add";
// ... keep building a complex graph
...
const finalOperand = builder.add(E, F);

// Build the graph.
const graph = await builder.build({'output': finalOperand});

One question about the IDL definition. If a sequence of operands were returned when invoking builder.split(), should we set labels for all operands?

lisa0314 · 2024-04-26T02:39:13Z

Another proposal: the label also could be added into MLOperator.

const builder = new MLGraphBuilder(context);

const A = builder.input('0', operandType);
const B = builder.input('1', operandType);

const C = builder.matmul(A, B, "matmul"); 
const D = builder.add(A, C, "add");
// ... keep building a complex graph
...
const finalOperand = builder.add(E, F, "add2");

// Build the graph.
const graph = await builder.build({'output': finalOperand});

Any thoughts?

mingmingtasd · 2024-04-26T03:08:21Z

> Another proposal: the label also could be added into MLOperator.
>
> const builder = new MLGraphBuilder(context);
>
> const A = builder.input('0', operandType);
> const B = builder.input('1', operandType);
>
> const C = builder.matmul(A, B, "matmul");
> const D = builder.add(A, C, "add");
> // ... keep building a complex graph
> ...
> const finalOperand = builder.add(E, F, "add2");
>
> // Build the graph.
> const graph = await builder.build({'output': finalOperand});
>
> Any thoughts?

Agree! From our experience debugging graphs translated from frameworks like onnxruntime, the operator/node name is very useful for matching an operator in the .onnx model with the operator implemented in the WebNN backends.
Especially in large models with many operators, it's difficult to tell which operator WebNN is processing, since operators can share the same type and attributes. The operator's name is unique, though. If we can pass the name to the WebNN backend, debugging becomes easier, and we can also report more detailed error messages to web developers, for example: Failed to create the matmul operator: /layers.0/self_attn/q_proj/MatMul. You can then open the .onnx model in a visualization tool like https://netron.app/ and search for the operator by name.

mingmingtasd · 2024-04-26T03:14:36Z

We can have both of them: operand name and operator name.
Another proposal: for the operand label, we can extend it like

const C = builder.matmul(A, B); 
C.label = {name: "C", fromOperator: "step1:matmul"}

philloooo · 2024-05-02T17:49:27Z

The downside of adding it to MLOperator is that MLOperator doesn't exist as a concept in the spec, so we would need to add the param to each of the builder methods. It also makes the param list longer.

Alternative A - add to `options` dict

Another alternative is to add it to the options dict: all builder methods would then take an options dict, and existing option dicts like MLClampOptions would inherit from a base options dict that has a label field.

Alternative B - extend MLOperand's label field

Extending the operand label field seems a bit non-intuitive:

* For most cases, we just want the label to take a string. So do we then support both string and object types for label?
* If we have C, D = builder.method(A, B); and C and D set different fromOperator values, do we let the last one win?

Alternative C - keep with current proposal

We can also keep the current proposal, treating the multiple-outputs case as an edge case and handling it less elegantly.
For example, for lstm, we can iterate through the output operands and use the first output operand's label string that's not empty. It's probably still sufficiently informative for debugging.
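
A small sketch of that fallback rule, assuming operands are plain objects with an optional string `label`; the `labelForOperator` helper is hypothetical and only illustrates the "first non-empty label wins" behavior described above.

```javascript
// Hypothetical helper: derive one label for a multi-output operator by taking
// the first output operand whose label is a non-empty string.
function labelForOperator(outputs) {
  const labelled = outputs.find(
      (operand) => typeof operand.label === 'string' && operand.label !== '');
  return labelled ? labelled.label : '';
}

// Stand-ins for the operands a multi-output operator such as lstm returns.
const lstmOutputs = [{ label: '' }, { label: 'encoder.lstm.hidden' }, { label: 'encoder.lstm.cell' }];

console.log(labelForOperator(lstmOutputs));        // encoder.lstm.hidden
console.log(labelForOperator([{}, { label: '' }])); // (empty string)
```

The cost of this rule is that any later, different labels on sibling outputs are silently ignored.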

zolkis · 2024-05-03T14:53:55Z

Considering Joshua's comment, alternative A seems to fit the best (to handle MLOperand's immutability vs labels mutability). The parameter / options-dict can be optional.

We also need an algorithm for generating good enough default labels (implementation-specific, but as per the comment above, we should rather standardize the namespace before devs start to parse them and come up with various private namespaces).
If that is good enough, we probably won't need manual labels very often.

But if labels are used frequently, then from a developer coding perspective, I'd prefer the solution that sets the labels on separate lines, as in alternative B, since it allows separating code instrumentation from business logic. If we could find a means to do this correctly, I'd go with that.

lisa0314 · 2024-05-11T02:54:34Z

This CL 5528797: WebNN: initial implementation of adding name for MLOperator | https://chromium-review.googlesource.com/c/chromium/src/+/5528797 attempts to add a label to MLOperator to report more detailed error messages during the async build. As a POC, I only added the label for the resample2d operator.

const builder = new MLGraphBuilder(context);

const inputShape = [1, 1, 2, 4];
const input = builder.input(
    'input', {dataType: 'float32', dimensions: inputShape});
const options = {scales: [2.0, 2.0], label: "resample2d"};
const resample2d = builder.resample2d(input, options);

// Build the graph.
const graph = await builder.build({'output': resample2d});

fdwr · 2024-06-15T00:55:06Z

> add label for MLOperand ... adding name for MLOperator

If we could only pick one to have labels (nodes or edges), I also prefer node names (https://chromium-review.googlesource.com/c/chromium/src/+/5528797) over edge names (https://chromium-review.googlesource.com/c/chromium/src/+/5492314).

However, I noticed some models have no node names, only edge names. So having the true edge names makes looking for a match in the original graph as easy as Ctrl+F in tools like Netron.

Mingming commented in the CR that we could generate edge names from the node names, but if we think that generating edge names is useful, then being able to pass the actual edge names is even more useful. Though, if we think reporting the node name and WebNN parameter name suffices (e.g. “conv2d” operator and its “filter” parameter), then we need neither explicit edge labels nor implicitly generated edge labels.

fdwr · 2024-06-15T01:01:35Z

> As POC, I only add the label for resample2d operator.

🤔 @lisa0314 It could be helpful in the POC to include at least one more operator that doesn't already have an options dictionary, like add or cast, to visualize how that works. Would they accept an additional parameter that defaulted to the MLLabelOptions in your updated IDL?

partial interface MLGraphBuilder {
  MLOperand add(MLOperand a, MLOperand b, optional MLLabelOptions options = {});
  ...
};
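
To visualize the call site, here is a sketch with a mock builder. `MockGraphBuilder` is a hypothetical stand-in and not the WebNN implementation; it only assumes that every method accepts a trailing options dictionary with an optional `label` member, as in the IDL sketch above.

```javascript
// Mock builder (assumption): each method takes a trailing options dict and
// stores the optional label on the operator it records.
class MockGraphBuilder {
  constructor() {
    this.operators = [];
  }
  input(name) {
    return { name };
  }
  add(a, b, options = {}) {
    return this.record('add', options);
  }
  cast(input, dataType, options = {}) {
    return this.record('cast', options);
  }
  record(type, { label = '' } = {}) {
    const operator = { type, label };
    this.operators.push(operator);
    return { producer: operator };
  }
}

const builder = new MockGraphBuilder();
const a = builder.input('a');
const b = builder.input('b');
const sum = builder.add(a, b, { label: 'decoder.attn.0.add' });
const casted = builder.cast(sum, 'float16'); // Label omitted: defaults to ''.

console.log(builder.operators.map((op) => `${op.type}:${op.label || '(none)'}`));
// [ 'add:decoder.attn.0.add', 'cast:(none)' ]
```

Because the options argument defaults to an empty dictionary, existing calls without a label keep working unchanged.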

lisa0314 · 2024-06-17T02:16:39Z

> It could be helpful in the POC to include at least one more operator that doesn't already have an options dictionary, like add or cast, to visualize how that works. Would they accept an additional parameter that defaulted to the MLLabelOptions in your updated IDL?

@fdwr Good point! I will add one more operator which doesn't have any options in the POC CL. Thanks!

philloooo · 2024-06-18T20:51:24Z

@fdwr thanks for thinking through how it works with ONNX.
It seems that for WebNN error messages, just the operator label suffices.
Generating edge labels as suggested by Mingming seems more for attaching to DML entities that can be used for Chromium dev debugging, is that correct?

fdwr · 2024-06-18T21:15:15Z

> Generating edge labels as suggested by Mingming seems more for attaching to dml entities that can be used for chromium dev debugging, is that correct?

@philloooo: It might help some, but then DML also supports names on nodes anyway:

struct DML_OPERATOR_GRAPH_NODE_DESC 
{ 
    IDMLOperator* Operator; 
    _Field_z_ _Maybenull_ const char* Name;       <<<<<<<<
};

> thanks for thinking through how it works with onnx.

So I'm seeing these missing node names really with older ONNX models, whereas all of the more recent conversions/exports I've looked through have node names, and although they are technically still optional, in practice, they appear to be present.

> It seems that for webnn error messages, just the operator label suffices.

Given the above, I'm content with node labels only (and if we found some more value to edge labels in the future, it would be a simple non-breaking addition).

mingmingtasd · 2024-06-19T07:34:31Z

> Given the above, I'm content with node labels only (and if we found some more value to edge labels in the future, it would be a simple non-breaking addition).

Agree! Thanks for the discussion! 👍

This CL attempts to add label to MLOperator to report more detailed error message for prelu and resample2d. And other operators will be supported in a following separated CL. The related spec issue is under discussion- webmachinelearning/webnn#585.

Bug: 1273291
Change-Id: I4880e48eaa6c203bf5428b0672c73ca2beb8c76c
Cq-Include-Trybots: luci.chromium.try:win11-blink-rel,mac14.arm64-blink-rel,mac14-blink-rel, linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5528797
Reviewed-by: Phillis Tang <phillis@chromium.org>
Commit-Queue: Phillis Tang <phillis@chromium.org>
Reviewed-by: ningxin hu <ningxin.hu@intel.com>
Cr-Commit-Position: refs/heads/main@{#1322314}

* Add optional operator labels for more diagnosable error messages

  This adds an internal 'label' property to the operators that are created as a graph is constructed, which MAY (in the RFC 2119 sense) be used by implementations in async error messages. Developers populate this via a 'label' member in the options dictionary for MLGraphBuilder methods. A new MLOperatorOptions dictionary is defined, and all existing options dictionaries now inherit from this, and all relevant methods now take an options dictionary.

  Fixes #585
* Add note encouraging implementations
* Revise note to mention sync errors
* Update index.bs

  Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
* don't pass options twice

---------

Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>

anssiko added the feature request label Feb 27, 2024

fdwr changed the title ~~Consider using label to allow better error handling for async errors.~~ Consider adding node labels for more diagnosable error messages for async errors. Mar 7, 2024

inexorabletash mentioned this issue Jul 25, 2024

Add optional operator labels for more diagnosable error messages #742

Merged

fdwr closed this as completed in #742 Jul 30, 2024
