Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

fix(worker): Don't fail Worker on Activity Task decode failures #1473

Merged
merged 6 commits into from
Jul 24, 2024

Conversation

mjameswh
Copy link
Contributor

@mjameswh mjameswh commented Jul 23, 2024

What was changed and why?

  • Catch all unexpected exceptions while decoding Activity Tasks in TS Worker. That notably includes things such as: protobuf fields that are unexpectedly null, errors while decoding heartbeat details, errors while instantiating activity interceptors, etc.

    • These exceptions would previously result in failure of the Worker.
  • Also, include field names in error messages reported by tsToMs() and tsToDate() when the ts value is null or undefined.

    • This should make it easier to triage and diagnose a specific category of issues we have been observing recently due to a change in the server-side protobuf library.

Fix #1404.

@mjameswh mjameswh requested a review from a team as a code owner July 23, 2024 20:26
Copy link
Contributor

@antlai-temporal antlai-temporal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level comments:
-Is this better for a "DEBUG" mode, or we want workers never to fail in production? In general, you want to fail fast when an unrecoverable error happens in prod, and not continue unless you know for sure the state has not been messed up... We are catching all exceptions, can we be sure of that?
-Are activities going to keep retrying, or just fail the workflow straight away (better the later).

  • Can other things apart from Date be unexpectedly null?

Comment on lines +889 to +890
nonRetryable: false,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to retry? Parsing will fail again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Payload conversion error could be due to a transient issue (e.g. an error accessing KMS, or old worker getting a task that would work in newer worker).

I agree that retrying might sometime be problematic, but it was already that way previously, and redefining that behavior is outside the scope of this PR.

Comment on lines 919 to +920
break;
}
let args: unknown[];
try {
args = await decodeArrayFromPayloads(this.options.loadedDataConverter, task.start?.input);
} catch (err) {
} catch (e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just move the break after the catch, and just have one break

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's arguable, but intentional from my part. I personally feel that duplicating the break statement here makes it more obvious that we reached the conclusion of these logic branches. The output = { ... } statements don't stand out enough by themselves.

Comment on lines 889 to 922
try {
args = await decodeArrayFromPayloads(this.options.loadedDataConverter, task.start?.input);
} catch (err) {
} catch (e) {
const error = ensureApplicationFailure(e);
output = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if it is not an application failure? Can you add a comment?

Copy link
Contributor Author

@mjameswh mjameswh Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensureApplicationFailure is an already existing internal helper function.

/**
 * If `error` is already an `ApplicationFailure`, returns `error`.
 *
 * Otherwise, converts `error` into an `ApplicationFailure` with:
 *
 * - `message`: `error.message` or `String(error)`
 * - `type`: `error.constructor.name` or `error.name`
 * - `stack`: `error.stack` or `''`
 */
export function ensureApplicationFailure(error: unknown): ApplicationFailure {

I don't think there's more to say than this. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that was an assertion, never mind...

Comment on lines +1749 to +1751
throw ApplicationFailure.fromError(e, {
message: `Failed to parse heartbeat details for activity ${activityId}: ${errorMessage(e)}`,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this non-retriable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default is retryable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant whether we should make it non-retryable and fail the workflow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Payload conversion error could be due to transient issues.

throw ApplicationFailure.fromError(e, {
message: `Failed to parse heartbeat details for activity ${activityId}: ${errorMessage(e)}`,
});
}
return {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make sure that if this code throws we log and error and dump the task we got so we can inspect this better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added log, including proto.toJSON() and the unaltered protbuf-encoded buffer (in base64 notation).

2024-07-24T04:45:57.463Z [ERROR] Error while processing ActivityTask.start: Expected scheduledTime to be a timestamp, got null {
  sdkComponent: 'worker',
  taskQueue: 'test',
  error: TypeError: Expected scheduledTime to be a timestamp, got null
      at requiredTsToMs (/Users/jwatkins/Development/Temporal/TypeScriptSDK/sdk-typescript-2/packages/common/src/time.ts:37:11)
      at Worker.extractActivityInfo (/Users/jwatkins/Development/Temporal/TypeScriptSDK/sdk-typescript-2/packages/worker/src/worker.ts:1064:43)
      at runNextTicks (node:internal/process/task_queues:60:5)
      at listOnTimeout (node:internal/timers:540:9)
      at process.processTimers (node:internal/timers:514:7)
      at <anonymous> (/Users/jwatkins/Development/Temporal/TypeScriptSDK/sdk-typescript-2/packages/worker/src/worker.ts:870:28),
  task: '{"taskToken":"ODQxYjEyM2EtYjc5MS00ZmE3LThmODctMTI3YjgxOWZlZWUw","start":{"workflowExecution":{"workflowId":"wfid","runId":"runId"},"activityType":"dummy","currentAttemptScheduledTime":{"seconds":"0","nanos":0},"startedTime":{"seconds":"0","nanos":0},"attempt":1,"scheduleToCloseTimeout":{"seconds":"60","nanos":0},"startToCloseTimeout":{"seconds":"60","nanos":0},"heartbeatTimeout":{"seconds":"60","nanos":0}}}',
  taskEncoded: 'CiQ4NDFiMTIzYS1iNzkxLTRmYTctOGY4Ny0xMjdiODE5ZmVlZTAaNhoNCgR3ZmlkEgVydW5JZCoFZHVtbXlSBAgAEABaBAgAEABgAWoECDwQAHIECDwQAHoECDwQAA=='
}

args = await decodeArrayFromPayloads(this.options.loadedDataConverter, task.start?.input);
} catch (err) {
} catch (e) {
const error = ensureApplicationFailure(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd want to ensure we have error logs in addition to failing the tasks. Didn't verify that this exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment.

@mjameswh
Copy link
Contributor Author

Can other things apart from Date be unexpectedly null?

Of course. But for various reasons, timestamps appear more likely to pose issues than other data types in our specific use case.

@mjameswh mjameswh merged commit 4dd83c1 into temporalio:releases/1.10.x Jul 24, 2024
69 checks passed
@mjameswh mjameswh deleted the safe-activity-worker branch July 24, 2024 17:03
mjameswh added a commit to mjameswh/sdk-typescript that referenced this pull request Jul 24, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants