-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
fix(worker): Don't fail Worker on Activity Task decode failures #1473
fix(worker): Don't fail Worker on Activity Task decode failures #1473
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level comments:
-Is this better for a "DEBUG" mode, or we want workers never to fail in production? In general, you want to fail fast when an unrecoverable error happens in prod, and not continue unless you know for sure the state has not been messed up... We are catching all exceptions, can we be sure of that?
-Are activities going to keep retrying, or just fail the workflow straight away (better the later).
- Can other things apart from Date be unexpectedly null?
nonRetryable: false, | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to retry? Parsing will fail again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Payload conversion error could be due to a transient issue (e.g. an error accessing KMS, or old worker getting a task that would work in newer worker).
I agree that retrying might sometime be problematic, but it was already that way previously, and redefining that behavior is outside the scope of this PR.
break; | ||
} | ||
let args: unknown[]; | ||
try { | ||
args = await decodeArrayFromPayloads(this.options.loadedDataConverter, task.start?.input); | ||
} catch (err) { | ||
} catch (e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just move the break after the catch, and just have one break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's arguable, but intentional from my part. I personally feel that duplicating the break statement here makes it more obvious that we reached the conclusion of these logic branches. The output = { ... }
statements don't stand out enough by themselves.
packages/worker/src/worker.ts
Outdated
try { | ||
args = await decodeArrayFromPayloads(this.options.loadedDataConverter, task.start?.input); | ||
} catch (err) { | ||
} catch (e) { | ||
const error = ensureApplicationFailure(e); | ||
output = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if it is not an application failure? Can you add a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensureApplicationFailure
is an already existing internal helper function.
/**
* If `error` is already an `ApplicationFailure`, returns `error`.
*
* Otherwise, converts `error` into an `ApplicationFailure` with:
*
* - `message`: `error.message` or `String(error)`
* - `type`: `error.constructor.name` or `error.name`
* - `stack`: `error.stack` or `''`
*/
export function ensureApplicationFailure(error: unknown): ApplicationFailure {
I don't think there's more to say than this. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that was an assertion, never mind...
throw ApplicationFailure.fromError(e, { | ||
message: `Failed to parse heartbeat details for activity ${activityId}: ${errorMessage(e)}`, | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this non-retriable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the default is retryable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant whether we should make it non-retryable and fail the workflow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Payload conversion error could be due to transient issues.
throw ApplicationFailure.fromError(e, { | ||
message: `Failed to parse heartbeat details for activity ${activityId}: ${errorMessage(e)}`, | ||
}); | ||
} | ||
return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make sure that if this code throws we log and error and dump the task we got so we can inspect this better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added log, including proto.toJSON()
and the unaltered protbuf-encoded buffer (in base64 notation).
2024-07-24T04:45:57.463Z [ERROR] Error while processing ActivityTask.start: Expected scheduledTime to be a timestamp, got null {
sdkComponent: 'worker',
taskQueue: 'test',
error: TypeError: Expected scheduledTime to be a timestamp, got null
at requiredTsToMs (/Users/jwatkins/Development/Temporal/TypeScriptSDK/sdk-typescript-2/packages/common/src/time.ts:37:11)
at Worker.extractActivityInfo (/Users/jwatkins/Development/Temporal/TypeScriptSDK/sdk-typescript-2/packages/worker/src/worker.ts:1064:43)
at runNextTicks (node:internal/process/task_queues:60:5)
at listOnTimeout (node:internal/timers:540:9)
at process.processTimers (node:internal/timers:514:7)
at <anonymous> (/Users/jwatkins/Development/Temporal/TypeScriptSDK/sdk-typescript-2/packages/worker/src/worker.ts:870:28),
task: '{"taskToken":"ODQxYjEyM2EtYjc5MS00ZmE3LThmODctMTI3YjgxOWZlZWUw","start":{"workflowExecution":{"workflowId":"wfid","runId":"runId"},"activityType":"dummy","currentAttemptScheduledTime":{"seconds":"0","nanos":0},"startedTime":{"seconds":"0","nanos":0},"attempt":1,"scheduleToCloseTimeout":{"seconds":"60","nanos":0},"startToCloseTimeout":{"seconds":"60","nanos":0},"heartbeatTimeout":{"seconds":"60","nanos":0}}}',
taskEncoded: 'CiQ4NDFiMTIzYS1iNzkxLTRmYTctOGY4Ny0xMjdiODE5ZmVlZTAaNhoNCgR3ZmlkEgVydW5JZCoFZHVtbXlSBAgAEABaBAgAEABgAWoECDwQAHIECDwQAHoECDwQAA=='
}
args = await decodeArrayFromPayloads(this.options.loadedDataConverter, task.start?.input); | ||
} catch (err) { | ||
} catch (e) { | ||
const error = ensureApplicationFailure(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd want to ensure we have error logs in addition to failing the tasks. Didn't verify that this exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment.
Of course. But for various reasons, timestamps appear more likely to pose issues than other data types in our specific use case. |
…oralio#1473) (cherry picked from commit 4dd83c1)
What was changed and why?
Catch all unexpected exceptions while decoding Activity Tasks in TS Worker. That notably includes things such as: protobuf fields that are unexpectedly null, errors while decoding heartbeat details, errors while instantiating activity interceptors, etc.
Also, include field names in error messages reported by
tsToMs()
andtsToDate()
when thets
value is null or undefined.Fix #1404.