Feature: Returning informative exit codes #1409

aaronsteers · 2023-02-10T01:19:43Z

As a way of communicating back to the orchestrator, it would be helpful to have 10-15 predefined exit codes for common failure scenarios.

Guidelines and best practices for exit codes

From https://tldp.org/LDP/abs/html/exitcodes.html:

exit codes 1 - 2, 126 - 165, and 255 have special meanings, and should therefore be avoided for user-specified exit parameters.

From https://unix.stackexchange.com/a/604262/487180:

If you are making something that could be turned into a service, it's good to avoid conflicts with (or reuse meaning from) systemd's exit codes which defines code 2-7,200-242. This link also references BSD codes 64-78.

Therefore, a foundational strategy could be:

Use existing codes when there is an exact match with existing convention. (E.g. Keyboard interrupt)
Use custom Exit Codes between 3 and 125.
For each category, if we think the detailed codes are not fully inclusive, then reserve an "other/general" integer for that category.

Grouping of suggested return types, by category and remediation path

Each of these categories and sub-items could have a distinct return code so that the caller can understand what happened during the requested operation:

Success. (0)
- Return Code: 0
- Orchestrator action: Nothing to do, job was successful.
No-op Warning. (3)
- Proposed Exit Code: 3
- Orchestrator action: Tell the user that zero records were synced. The orchestrator can opt in to this behavior without user intervention, treating the return code as a non-fatal warning and optionally recording a warning in the logs to say that zero records were synced.
- Configuration option:
  - enable_noop_exit_code: True to return non-0 exit code for no-op sync operations. False to return 0. Default is False (no-op sync tracked as success.)
Aborted with Partial Success. (Future) (4-9)
- Description: Signifies that incremental progress was made, despite the connector receiving an abort request. The orchestrator should expect that continually retrying will eventually result in a full sync operation. The connector should not return partial success if retrying will result in an infinite loop. Instead, a "Process Abort" message (defined in the section below) must be sent. Here is how to determine if successive retries will result in an infinite loop:
  1. The sync operation must not be a no-op. Meaning, either one or more FULL_TABLE syncs were fully completed, or one or more INCREMENTAL state messages were successfully delivered as resumable bookmarks.
  2. If repeated, the process must be designed to eventually catch up or report a proper failure message. Meaning one or both of these is true:
    - The tap successfully completed all FULL_TABLE streams and reached at least one resumable bookmark for an INCREMENTAL stream. (Full table syncs may need to be ordered first by the tap to prioritize a partial success status. Not yet implemented in the SDK.)
    - OR: The tap uses STATE to resume sync on the same stream where the previous sync left off. (Not yet implemented in the SDK.)
- Orchestrator action: Nothing to do, job was partially successful. Report that at least some progress was made, and user should run again to get more records.
- Proposed Exit Codes:
  - 4 SIGTERM / KeyboardInterrupt received and sync operation was wrapped up successfully; sync is resumable.
  - 5 Max record volume limit reached; sync is resumable. (Additional records available on source.)
  - 6 Max elapsed time limit reached; sync is resumable. (Additional records available on source.)
  - 7-8 Reserved for future use.
  - 9 General/Other. (Additional records available on source.)
Process Abort. (10-19, 130, 137)
- Orchestrator action: Nothing to do, process was aborted by user or by user's config parameters.
- Proposed Return Codes: 10-19, 130, 137
  - 10 Operation aborted due to elapsed time restriction.
  - 11 Operation aborted due to record count restriction.
  - 130 Operation aborted by SIGINT or KeyboardInterrupt (Control+C).
  - 137 Operation aborted by SIGKILL.
  - 12-18 Reserved for future use.
  - 19 General/Other.
Configuration Error. (20-29)
- Orchestrator action: Inform the user to double-check their config. Provide documentation links to the end user to help them resolve.
- Remediation: This category generally requires user action. This does not necessarily imply there's a problem in the tap or target. More likely, the user just needs to take another pass at reviewing the config documentation, and/or double-check their credentials. Worst case scenario, this could indicate stale or incomplete documentation.
- Proposed Return Codes: 20-29
  - 20 Config validation error: missing required value.
  - 21 Config validation error: data type mismatch.
  - 22 Config validation error: validation failed (other).
  - 23 Authentication or authorization error. (Permission denied, password incorrect, etc.)
  - 24 Invalid input file paths. (For instance, the config.json or catalog.json do not exist or cannot be reached.)
  - 25-28 Reserved for future use.
  - 29 General/Other.
Environment Error. (Network, Files, or other Resources) (30-39)
- Orchestrator action: Nothing to do, tell the user what happened so they can take action re: RAM, storage, or networking.
- Remediation: Tell the user the issue: out of memory, out of storage space, or unreachable server.
- Proposed Return Codes: 30-39
  - 30 Out of memory.
  - 31 Out of disk space.
  - 32 Network issue or host-not-found.
  - 33 File not found.
  - 34 File not writeable.
  - 35-38 Reserved for future use.
  - 39 General/Other.
Connector Failure. (1, 40-59, 141)
- Orchestrator action: Tell the user there appears to be a bug in the tap or target.
- Remediation: This class of issues indicates a bug in the connector or in the backend API.
- Proposed Return Codes: 1, 40-59, 141
  - Shared failure codes (taps, targets, and mappers):
    - 40 Singer Spec error in STDIN stream or input files.
    - 41-48 Reserved for future use.
    - 49 or 1 General/Other. Connector experienced unhandled exception.
  - Tap-specific failures:
    - 50 Misshapen data from source system or source data failed validation.
    - 51 Source data processing error.
    - 141 Target stopped listening ("Broken pipe")
      - Orchestrator action: Tell the user that the target appears to have failed. (Check target's exit code for more info.)
    - 52-54 Reserved for future use.
  - Target-specific errors:
    - 55 Data validation error in input stream.
      - Orchestrator action: Tell the user (of the target) that there appears to be a bug in the upstream tap.
      - Remediation: This class of error, raised only by targets, indicates a failure that actually occurred upstream in the tap.
    - 56 Data processing error.
    - 57-59 Reserved for future use.
Application or API Failure (Custom). (60-79)
- These 20 codes can be used for any custom exit codes that a connector would like to report. Each connector can emit codes that are specific to its use case, and these do not have to be aligned across applications.
- Proposed to define 2 groups:
  - 60-69 Application Failures (Retriable) - These are likely to succeed if retried later.
  - 70-79 Application Failures (Non-Retriable) - These are unlikely to succeed unless action is taken by the user.
- E.g. Redshift could report "S3 bucket in wrong region" (non-retriable) and "cannot query table due to vacuum operation already in progress" (retriable), while SQLite could report "cannot obtain write access lock (WAL) on filesystem" (retriable).
- Orchestrator action: Treat this as a handled exception from the developer: inform the user what the error code is and ask them to check the logs for more information. Optionally print the tail of the log because this is likely to contain the specific error description as authored by the developer.
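
The grouping above could be captured in code roughly as follows. This is only a sketch of the tentative proposal, not an existing SDK API: the `ExitCode` enum, its member names, and the `category_for` helper are hypothetical names chosen for illustration, and the integer values are the proposed (not final) ones listed above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Tentative exit codes from the proposal (hypothetical names)."""

    SUCCESS = 0
    UNHANDLED_EXCEPTION = 1
    NOOP_WARNING = 3
    PARTIAL_SIGTERM = 4
    PARTIAL_MAX_RECORDS = 5
    PARTIAL_MAX_TIME = 6
    PARTIAL_OTHER = 9
    ABORT_TIME_LIMIT = 10
    ABORT_RECORD_LIMIT = 11
    ABORT_OTHER = 19
    CONFIG_MISSING_VALUE = 20
    CONFIG_TYPE_MISMATCH = 21
    CONFIG_VALIDATION_OTHER = 22
    CONFIG_AUTH_ERROR = 23
    CONFIG_INVALID_FILE_PATH = 24
    CONFIG_OTHER = 29
    ENV_OUT_OF_MEMORY = 30
    ENV_OUT_OF_DISK = 31
    ENV_NETWORK = 32
    ENV_FILE_NOT_FOUND = 33
    ENV_FILE_NOT_WRITEABLE = 34
    ENV_OTHER = 39
    CONNECTOR_SPEC_ERROR = 40
    CONNECTOR_OTHER = 49
    TAP_BAD_SOURCE_DATA = 50
    TAP_PROCESSING_ERROR = 51
    TARGET_INPUT_VALIDATION = 55
    TARGET_PROCESSING_ERROR = 56
    ABORT_SIGINT = 130
    ABORT_SIGKILL = 137
    TAP_BROKEN_PIPE = 141


# (first code, last code, category) for the contiguous ranges in the proposal.
_CATEGORY_RANGES = [
    (0, 0, "Success"),
    (3, 3, "No-op Warning"),
    (4, 9, "Aborted with Partial Success"),
    (10, 19, "Process Abort"),
    (20, 29, "Configuration Error"),
    (30, 39, "Environment Error"),
    (40, 59, "Connector Failure"),
    (60, 69, "Application Failure (Retriable)"),
    (70, 79, "Application Failure (Non-Retriable)"),
]

# Codes outside the contiguous ranges that reuse existing conventions.
_CATEGORY_SINGLES = {
    1: "Connector Failure",
    130: "Process Abort",
    137: "Process Abort",
    141: "Connector Failure",
}


def category_for(code: int) -> str:
    """Map an exit code to its proposed category, or 'Unknown'."""
    if code in _CATEGORY_SINGLES:
        return _CATEGORY_SINGLES[code]
    for first, last, name in _CATEGORY_RANGES:
        if first <= code <= last:
            return name
    return "Unknown"
```

Keeping the "reserved" integers out of the enum while still covering them in the range table means a future code such as 25 would already be classified as a configuration error by older orchestrators.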

For purposes of monitoring and reporting the quality and stability of taps and targets, really only the "Connector Failure" codes are relevant here. The "Configuration Errors" category might also be a sign of poor or outdated docs. Assuming the other errors are correctly raised, all other issue groups are: user errors, OS/container issues, or networking issues.

Why do we need this?

Today orchestrators like Meltano have no way to distinguish what actually happened if a subprocess fails - except for a human to manually read over the detailed log files. By adding this into the SDK, the return code of the subprocess would immediately tell Meltano how to advise the user on next steps. Other orchestrators like Airflow could also incorporate these return codes when deciding whether to attempt a retry, and how to message back to users on next steps.
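
As a rough illustration of the orchestrator side, the sketch below runs a connector as a subprocess and turns its return code into advice. The `advise` and `run_and_advise` functions and the advice strings are hypothetical, the ranges are the tentative ones proposed above, and treating every unrecognized code as a possible connector bug is an assumption of this sketch.

```python
import subprocess
import sys


def advise(returncode: int) -> str:
    """Translate a proposed exit code into a next step for the user."""
    if returncode == 0:
        return "success"
    if returncode == 3:
        return "warning: zero records were synced"
    if 4 <= returncode <= 9:
        return "partial success: run again to get more records"
    if 10 <= returncode <= 19 or returncode in (130, 137):
        return "aborted by user or by config limits"
    if 20 <= returncode <= 29:
        return "configuration error: double-check config and credentials"
    if 30 <= returncode <= 39:
        return "environment error: check memory, storage, or networking"
    if 60 <= returncode <= 69:
        return "application failure: likely to succeed if retried later"
    if 70 <= returncode <= 79:
        return "application failure: user action required, check the logs"
    # Assumption: 1, 40-59, 141, and anything unrecognized are reported
    # as a possible bug in the tap or target.
    return "possible connector bug"


def run_and_advise(cmd: list[str]) -> str:
    """Run a connector command and return advice based on its exit code."""
    completed = subprocess.run(cmd, check=False)
    return advise(completed.returncode)


if __name__ == "__main__":
    # Stand-in for a real tap: a child process that exits with code 23.
    fake_tap = [sys.executable, "-c", "import sys; sys.exit(23)"]
    print(run_and_advise(fake_tap))
```

An orchestrator with retry support could key its retry policy off the same ranges, for example retrying only on 60-69.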

Regarding "partial success" codes


All of the partial-success codes discussed here should probably have some config option to let them return 0 status if the caller doesn't care about one or all of the detailed status codes.

There are use cases where we want to open up the idea of "partial" success - but importantly to tell the caller of the process what actually happened that made the sync not a "full" success.
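
The two conditions listed under "Aborted with Partial Success" (the sync was not a no-op, and repeated runs will eventually catch up) could be checked along these lines. This is a minimal sketch: `SyncSummary`, its fields, and `may_report_partial_success` are hypothetical names, and nothing like this is implemented in the SDK today.

```python
from dataclasses import dataclass


@dataclass
class SyncSummary:
    """Hypothetical bookkeeping a tap would need at abort time."""

    full_table_streams_completed: int
    full_table_streams_total: int
    incremental_bookmarks_emitted: int
    resumes_from_state: bool  # tap resumes the same stream from STATE


def may_report_partial_success(summary: SyncSummary) -> bool:
    """Return True only if retrying cannot loop forever without progress."""
    # Condition 1: the sync must not be a no-op.
    made_progress = (
        summary.full_table_streams_completed > 0
        or summary.incremental_bookmarks_emitted > 0
    )
    # Condition 2: repeated runs eventually catch up (either branch suffices).
    full_tables_done_with_bookmark = (
        summary.full_table_streams_completed == summary.full_table_streams_total
        and summary.incremental_bookmarks_emitted > 0
    )
    converges = full_tables_done_with_bookmark or summary.resumes_from_state
    return made_progress and converges
```

When this returns False, the connector would fall back to one of the "Process Abort" codes instead of a partial-success code.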

For instance, if running in lambda, we will need an execution time limit. At the end of that time limit (provided in config.json, most likely), we'll expect the tap to try to wrap things up and close out its processes. Its return value in these cases should be 0 if all upstream records were successfully received within the window, or something non-zero if more records were available which were not synced.
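
A minimal sketch of that time-limit behavior, assuming the tentative code 6 ("max elapsed time limit reached; sync is resumable") from the proposal. The function name, its parameters, and the injectable `clock` are hypothetical and only there to make the idea concrete and testable.

```python
import time
from typing import Callable, Iterable

EXIT_SUCCESS = 0
EXIT_PARTIAL_MAX_TIME = 6  # tentative code from the proposal

_DONE = object()  # sentinel marking an exhausted source


def sync_with_time_limit(
    records: Iterable[dict],
    max_seconds: float,
    emit: Callable[[dict], None],
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Emit records until the source is exhausted or the time limit passes."""
    deadline = clock() + max_seconds
    iterator = iter(records)
    while True:
        if clock() >= deadline:
            # Time is up: peek once to learn whether records remain upstream.
            # The peeked record is not emitted; in a real tap the last STATE
            # bookmark would let the next run pick it up again.
            leftover = next(iterator, _DONE)
            return EXIT_SUCCESS if leftover is _DONE else EXIT_PARTIAL_MAX_TIME
        record = next(iterator, _DONE)
        if record is _DONE:
            return EXIT_SUCCESS
        emit(record)
```

A caller would pass the result straight to `sys.exit()`, so that the orchestrator can tell "finished" from "stopped early, more data available".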

An orchestrator like Meltano will also want to know the difference between "Sync complete" and "Sync complete (no data found)". By providing a non-zero return code for the "no data found" case, we let Meltano message this properly to the user - rather than only being able to provide a simple "sync completed" message.
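
The opt-in no-op behavior could look roughly like this, using the `enable_noop_exit_code` option and the tentative code 3 from the proposal above. The helper name and the idea of passing the config as a plain dict are assumptions made for this sketch.

```python
EXIT_SUCCESS = 0
EXIT_NOOP_WARNING = 3  # tentative code from the proposal


def exit_code_for_completed_sync(records_synced: int, config: dict) -> int:
    """Pick the exit code for a sync that ran to completion without errors."""
    # Default is False: a no-op sync is still reported as plain success.
    noop_enabled = bool(config.get("enable_noop_exit_code", False))
    if records_synced == 0 and noop_enabled:
        return EXIT_NOOP_WARNING
    return EXIT_SUCCESS
```

Because the option defaults to False, callers that only check for a zero exit code keep working unchanged.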

Precedent and existing return code conventions


Below is a subset of Linux return codes found with some googling. We don't need to use these integer codes, and we don't need to keep these categories. They are listed here for discussion/inspiration.

Exit Code	Description	Notes	Retry/Recovery Logic
0	Success	As today: use for clean success	N/A
1	Error	Generic failure. Alternatively, consider specific code like 255 for "other/unknown failure"	Undefined
2	No such file or directory	If a config file or catalog file does not exist.	Don't retry
6	No such device or address
7	Argument list too long
8	Exec format error
12	Cannot allocate memory
13	Permission denied
28	No space left on device
30	Read-only file system
32	Broken pipe
54	Exchange full
61	No data available	(Opt-in behavior.) Raise if tap runs to completion but no records are returned.	No retry
62	Timer expired	(Opt-in behavior.) Raise if a timer or record limit is reached but more records remain unparsed.	Retry decision up to orchestrator.
71	Protocol error	Something doesn't match the Singer Spec	No retry
75	Value too large for defined data type	Maybe combine this with generic "data validation failure" category.	No auto-retry.
84	Invalid or incomplete multibyte or wide character	Maybe combine this with generic "data validation failure" category.	No auto-retry.
90	Message too long
104	Connection reset by peer	Server may be busy or experiencing an issue.	Orchestrator can retry with backoff.
110	Connection timed out	Server may be busy or experiencing an issue.	Orchestrator can retry with backoff.
111	Connection refused	Server may be busy or network issues exist.	Orchestrator can retry with backoff.
112	Host is down	Server may be busy or network issues exist.	Orchestrator can retry with backoff.
113	No route to host	Likely network issue or configuration issue.	Orchestrator can retry with backoff. May require operator to research.
114	Operation already in progress	If the tap or target can detect this, throw an error.
125	Operation canceled	User-requested abort, e.g. Ctrl+C	Orchestrator should ask the user what to do next.


aaronsteers · 2023-03-03T04:03:35Z

I've updated the above so that each error type is grouped together with similar errors. And I have added specific notes about the actions Meltano (or another orchestrator) might take on seeing a given code.

With regard to monitoring/reporting connector quality, "Group F" and "Group G" are the ones we'd watch for.

We'd generally also want to watch for Group D (configuration errors) as a sign of poor or outdated documentation.

cc @tayloramurphy, @DouweM, @pnadolny13

tayloramurphy · 2023-03-03T15:45:16Z

@aaronsteers I made https://github.com/meltano/internal-product/issues/187 to track the meta requirements

aaronsteers · 2023-03-03T20:33:42Z

@tayloramurphy - sounds good! I've added to the office hours board to collect ideas.

Spec-wise, and in terms of defining the path forward from here:

The last piece I'm not sure of would be which specific codes to use. We could try to find precedent of integer codes used already in prior art, or we could just start fresh and declare a new domain of custom return code integer values.

aaronsteers · 2023-03-25T05:03:39Z

I've updated the issue description to include a set of proposed (and tentative) exit code integers and reserved ranges.

Feedback and counter-suggestions much appreciated.

laurentS · 2023-06-23T10:16:22Z

Short comment to say I like this a lot! One situation we would like to report back is when the tap runs out of quota. I'm not sure which code above I'd use for this specific situation (maybe a custom one for the specific tap, though I feel like this use-case might be general enough to consider adding it to the sdk itself?).

aaronsteers changed the title ~~Feature: Option to return informative exit codes~~ Feature: Returning informative exit codes Feb 10, 2023

tayloramurphy added kind/Feature New feature or request valuestream/SDK labels Feb 10, 2023

This was referenced Feb 21, 2023

feat: Set Stream._MAX_RECORDS_LIMIT during tap testing #1399

Closed

fix: handle sync abort, reduce duplicate STATE messages, rename _MAX_RECORD_LIMIT as ABORT_AT_RECORD_COUNT #1436

Merged

aaronsteers added this to Office Hours Mar 3, 2023

github-project-automation bot moved this to To Discuss in Office Hours Mar 3, 2023

aaronsteers added the Accepting Pull Requests label Mar 3, 2023

aaronsteers added this to the v1.0 Release milestone Mar 3, 2023

aaronsteers moved this from To Discuss to Up Next in Office Hours Mar 8, 2023

aaronsteers moved this from Up Next to Discussed in Office Hours Mar 15, 2023

tayloramurphy mentioned this issue Mar 25, 2023

Error Handling and dead letter queues for targets #133

Open

tayloramurphy mentioned this issue May 1, 2023

Monitor tap and target process memory usage and warn when limit is exceeded meltano/meltano#2476

Closed

tayloramurphy mentioned this issue May 18, 2023

SDK v1 Release meltano/product#1

Open

This was referenced Nov 28, 2024

Fail/warn when there are no selected streams #2779

Open

refactor: Fail early if input files to --catalog or --state do not exist #2788

Merged
