docs: Documented examples of stream glob expressions and property aliasing (#2595)
edgarrmondragon committed Aug 8, 2024
1 parent 75eb494 commit 2df9b04
Showing 6 changed files with 89 additions and 37 deletions.
@@ -41,6 +41,9 @@ singer-sdk = { version="~=0.39.1", extras = ["testing"] }
[tool.poetry.extras]
s3 = ["fs-s3fs"]

[tool.pytest.ini_options]
addopts = '--durations=10'

[tool.mypy]
python_version = "3.12"
warn_unused_configs = true
@@ -51,6 +51,9 @@ singer-sdk = { version="~=0.39.1", extras = ["testing"] }
[tool.poetry.extras]
s3 = ["fs-s3fs"]

[tool.pytest.ini_options]
addopts = '--durations=10'

[tool.mypy]
python_version = "3.12"
warn_unused_configs = true
@@ -43,6 +43,13 @@ singer-sdk = { version="~=0.39.1", extras = ["testing"] }
[tool.poetry.extras]
s3 = ["fs-s3fs"]

[tool.pytest.ini_options]
addopts = '--durations=10'

[tool.mypy]
python_version = "3.12"
warn_unused_configs = true

[tool.ruff]
src = ["{{cookiecutter.library_name}}"]
target-version = "py38"
90 changes: 73 additions & 17 deletions docs/stream_maps.md
@@ -435,21 +435,7 @@ stream_maps:
```
````

#### Q: What is the difference between `primary_keys` and `key_properties`?

**A:** These two are _generally_ identical - and will only differ in cases like the above where `key_properties` is manually
overridden or nullified by the user of the tap. Developers will specify `primary_keys` for each stream in the tap,
but they do not control if the user will override `key_properties` behavior when initializing the stream. Primary keys
describe the nature of the upstream data as known by the source system. However, either through manual catalog manipulation and/or by
setting stream map transformations, the in-flight dedupe keys (`key_properties`) may be overridden or nullified by the user at any time.

Additionally, some targets do not support primary key distinctions, and there are valid use cases to intentionally unset
the `key_properties` in an extract-load pipeline. For instance, it is common to intentionally nullify key properties to trigger
"append-only" loading behavior in certain targets, as may be required for historical reporting. This does not change the
underlying nature of the `primary_keys` configuration in the upstream source data, only how it will be landed or deduped
in the downstream target.

## Aliasing a stream using `__alias__`
### Aliasing a stream using `__alias__`

To alias a stream, simply add the operation `"__alias__": "new_name"` to the stream
definition. For example, to alias the `customers` stream as `customer_v2`, use the
@@ -475,7 +461,7 @@ stream_maps:
```
````

## Duplicating or splitting a stream using `__source__`
### Duplicating or splitting a stream using `__source__`

To create a new stream as a copy of the original, specify the operation
`"__source__": "stream_name"`. For example, you can create a copy of the `customers` stream
@@ -519,7 +505,7 @@ stream_maps:
```
````

## Filtering out records from a stream using `__filter__` operation
### Filtering out records from a stream using `__filter__` operation

The `__filter__` operation accepts a string expression which must evaluate to `true` or
`false`. Filter expressions should be wrapped in `bool()` to ensure proper type conversion.
@@ -546,6 +532,62 @@ stream_maps:
```
````
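
The filter semantics described above can be sketched in plain Python. This is a minimal illustration, not the SDK's implementation: `keep_record` is a hypothetical helper, the sample records are made up, and plain `eval()` with builtins disabled stands in for the SDK's own expression evaluator purely for brevity (do not use `eval()` on untrusted input).

```python
def keep_record(record: dict, filter_expr: str) -> bool:
    """Return True if the record passes the filter expression.

    Hypothetical helper: record properties are exposed as variable names,
    and only `bool` is made available to the expression.
    """
    allowed_names = {"__builtins__": {}, "bool": bool}
    return bool(eval(filter_expr, allowed_names, dict(record)))


# Made-up sample data for illustration only.
records = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "CA"},
]

# Records for which the expression evaluates to False are dropped.
kept = [r for r in records if keep_record(r, "bool(country == 'US')")]
print(kept)
```

Wrapping the expression in `bool()` mirrors the recommendation above: it guarantees a strict `True`/`False` result even when the inner expression yields some other truthy or falsy value.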

### Aliasing properties

To alias (rename) a property, use a "copy-and-delete" approach: map the new property name to the old one, then remove the old property by setting it to `__NULL__`:

````{tab} meltano.yml
```yaml
stream_maps:
  customers:
    new_field: old_field
    old_field: __NULL__
```
````

````{tab} JSON
```json
{
  "stream_maps": {
    "customers": {
      "new_field": "old_field",
      "old_field": "__NULL__"
    }
  }
}
```
````
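
To make the effect of the copy-and-delete configuration concrete, here is a rough Python sketch of what happens to a single record. It is an assumption-laden illustration rather than the SDK's code: `apply_property_map` is a hypothetical function, and it treats each mapping value as a bare property name, whereas real stream maps accept arbitrary expressions.

```python
NULL_SENTINEL = "__NULL__"  # marker used in the configuration above


def apply_property_map(record: dict, prop_map: dict) -> dict:
    """Hypothetical helper: copy aliased properties, then drop nulled ones."""
    result = dict(record)
    # First pass: copy values, always reading from the original record.
    for target, source in prop_map.items():
        if source != NULL_SENTINEL:
            result[target] = record[source]
    # Second pass: remove every property mapped to __NULL__.
    for target, source in prop_map.items():
        if source == NULL_SENTINEL:
            result.pop(target, None)
    return result


# Made-up record for illustration only.
record = {"id": 1, "old_field": "abc"}
mapped = apply_property_map(
    record, {"new_field": "old_field", "old_field": "__NULL__"}
)
print(mapped)
```

Reading from the original record in the first pass means the copy still succeeds even though the same map removes `old_field`.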

### Applying a mapping across two or more streams

You can use glob expressions to apply a stream map configuration to more than one stream:

````{tab} meltano.yml
```yaml
stream_maps:
  "*":
    name: first_name
    first_name: __NULL__
```
````

````{tab} JSON
```json
{
  "stream_maps": {
    "*": {
      "name": "first_name",
      "first_name": "__NULL__"
    }
  }
}
```
````

:::{versionadded} 0.37.0
Support for stream glob expressions.
:::
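
As a rough picture of how a glob key selects streams, the sketch below uses shell-style matching from Python's standard `fnmatch` module. Whether the SDK applies exactly these matching rules is an assumption; `maps_for_stream`, the `cust*` pattern, and the stream names are hypothetical and exist only for illustration.

```python
from fnmatch import fnmatchcase


def maps_for_stream(stream_name: str, stream_maps: dict) -> list:
    """Hypothetical helper: return every map whose key matches the stream."""
    return [
        stream_map
        for pattern, stream_map in stream_maps.items()
        if fnmatchcase(stream_name, pattern)
    ]


alias_map = {"name": "first_name", "first_name": "__NULL__"}

# "*" matches every stream name.
all_streams = {"*": alias_map}
# A narrower, made-up pattern that only matches names starting with "cust".
some_streams = {"cust*": alias_map}

for stream in ["customers", "orders"]:
    print(
        stream,
        len(maps_for_stream(stream, all_streams)),
        len(maps_for_stream(stream, some_streams)),
    )
```

`fnmatchcase` is used instead of `fnmatch` so that matching is case-sensitive on every platform.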

### Understanding Filters' Effects on Parent-Child Streams

Nested child stream iterations will be skipped if their parent stream has a record-level
@@ -625,3 +667,17 @@ Additionally, plugins are generally expected to fail if they receive unexpected
arguments. The intended use cases for stream map config values are user-defined in nature
(such as the hashing use case defined above), and are unlikely to overlap with the
plugin's already-existing settings.

### Q: What is the difference between `primary_keys` and `key_properties`?

**Answer:** These two are _generally_ identical - and will only differ in cases like the above where `key_properties` is manually
overridden or nullified by the user of the tap. Developers will specify `primary_keys` for each stream in the tap,
but they do not control if the user will override `key_properties` behavior when initializing the stream. Primary keys
describe the nature of the upstream data as known by the source system. However, either through manual catalog manipulation and/or by
setting stream map transformations, the in-flight dedupe keys (`key_properties`) may be overridden or nullified by the user at any time.

Additionally, some targets do not support primary key distinctions, and there are valid use cases to intentionally unset
the `key_properties` in an extract-load pipeline. For instance, it is common to intentionally nullify key properties to trigger
"append-only" loading behavior in certain targets, as may be required for historical reporting. This does not change the
underlying nature of the `primary_keys` configuration in the upstream source data, only how it will be landed or deduped
in the downstream target.
18 changes: 2 additions & 16 deletions poetry.lock


5 changes: 1 addition & 4 deletions pyproject.toml
@@ -83,7 +83,6 @@ pyarrow = { version = ">=13", optional = true }

# Testing dependencies installed as optional 'testing' extras
pytest = {version=">=7.2.1", optional = true}
pytest-durations = {version = ">=1.2.0", optional = true}

# installed as optional 'faker' extra
faker = {version = ">=22.5,<27.0", optional = true}
@@ -110,7 +109,6 @@ docs = [
s3 = ["fs-s3fs"]
testing = [
"pytest",
"pytest-durations"
]
parquet = ["numpy", "pyarrow"]
faker = ["faker"]
@@ -143,7 +141,7 @@ types-PyYAML = ">=6.0.12"
pytest-codspeed = ">=2.2.0"

[tool.pytest.ini_options]
addopts = '--ignore=singer_sdk/helpers/_simpleeval.py -m "not external"'
addopts = '--durations=10 --ignore=singer_sdk/helpers/_simpleeval.py -m "not external"'
filterwarnings = [
"error",
"ignore:Could not configure external gitlab tests:UserWarning",
@@ -251,7 +249,6 @@ DEP002 = [
"sphinx-reredirects",
# Plugins
"fs-s3fs",
"pytest-durations",
]

[tool.mypy]