Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: add ability to perform comprehensions in stream maps #2002

Closed
haleemur opened this issue Oct 7, 2023 · 4 comments · Fixed by #2003
Closed

feat: add ability to perform comprehensions in stream maps #2002

haleemur opened this issue Oct 7, 2023 · 4 comments · Fixed by #2003
Labels
kind/Feature New feature or request valuestream/SDK

Comments

@haleemur
Copy link
Contributor

haleemur commented Oct 7, 2023

Feature scope

Taps (catalog, state, stream maps, tests, etc.)

Description

It would be helpful to have a way to parse out specific information from json fields. Currently my team is working on an integration where some information in an array of objects are considered too sensitive to bring into the data warehouse.

For instance, if a field custom_fields in stream test_data contains the json object [{"key": "a", "value": 1}, {"key": "b", "value": 2}, {"key": "c", "value": 10}], and the fields b & c are deemed sensitive and should not be brought into the data warehouse, then the following stream map definition would be ideal.

stream_maps:
    test_data:
        custom_fields: __NULL__
        custom_field_dict: "[x['value'] for x in custom_fields if x['key'] == 'a'][0]"

however this is not currently possible, it fails with the error

singer_sdk.exceptions.MapExpressionError: Failed to evaluate simpleeval expressions [x['value'] for x in custom_fields if x['key'] == 'a'][0]

Adding support for this should be fairly easy. simpleeval supports compound types through the EvalWithCompoundTypes class.

I believe the following changes would have to be made in mapper.py

# add new function
def compound_eval(expr, operators=None, functions=None, names=None):
    s = simpleeval.EvalWithCompoundTypes(operators=operators, functions=functions, names=names)
    return s.eval(expr)
# replace this (328 - 332):
            result: str | int | float = simpleeval.simple_eval(
                expr,
                functions=self.functions,
                names=names,
            )
# with this:
            result: str | int | float = compound_eval(
                expr,
                functions=self.functions,
                names=names,
            )
@edgarrmondragon
Copy link
Collaborator

Hey @haleemur! Thanks for logging. Do you know if there are any significant downsides to using the EvalWithCompoundTypes class?

For example, drastic performance penalties?

I'm more confident about catching regressions in common expressions since we have tests for those but the project's currently lacking performance regression tests.

@haleemur
Copy link
Contributor Author

haleemur commented Oct 8, 2023

Thank you @edgarrmondragon. I'll take a stab at writing out some performance tests this weekend.

My initial draft turned out a little more complicated than i had assumed.

the current way that type hints are provided (i.e. wrapping the transforming expression in an int()|float() etc). doesn't work when the transformed value is None, because int(None)|bool(None)|float(None) will throw a type error.

This makes it unwieldy to type out complex expressions (makes us increase complexity), which harm programmer ergonomics as well as performance.

for example:
int(dict([(x["id"], x["value"]) for x in custom_fields]).get(2)) will fail if the resulting dictionary does not have a key called 2 or it is explicitly set to None, and we have to use a ternary statement to get around this limitation.

int(dict([(x["id"], x["value"]) for x in custom_fields]).get(2)) if dict([(x["id"], x["value"]) for x in custom_fields]).get(2) else None

It would be nice to instead get the programmer to specify the resulting type explicitly in the tap config.

@haleemur
Copy link
Contributor Author

haleemur commented Oct 8, 2023

Was reading the simple-eval docs, and it is mentioned

One useful feature of using the SimpleEval object is that you can parse an expression once, and then evaluate it mulitple times using different names

Meltano SDK currently does not do this, and I think this optimization is going to benefit stream map performance immensely.

@edgarrmondragon
Copy link
Collaborator

It would be nice to instead get the programmer to specify the resulting type explicitly in the tap config.

@haleemur agreed! I've logged #2010

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
kind/Feature New feature or request valuestream/SDK
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants