Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Categorial batches in incremental models #11324

Open
3 tasks done
ehrete opened this issue Feb 19, 2025 · 0 comments
Open
3 tasks done

Categorial batches in incremental models #11324

ehrete opened this issue Feb 19, 2025 · 0 comments
Labels
enhancement New feature or request triage

Comments

@ehrete
Copy link

ehrete commented Feb 19, 2025

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Situation

We want to combine data from different source systems (e.g. SAP, SALESFORCE, ABC) with different transformation logic in a single target table

The current handling of this requirement --> One query with union all operators:

Model definition:

-- logic for source system SAP
SELECT
    *
FROM "SAP"."......."

-- logic for source system SALESFORCE
UNION ALL
SELECT
    *
FROM "SALESFORCE"."......."

-- logic for source system ABC
UNION ALL
SELECT
    *
FROM "ABC"."......."

Disadvantage of such approach:

Proposal / Feature request

  • New function Categorical batches
  • Should use a similar logic as the microbatch materialisation
  • Only active with incremental materialisation
  • Additional properties in the incremental model configuration
    • categorial_batch_column
    • categorial_batch_values

Example - Model definition

{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key = ["KEY_COLUMN"],

        -- Additional properties
        categorial_batch_column = 'SOURCE_SYSTEM',  --> the column after which is filtered
        categorial_batch_values = ["SAP", "SALESFORCE", "ABC"] --> 3 possible batches
    )
}}


with 

...

-- logic that transforms the sap data (potentially ephemeral model)
sap_data as (

    select
        A,
        B,
        C
        'SAP'  AS "SOURCE_SYSTEM"
    from ...

),

-- logic that transforms the salesforce data (potentially ephemeral model)
salesforce_data as (

    select
        A,
        B,
        C
        'SALESFORCE'  AS "SOURCE_SYSTEM"
    from ...

),

-- logic that transforms the abc data (potentially ephemeral model)
abc_data as (

    select
        A,
        B,
        C
        'ABC'  AS "SOURCE_SYSTEM"
    from ...    
),


-- concatenation of all sources
final as (

    select * from sap_data

    union all

    select * from salesforce_data

    union all

    select * from abc_data

)


select 
    *
from final

Compiled queries

There is one batch for each value in the variable categorial_batch_values. The propsed categorial batch feature packs the compiled model definition into a sub-query and adds the value as a filter. The database pushes the filter down and prunes out the irrelevant values (other source systems).

Batch 1: SAP Data

select 
    *
from 
(

        with 

        ...


        sap_data as (

            select
                A,
                B,
                C
                'SAP'  AS "SOURCE_SYSTEM"
            from ...

        ),

        salesforce_data as (

            select
                A,
                B,
                C
                'SALESFORCE'  AS "SOURCE_SYSTEM"
            from ...

        ),

        abc_data as (

            select
                A,
                B,
                C
                'ABC'  AS "SOURCE_SYSTEM"
            from ...    
        ),


        final as (

            select * from sap_data

            union all

            select * from salesforce_data

            union all

            select * from abc_data

        )


        select 
            *
        from final

) query

where SOURCE_SYSTEM = 'SAP'

Batch 2: Salesforce Data

select 
    *
from 
(

        with 

        ...


        final as (

            select * from sap_data

            union all

            select * from salesforce_data

            union all

            select * from abc_data

        )


        select 
            *
        from final

) query

where SOURCE_SYSTEM = 'SALESFORCE'

Batch 3: ABC Data

select 
    *
from 
(

        with 

        ...


        final as (

            select * from sap_data

            union all

            select * from salesforce_data

            union all

            select * from abc_data

        )


        select 
            *
        from final

) query

where SOURCE_SYSTEM = 'ABC'

Describe alternatives you've considered

Combine different transformation logic with Union all statement (see feature description)

Who will this benefit?

See feature description

Additional benefit: You can divide a demanding sql query into multiple small batches based on a categorial/textual column.

Are you interested in contributing this feature?

No response

Anything else?

No response

@ehrete ehrete added enhancement New feature or request triage labels Feb 19, 2025
@ehrete ehrete changed the title [Feature] <title>Categorial batches in incremental models Categorial batches in incremental models Feb 19, 2025
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

1 participant