Include required columns in exposures? #3533

louis-vines · 2021-07-05T11:21:00Z

Describe the feature

I'm not sure what other people think of this but I think it would be nice feature to be able define required columns in exposures? This would mean that the exposure definitions would be similar to API definitions and thus we could make our DBT runs fail if we remove a column which is required by an exposure. Feeback on this idea welcome...

Describe alternatives you've considered

I haven't thought about alternatives but I guess any other way where we can ensure that certain columns aren't getting removed from models as we are certain they are required by downstream systems.

Additional context

NA

Who will this benefit?

This will improve robustness of interfacing DBT models with downstream systems.

Are you interested in contributing this feature?

I would be happy to start looking at this but I'm unfamiliar with the DBT codebase at present so may need some help!

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2021-07-18T17:50:31Z

@louis-vines Thanks for opening, and sorry for the delay in getting back to you. I'm definitely interested in hearing more about what you're thinking here.

There was a good Slack thread about this a few months ago. I've been hesitant in the past to define specific columns in exposures, for fear that it can get quickly out of sync with the BI layer / downstream consumption and require a dbt project PR to update each time. This is also captured in #2835 (comment):

wonder if there's some way to compare the table shapes for things involved in exposures and alert if columns vanish in CI vs the current version.

This is where my head goes as well!

Could we leverage the catalog.json artifact for this? It contains the set of columns in every dbt (non-ephemeral) model. If dbt could compare the catalog from a prod run with the catalog from a CI run—assuming you had a docs generate step in both—it could check all models/sources that are 1+exposure:*, and raise a warning if any columns have been removed, or even if their data type has changed...

That being said, I see the argument in favor of having this as an option, when you're ready to lock in a set of columns that are definitely used downstream.

One way to emulate this functionality today: Create a view/ephemeral model, paired with a one-off test, that selects the set of required columns. The appeal of using a model here is that you can ref() all the required resources in that model, and then the exposure can simply ref() that "exposed" model.

That model also includes one more critical piece: the join criteria between multiple models/sources, if the exposure depends on more than one, since the required columns could be coming from any of them. I suppose another way around this would be to require explicit name-spacing for columns that could exist in both.

So when I think about something like dbt test -m exposure:*, I think about a test that would do just the above. I'm toying with some of the possible syntax; I'm not in love with either of these options just yet:

Explicit namespacing for shared columns. This is probably enough for what we need to test, combined with some clever limit/filter/cross joining.

exposures:
  - name: my_dashboard
    depends_on:
      - ref('orders')
      - ref('customers')
    required_columns:
      - order_id
      - customers.customer_id
      - order_date
      - order_amount

-- test SQL
select order_id, customers.customer_id, order_date, order_amount
from (select * from {{ ref('orders') }} where 1=2 limit 0) orders
cross join (select * from {{ ref('customers') }} where 1=2 limit 0) customers

Define join criteria within the exposure. This could enable other powerful use cases for exposures, e.g. an ERD per exposure, inferred from join criteria and/or unique + relationships tests: Create ERD from relationships in schema.yml files dbt-docs#84. Where would this go? It almost makes sense in depends_on:

exposures:
  - name: my_dashboard
    depends_on:
      - ref('orders')
      - ref('customers'):  # i dont love this
          join: left
          on: orders.order_id = ref('customers').order_id
    required_columns:
      - order_id
      - customer_id
      - order_date
      - order_amount

-- test SQL
select order_id, customers.customer_id, order_date, order_amount
from (select * from {{ ref('orders') }} where 1=2 limit 0) orders
left join (select * from {{ ref('customers') }} where 1=2 limit 0) customers
  on orders.order_id = customers.order_id

Very curious to hear what you're thinking!

dbrtly · 2021-08-03T12:26:25Z

Suggested syntax:

exposures:
  name: customer_orders
  - model: ref('orders')
    relationship:
      - model: ref('customers')
        type: left
        join_on: ref('orders').order_id = ref('customers').order_id
      - model: ref('product')
        type: left
        using: product_id

exposures:
  # columns required in all exposures
  required_columns:
    - _ingested_at
    - _curated_at
  # columns required in ‘jaffle_shop’ exposure
  # three required columns in total here
  jaffle_shop:
    required_columns:
       - customer_id

louis-vines added enhancement New feature or request triage labels Jul 5, 2021

jtcohen6 removed the triage label Jul 18, 2021

jtcohen6 added the discussion label Nov 18, 2021

jtcohen6 mentioned this issue Apr 19, 2022

[Feature] dbt should know about joins #4125

Closed

1 task

dbt-labs locked and limited conversation to collaborators Apr 19, 2022

jtcohen6 converted this issue into discussion #5102 Apr 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

Include required columns in exposures? #3533

Include required columns in exposures? #3533

louis-vines commented Jul 5, 2021 •

edited

Loading

jtcohen6 commented Jul 18, 2021

dbrtly commented Aug 3, 2021 •

edited

Loading

This issue was moved to a discussion.

This issue was moved to a discussion.

Include required columns in exposures? #3533

Include required columns in exposures? #3533

Comments

louis-vines commented Jul 5, 2021 • edited Loading

Describe the feature

Describe alternatives you've considered

Additional context

Who will this benefit?

Are you interested in contributing this feature?

jtcohen6 commented Jul 18, 2021

dbrtly commented Aug 3, 2021 • edited Loading

This issue was moved to a discussion.

louis-vines commented Jul 5, 2021 •

edited

Loading

dbrtly commented Aug 3, 2021 •

edited

Loading