Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Include required columns in exposures? #3533

Closed
louis-vines opened this issue Jul 5, 2021 · 2 comments
Closed

Include required columns in exposures? #3533

louis-vines opened this issue Jul 5, 2021 · 2 comments
Labels
discussion enhancement New feature or request

Comments

@louis-vines
Copy link

louis-vines commented Jul 5, 2021

Describe the feature

I'm not sure what other people think of this but I think it would be nice feature to be able define required columns in exposures? This would mean that the exposure definitions would be similar to API definitions and thus we could make our DBT runs fail if we remove a column which is required by an exposure. Feeback on this idea welcome...

Describe alternatives you've considered

I haven't thought about alternatives but I guess any other way where we can ensure that certain columns aren't getting removed from models as we are certain they are required by downstream systems.

Additional context

NA

Who will this benefit?

This will improve robustness of interfacing DBT models with downstream systems.

Are you interested in contributing this feature?

I would be happy to start looking at this but I'm unfamiliar with the DBT codebase at present so may need some help!

@louis-vines louis-vines added enhancement New feature or request triage labels Jul 5, 2021
@jtcohen6
Copy link
Contributor

@louis-vines Thanks for opening, and sorry for the delay in getting back to you. I'm definitely interested in hearing more about what you're thinking here.

There was a good Slack thread about this a few months ago. I've been hesitant in the past to define specific columns in exposures, for fear that it can get quickly out of sync with the BI layer / downstream consumption and require a dbt project PR to update each time. This is also captured in #2835 (comment):

wonder if there's some way to compare the table shapes for things involved in exposures and alert if columns vanish in CI vs the current version.

This is where my head goes as well!

Could we leverage the catalog.json artifact for this? It contains the set of columns in every dbt (non-ephemeral) model. If dbt could compare the catalog from a prod run with the catalog from a CI run—assuming you had a docs generate step in both—it could check all models/sources that are 1+exposure:*, and raise a warning if any columns have been removed, or even if their data type has changed...

That being said, I see the argument in favor of having this as an option, when you're ready to lock in a set of columns that are definitely used downstream.

One way to emulate this functionality today: Create a view/ephemeral model, paired with a one-off test, that selects the set of required columns. The appeal of using a model here is that you can ref() all the required resources in that model, and then the exposure can simply ref() that "exposed" model.

That model also includes one more critical piece: the join criteria between multiple models/sources, if the exposure depends on more than one, since the required columns could be coming from any of them. I suppose another way around this would be to require explicit name-spacing for columns that could exist in both.

So when I think about something like dbt test -m exposure:*, I think about a test that would do just the above. I'm toying with some of the possible syntax; I'm not in love with either of these options just yet:

  1. Explicit namespacing for shared columns. This is probably enough for what we need to test, combined with some clever limit/filter/cross joining.
exposures:
  - name: my_dashboard
    depends_on:
      - ref('orders')
      - ref('customers')
    required_columns:
      - order_id
      - customers.customer_id
      - order_date
      - order_amount
-- test SQL
select order_id, customers.customer_id, order_date, order_amount
from (select * from {{ ref('orders') }} where 1=2 limit 0) orders
cross join (select * from {{ ref('customers') }} where 1=2 limit 0) customers
  1. Define join criteria within the exposure. This could enable other powerful use cases for exposures, e.g. an ERD per exposure, inferred from join criteria and/or unique + relationships tests: Create ERD from relationships in schema.yml files dbt-docs#84. Where would this go? It almost makes sense in depends_on:
exposures:
  - name: my_dashboard
    depends_on:
      - ref('orders')
      - ref('customers'):  # i dont love this
          join: left
          on: orders.order_id = ref('customers').order_id
    required_columns:
      - order_id
      - customer_id
      - order_date
      - order_amount
-- test SQL
select order_id, customers.customer_id, order_date, order_amount
from (select * from {{ ref('orders') }} where 1=2 limit 0) orders
left join (select * from {{ ref('customers') }} where 1=2 limit 0) customers
  on orders.order_id = customers.order_id

Very curious to hear what you're thinking!

@jtcohen6 jtcohen6 removed the triage label Jul 18, 2021
@dbrtly
Copy link
Contributor

dbrtly commented Aug 3, 2021

Suggested syntax:

exposures:
  name: customer_orders
  - model: ref('orders')
    relationship:
      - model: ref('customers')
        type: left
        join_on: ref('orders').order_id = ref('customers').order_id
      - model: ref('product')
        type: left
        using: product_id
exposures:
  # columns required in all exposures
  required_columns:
    - _ingested_at
    - _curated_at
  # columns required in ‘jaffle_shop’ exposure
  # three required columns in total here
  jaffle_shop:
    required_columns:
       - customer_id

@dbt-labs dbt-labs locked and limited conversation to collaborators Apr 19, 2022
@jtcohen6 jtcohen6 converted this issue into discussion #5102 Apr 19, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
discussion enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants