This repo contains a demo project suited to leveraging Datafold:
- dbt project that includes
  - raw data (implemented via seed CSV files) from a fictional app
  - a few downstream models, as shown in the project DAG below
- several 'master' branches, corresponding to the various supported cloud data platforms
  - `master` - primary master branch, runs in Snowflake
  - `master-databricks` - secondary master branch, runs in Databricks; reset to the `master` branch daily, or manually when needed, via the `branch_replication.yml` workflow
  - `master-bigquery` - secondary master branch, runs in BigQuery; reset to the `master` branch daily, or manually when needed, via the `branch_replication.yml` workflow
  - `master-dremio` - secondary master branch, runs in Dremio; reset to the `master` branch daily, or manually when needed, via the `branch_replication.yml` workflow
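The reset performed for the secondary branches can be sketched as follows. This is a minimal illustration of the logic that the `branch_replication.yml` workflow automates (the exact workflow steps are assumptions), demonstrated against a throwaway local repository:

```shell
# Reset a secondary master branch so it is identical to the primary master.
# Run against a scratch repository so nothing real is touched.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "initial"
git branch -M master
git branch master-databricks
git checkout -q master-databricks
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "platform-only drift"
# The workflow's core step: make the secondary branch match master again.
git reset -q --hard master
# In CI this would be followed by: git push --force origin master-databricks
```

After the reset, the secondary branch points at the same commit as `master`, discarding any platform-only drift.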
- several GitHub Actions workflows illustrating CI/CD best practices for dbt Core
  - dbt PR job - triggered on PRs targeting the `master` branch, runs the dbt project in Snowflake
  - dbt prod - triggered on pushes to the `master` branch, runs the dbt project in Snowflake
  - dbt PR job (Databricks) - triggered on PRs targeting the `master-databricks` branch, runs the dbt project in Databricks
  - dbt prod (Databricks) - triggered on pushes to the `master-databricks` branch, runs the dbt project in Databricks
  - dbt PR job (BigQuery) - triggered on PRs targeting the `master-bigquery` branch, runs the dbt project in BigQuery
  - dbt prod (BigQuery) - triggered on pushes to the `master-bigquery` branch, runs the dbt project in BigQuery
  - dbt PR job (Dremio) - triggered on PRs targeting the `master-dremio` branch, runs the dbt project in Dremio
  - dbt prod (Dremio) - triggered on pushes to the `master-dremio` branch, runs the dbt project in Dremio
  - Apply monitors.yaml configuration to Datafold app - applies the monitors-as-code configuration to the Datafold application
- raw data generation tool to simulate a data flow typical of real-world projects
All actual changes should be committed to the `master` branch; the other `master-...` branches are reset to the `master` branch daily.
! To ensure the integrity and isolation of GitHub Actions workflows, it is advisable to create pull requests (PRs) for different 'master' branches from distinct commits. This practice helps prevent cross-PR leakage and ensures that workflows run independently.
To demonstrate the Datafold experience in CI on Snowflake, create PRs targeting the `master` branch.
- production schema in Snowflake: `demo.core`
- PR schemas: `demo.pr_num_<pr_number>`

To demonstrate the Datafold experience in CI on Databricks, create PRs targeting the `master-databricks` branch.
- production schema in Databricks: `demo.default`
- PR schemas: `demo.pr_num_<pr_number>`

To demonstrate the Datafold experience in CI on BigQuery, create PRs targeting the `master-bigquery` branch.
- production schema in BigQuery: `datafold-demo-429713.prod`
- PR schemas: `datafold-demo-429713.pr_num_<pr_number>`

To demonstrate the Datafold experience in CI on Dremio, create PRs targeting the `master-dremio` branch.
- production schema in Dremio: `"Alexey S3".alexeydremiobucket.prod`
- PR schemas: `"Alexey S3".alexeydremiobucket.pr_num_<pr_number>`
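Across all four platforms the PR schema name follows the same pattern; a trivial sketch of how a CI job might derive it (the variable names here are illustrative, not taken from the actual workflows):

```shell
# Derive the per-PR schema name used by the CI jobs: the pr_num_ prefix
# plus the pull request number.
pr_number=42
pr_schema="pr_num_${pr_number}"
echo "$pr_schema"   # -> pr_num_42
```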
To demonstrate Datafold functionality for data replication monitoring, a pre-configured Postgres instance (simulating a transactional database) is populated with 'correct raw data' (the `analytics.data_source.subscription_created` table), while the `subscription__created` seed CSV file contains 'corrupted raw data'.
- Looker view, explore, and dashboard are connected to the `fct__monthly__financials` model in Snowflake, Databricks, and BigQuery.
  - Snowflake
    - `fct__monthly__financials` view
    - `fct__monthly__financials` explore
    - Monthly Financials (Demo, Snowflake) dashboard
  - Databricks
    - `fct__monthly__financials_databricks` view
    - `fct__monthly__financials_databricks` explore
    - Monthly Financials (Demo, Databricks) dashboard
  - BigQuery
    - `fct__monthly__financials_bigquery` view
    - `fct__monthly__financials_bigquery` explore
    - Monthly Financials (Demo, BigQuery) dashboard
The corresponding Datafold Demo Org contains the following integrations:
- Common
  - `datafold/demo` repository integration
  - `Postgres` data connection for Cross-DB data diff monitors
  - `Looker Public Demo` BI app integration
- Snowflake specific
  - `Snowflake` data connection
  - `Coalesce-Demo` CI integration for the `Snowflake` data connection and the `master` branch
- Databricks specific
  - `Databricks-Demo` data connection
  - `Coalesce-Demo-Databricks` CI integration for the `Databricks-Demo` data connection and the `master-databricks` branch
- BigQuery specific
  - `BigQuery - Demo` data connection
  - `Coalesce-Demo-BigQuery` CI integration for the `BigQuery - Demo` data connection and the `master-bigquery` branch
- Dremio specific
  - `Dremio-Demo` data connection
  - `Coalesce-Demo-Dremio` CI integration for the `Dremio-Demo` data connection and the `master-dremio` branch
To get up and running with this project:

1. Install dbt using these instructions.
2. Fork this repository.
3. Set up a profile called `demo` to connect to a data warehouse by following these instructions. You'll need `dev` and `prod` targets in your profile.
4. Ensure your profile is set up correctly from the command line:
   ```
   $ dbt debug
   ```
5. Create your `prod` models:
   ```
   $ dbt build --profile demo --target prod
   ```

With `prod` models created, you're clear to develop and diff changes between your `dev` and `prod` targets.

Follow the quickstart guide to integrate this project with Datafold.
- `datagen/feature_used_broken.csv` - copied to `seeds/feature__used.csv`
- `datagen/feature_used.csv`
- `datagen/org_created_broken.csv` - copied to `seeds/org__created.csv`
- `datagen/org_created.csv`
- `datagen/signed_in_broken.csv` - copied to `seeds/signed__in.csv`
- `datagen/signed_in.csv`
- `datagen/subscription_created_broken.csv` - copied to `seeds/subscription__created.csv`
- `datagen/subscription_created.csv` - pushed to Postgres (`analytics.data_source.subscription_created` table)
- `datagen/user_created_broken.csv` - copied to `seeds/user__created.csv`
- `datagen/user_created.csv`
- `datagen/persons_pool.csv` - pool of persons used for user/org generation
- `datagen/data_generate.py` - main data generation script
- `datagen/data_to_postgres.sh` - pushes generated data to Postgres
- `datagen/persons_pool_replenish.py` - replenishes the pool of persons using ChatGPT
- `datagen/data_delete.sh` - deletes data for further re-generation
- `datagen/dremio__upload_seeds.py` - uploads seed files to Dremio (due to limitations in the standard dbt-dremio connector)
- zero or negative prices in the `subscription__created` seed
- corrupted emails in the `user__created` seed (user$somecompany.com)
- irregular spikes in the workday seasonal daily number of sign-ins in the `signed__in` seed
- `null` spikes in the `feature__used` seed
- schema change: a 'wandering' column appears roughly weekly in the `signed__in` seed
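As an illustration of the email corruption above, the valid `@` separator is replaced with `$`. This sed one-liner is only a sketch of the effect; the actual corruption is produced by `datagen/data_generate.py`:

```shell
# Turn a valid email into the corrupted form seen in the user__created seed:
# the '@' separator is replaced with '$'.
corrupted=$(echo "user@somecompany.com" | sed 's/@/$/')
echo "$corrupted"   # -> user$somecompany.com
```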
- PR job fails when the 2nd commit is pushed to a PR branch targeting Databricks. Most likely related to: databricks/dbt-databricks#691.