Welcome to the repository for the Databricks 1:M Delta Live Tables Workshop!
This repository contains the notebooks used in the workshop to demonstrate how Delta Live Tables (DLT) can be used to build simple, scalable, production-ready pipelines with built-in data quality controls and monitoring, pipeline logging, data lineage tracking, automated pipeline orchestration, automatic error handling, advanced autoscaling, and change data capture (CDC), along with advanced data engineering concepts such as window functions and meta-programming.
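As a taste of what the notebooks cover, here is a minimal sketch of a DLT table definition with a built-in data quality expectation, ingesting incrementally with Auto Loader. The source path and column name are hypothetical placeholders, not taken from the workshop notebooks:

```python
import dlt

# Bronze table: incrementally ingest raw JSON files with Auto Loader.
# The path and the device_id column are hypothetical placeholders.
@dlt.table(comment="Raw IoT events ingested incrementally from cloud storage")
@dlt.expect_or_drop("valid_device_id", "device_id IS NOT NULL")  # drop rows that fail the check
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")    # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/iot-events")             # hypothetical source path
    )
```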
See the links below for more documentation:
- How to Process IoT Device JSON Data Using Apache Spark Datasets and DataFrames
- Spark Structured Streaming
- Beyond Lambda
- Delta Lake Docs
- Medallion Architecture
- Cost Savings with the Medallion Architecture
- Change Data Capture Streams with the Medallion Architecture
The workshop consists of five interactive sections, each covered by a notebook in the notebooks folder of this repository. The notebooks are run sequentially as we explore the capabilities of the lakehouse, from data ingestion and data curation to performance optimization.
Notebook | Summary |
---|---|
01 - Structured Streaming with Databricks Delta Tables | Processing and ingesting data at scale using Databricks tunables and the medallion architecture |
02 - Orchestrating with Delta Live Tables | Changing Spark properties, configuring table properties, optimizing tables, and combining batch and incremental tables |
03 - Implement CDC in a DLT Pipeline: Change Data Capture (Python) | Implementing change data capture in DLT pipelines for access to fresh data (see the CDC sketch below) |
04 - Meta-programming | Examples of meta-programming in DLT: when to use it, what problems it solves, and how to configure it (see the meta-programming sketch below) |
05 - ML Models in DLT Pipelines | Example of integrating ML models with DLT pipelines |
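For notebook 03, the core pattern is DLT's `apply_changes` API. Below is a minimal sketch assuming a hypothetical CDC feed named `cdc_raw` with `id` and `sequence_ts` columns; the names are placeholders, not the workshop's actual schema:

```python
import dlt
from pyspark.sql.functions import col

# Target streaming table that apply_changes() will keep in sync.
dlt.create_streaming_table("customers_silver")

# Apply inserts/updates/deletes from a CDC feed into the target.
# Source name, key, and sequencing column are hypothetical placeholders.
dlt.apply_changes(
    target="customers_silver",
    source="cdc_raw",                # upstream pipeline table carrying CDC rows
    keys=["id"],                     # primary key used to match rows
    sequence_by=col("sequence_ts"),  # orders events so late data is applied correctly
    stored_as_scd_type=1,            # keep only the latest value per key
)
```

Notebook 04's meta-programming idea is to generate many similar tables from a single definition. A minimal sketch, assuming a hypothetical list of regions and a `sales_raw` table defined elsewhere in the same pipeline:

```python
import dlt

# Hypothetical configuration: one DLT table is generated per region.
REGIONS = ["us", "eu", "apac"]

def create_region_table(region):
    # Use a closure so each generated table captures its own region value.
    @dlt.table(name=f"sales_{region}", comment=f"Sales filtered to region {region}")
    def sales_by_region():
        return dlt.read("sales_raw").where(f"region = '{region}'")

for region in REGIONS:
    create_region_table(region)
```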
This workshop requires a running Databricks workspace. If you are an existing Databricks customer, you can use your existing Databricks workspace; the notebooks have also been tested on Databricks Community Edition.
The features used in this workshop require DBR 9.1 LTS+.
If you have Repos enabled on your Databricks workspace, you can import this repo directly and run the notebooks as-is, skipping the DBC archive step.
Otherwise, download the DBC archive from the releases page and import the archive into your Databricks workspace.