Home
With the rise of Modern Data Platforms, much time has been spent discussing tech stacks and new approaches such as the data lakehouse, data mesh, and Reverse ETL.
However, even with a diverse ecosystem of tools and methodologies, data pipelines remain a partially solved problem - one that should have been solved by now! Data Engineers still struggle to integrate different tools and to avoid reinventing the wheel each time a new data pipeline is required.
With that in mind, we at Dotz Inc. developed Debussy: a low-code data engineering framework that aims to be easy to use and extensible, while following the best practices of OOP and of each underlying tool.
Debussy is an opinionated framework for data architecture and engineering, following a low-code approach. It currently uses Apache Airflow as its underlying orchestration tool, but its pluggable architecture allows it to build on other proven projects such as Luigi, Prefect, or Dagster.
Furthermore, its pluggable pipeline architecture, which builds on Airflow's extensive providers library, supports distributed data processing frameworks (e.g. Apache Spark, Apache Beam), cloud storage solutions (e.g. S3, GCS), and cloud data warehouses (e.g. BigQuery), among other integrations (e.g. REST APIs, FTP/SFTP, etc.).
We currently support only Google Cloud-based deployments, offering two types of pipelines:
- Data Ingestion
- Reverse ETL
Debussy evolved out of data engineering work that Dotz's Big Data team began in 2019. In 2021, Debussy continued to evolve and mature, with new features and an extensive refactoring. Today it is used every day by all data teams at Dotz Inc. This has resulted in a dramatic reduction in time to delivery and in costs for data engineering projects, while ensuring we always follow data architecture and software engineering best practices.
The name is a tribute to Claude Debussy, the famous French composer, regarded as the first Impressionist composer and among the most influential composers of the late 19th and early 20th centuries. The framework acts as a composer, automatically generating (composing) our data pipelines (aka compositions) from templates, hence its name!
The following links provide more context around Debussy and the challenges that it attempts to address:
- Debussy Concert: a data engineering framework with code generation for orchestration tools - currently Airflow only, with others on the roadmap.
- Debussy Framework (aka the Debussy Airflow plugin): an Airflow library with custom hooks, sensors, and operators.
- Debussy CLI: an application responsible for parsing YAML config files and generating the corresponding artifacts (data pipeline code, database tables, etc.).
Debussy works by:
- Providing a semantic model to abstract the development of data pipeline templates across distinct tools.
- Following DRY (Don't Repeat Yourself): avoiding unnecessary and redundant code every time a new data pipeline is required.
The semantic model borrows its vocabulary from music, defining a hierarchy of building blocks (see the sketch after this list):
- Composition: a complete data pipeline – a DAG in Airflow.
- Movement: a major part of a data pipeline (e.g. the ingestion of a table), implemented as a TaskGroup that acts as the "parent" of the whole group.
- Phrase: a logical unit within a Movement (e.g. SodaDataQuality), also defined as a TaskGroup.
- Motif: an operator, or a small set of cohesive operators, that together execute a single operation (e.g. CreateDataprocClusterMotif).
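To make the hierarchy concrete, here is a minimal sketch of how these concepts map onto plain Airflow 2.x primitives. It is illustrative only and does not use Debussy's actual classes or API; all identifiers (example_composition, ingest_my_table_movement, data_quality_phrase, run_checks_motif) are hypothetical.

```python
# Illustrative sketch only: mapping Debussy's musical vocabulary onto plain
# Airflow primitives. None of these names come from the Debussy codebase.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow >= 2.3
from airflow.utils.task_group import TaskGroup

# Composition: the complete data pipeline, i.e. an Airflow DAG.
with DAG(
    dag_id="example_composition",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Movement: a major part of the pipeline (e.g. ingesting one table),
    # represented as a TaskGroup that parents its Phrases.
    with TaskGroup(group_id="ingest_my_table_movement") as movement:
        # Phrase: a logical unit inside the Movement, also a TaskGroup.
        with TaskGroup(group_id="data_quality_phrase") as phrase:
            # Motif: an operator (or a few cohesive operators) executing a
            # single operation; EmptyOperator stands in for a real operator
            # such as a Dataproc cluster creation task.
            run_checks_motif = EmptyOperator(task_id="run_checks_motif")
```

In Debussy itself these building blocks are composed from reusable templates rather than hand-written DAG files, which is what enables the low-code, DRY workflow described above.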