Nilton Duarte edited this page Nov 3, 2022 · 12 revisions

Debussy

  1. What Is Debussy?
  2. What Is Debussy For?
  3. How Does Debussy Work?

What Is Debussy?

With the rise of modern data platforms, much time has been spent on discussions about tech stacks and new approaches, such as the data lakehouse, data mesh, and Reverse ETL.

However, even with a diverse ecosystem of tools and methodologies, data pipelines remain a partially solved problem - one that arguably should have been solved by now! Data engineers still struggle to integrate different tools and to avoid reinventing the wheel each time a new data pipeline is required.

With that in mind, we at Dotz Inc. developed Debussy: a low-code data engineering framework that aims to be easy to use and extensible, while following the best practices of object-oriented programming (OOP) and of each underlying tool.

What Is Debussy For?

Debussy is an opinionated framework for data architecture and engineering that follows a low-code approach. Debussy currently uses Apache Airflow as its underlying orchestration tool, but its pluggable architecture allows it to build on other proven projects such as Luigi, Prefect, or Dagster.

Furthermore, its pluggable pipeline architecture, which builds on Airflow's extensive providers library, supports distributed data processing frameworks (e.g., Apache Spark, Apache Beam), cloud storage solutions (e.g., S3, GCS), cloud data warehouses (e.g., BigQuery), and other integrations (e.g., REST APIs, FTP/SFTP).

We currently support only Google Cloud-based deployments, offering three types of pipelines:

  • Data Ingestion
  • Reverse ETL
  • Data Transformation

History of Debussy

Debussy evolved out of data engineering work that Dotz's Big Data team began in 2019. In 2021, Debussy continued to evolve and mature, with new features and an extensive refactoring. Today it's used every day by all data teams at Dotz Inc. This has resulted in a dramatic improvement in time to delivery and costs related to data engineering projects, while ensuring we're always following data architecture and software engineering best practices.

The following links provide more context around Debussy and the challenges that it attempts to address:

Why Debussy?

The name is a tribute to Claude Debussy, the famous French composer, regarded as the first Impressionist composer and among the most influential composers of the late 19th and early 20th centuries. The framework acts as a composer, automatically generating (composing) our data pipelines (aka compositions) from templates, hence its name!

How Does Debussy Work?

Overview

The framework's philosophy is to persist¹ the data after each transformation.

With this philosophy in mind, the framework prioritizes ELT (Extract, Load, Transform) over ETL.

Data ingestion pipelines must implement a step that extracts the source data and persists it in the Raw Vault layer as Parquet files every time the data is loaded (duplicating the data when necessary), without transforming its content. Some adjustments to the data may still be necessary to write it in the Parquet format. The files are then loaded into the raw layer, removing duplicates introduced by reprocessing (the source data itself may still contain duplicates!).
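The kind of light adjustment mentioned above - no content transformation, only what the Parquet format requires - could look like the following sketch. The helper names are hypothetical, not part of Debussy:

```python
import re


def sanitize_column_name(name: str) -> str:
    """Normalize a source column name so it is valid for Parquet/BigQuery:
    lowercased, non-alphanumeric characters replaced with underscores,
    and a leading underscore added if the name starts with a digit."""
    cleaned = re.sub(r"[^0-9a-zA-Z_]", "_", name.strip()).lower()
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def sanitize_record(record: dict) -> dict:
    """Apply column-name sanitization to one extracted row, leaving values untouched."""
    return {sanitize_column_name(key): value for key, value in record.items()}
```

Note that this only touches metadata (column names); the row values themselves pass through unchanged, matching the "no transformations to the content" rule.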

The Reverse ETL pipeline, in turn, must extract the data from the data lakehouse into the Reverse ETL dataset, where all the data must be persisted after the actual transformations. After that, the data must be written to a storage layer in the format closest to what will be sent to the destination. For example, when sending a CSV file to an SFTP server, the CSV file that will be sent must be saved; when calling a REST endpoint, the request body must be saved as JSON; and when the Reverse ETL targets another database, the .sql file with the record-insertion statements must be saved.
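The per-target artifacts described above can be illustrated with a small helper. This is a hypothetical sketch, not Debussy's implementation; the function and target names are made up for illustration:

```python
import csv
import io
import json


def render_payload(records: list, target: str) -> str:
    """Render the exact artifact to persist before delivery: CSV for SFTP
    targets, JSON for REST targets, SQL INSERT statements for databases."""
    if target == "sftp_csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    if target == "rest_json":
        return json.dumps(records)
    if target == "database_sql":
        statements = []
        for rec in records:
            cols = ", ".join(rec)
            vals = ", ".join(repr(v) for v in rec.values())
            statements.append(f"INSERT INTO destination ({cols}) VALUES ({vals});")
        return "\n".join(statements)
    raise ValueError(f"unknown target: {target}")
```

The key idea is that the persisted artifact is byte-for-byte what will be delivered, which makes failed deliveries replayable and auditable.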

The Transformation pipeline is built on dbt (data build tool), through the dbt-bigquery (we currently support only BigQuery as the target data lakehouse) and airflow-dbt packages.
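The airflow-dbt operators (such as DbtRunOperator) essentially wrap calls to the dbt command line. The sketch below builds an equivalent command; it is an illustration of that idea, not Debussy's code, and the function name is hypothetical:

```python
from typing import List, Optional


def build_dbt_run_command(project_dir: str, profiles_dir: str,
                          select: Optional[str] = None) -> List[str]:
    """Build the `dbt run` command line that an orchestrated transformation
    task would execute against the target warehouse (e.g., BigQuery)."""
    cmd = ["dbt", "run", "--project-dir", project_dir,
           "--profiles-dir", profiles_dir]
    if select:
        # Restrict the run to a subset of models via dbt's node selection.
        cmd += ["--select", select]
    return cmd
```

Keeping the transformation step as a thin wrapper over the dbt CLI means the same models run identically from a developer's laptop and from the orchestrator.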

¹ Persisted data can be discarded after some time; for example, after 6 months it can be moved to cheaper storage, and after 2 years it can be removed from a given layer.

Architecture components

  • Debussy Concert: the core data engineering framework, with code generation for orchestration tools. Currently only Airflow is supported; others are on the roadmap.
  • Debussy Airflow: a Python library for Airflow with custom hooks, sensors, and operators. It is a dependency for Debussy Concert deployments based on Airflow.
  • Debussy CLI: an app responsible for generating the YAML config files (environment, composition) and the DAG Python files. Currently under development.

Debussy works by:

  • Providing a semantic model that abstracts the development of data pipeline templates across distinct tools.
  • Applying DRY (Don't Repeat Yourself): avoiding the development of unnecessary, redundant code every time a new data pipeline is required.

Semantic Model

  • Composition: a data pipeline – a DAG in Airflow.
  • Movement: a major part of a data pipeline (e.g., the ingestion of one table). Implemented with TaskGroups, acting as the "parent" of the whole group.
  • Phrase: a logical unit within part of the pipeline (e.g., SodaDataQuality), defined as a TaskGroup.
  • Motif: an operator, or a small set of cohesive operators, that together execute a single operation (e.g., CreateDataprocClusterMotif).
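The four-level hierarchy above can be sketched as nested Python classes. This is illustrative only - the class bodies below are not Debussy Concert's actual definitions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Motif:
    """Smallest unit: one operator (or a small cohesive set) doing one operation."""
    name: str


@dataclass
class Phrase:
    """Logical unit within part of the pipeline, rendered as a TaskGroup."""
    name: str
    motifs: List[Motif] = field(default_factory=list)


@dataclass
class Movement:
    """Major part of a pipeline (e.g., ingestion of one table); parent TaskGroup."""
    name: str
    phrases: List[Phrase] = field(default_factory=list)


@dataclass
class Composition:
    """A whole data pipeline, rendered as one Airflow DAG."""
    name: str
    movements: List[Movement] = field(default_factory=list)

    def motif_count(self) -> int:
        """Total number of motifs, i.e., roughly the number of generated tasks."""
        return sum(len(p.motifs) for m in self.movements for p in m.phrases)
```

A Composition built from these pieces maps naturally onto Airflow's DAG/TaskGroup/operator nesting, which is what lets the framework generate DAGs from templates.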