04 Data Factory Pipelines

Benny Austin edited this page Nov 10, 2024 · 6 revisions

The accelerator includes pre-built data factory pipelines for both ingestion and transformation. To deploy them, update the ELT Framework metadata and the connection strings.

Ingestion Pipelines - Source to Bronze

Ingestion pipelines transfer data as-is from source systems to the OneLake bronze layer. Our design philosophy encourages the creation of generic data factory pipelines for various data sources, which can be reused and scaled using metadata from the ELT Framework.

[a] SQL Server Ingestion Pipelines

This generic pipeline connects to a SQL Server instance and ingests data from tables, views, and queries, supporting both full and delta loads. The SQL Server instance can be either cloud-based or on-premises. Data is stored in the OneLake bronze layer as Parquet files. The pipeline uses metadata from the ELT Framework to manage the source queries for full and incremental loads, as well as the file, folder, and partition layout of the Parquet files in the OneLake bronze layer. It also creates the metadata required by the subsequent transformation pipelines.
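To make the metadata-driven idea concrete, here is a minimal Python sketch of how one metadata row could determine the source query and the bronze folder path. It is not the accelerator's implementation: the `IngestConfig` fields, the watermark-based delta filter, and the date-partitioned folder layout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class IngestConfig:
    """Hypothetical metadata row; the field names are illustrative only."""
    source_object: str                      # table or view name
    load_type: str                          # "full" or "delta"
    watermark_column: Optional[str] = None  # column used to detect changes
    last_watermark: Optional[str] = None    # highest value already ingested


def build_source_query(cfg: IngestConfig) -> str:
    """Return the SELECT statement for a full or delta load."""
    base = f"SELECT * FROM {cfg.source_object}"
    if cfg.load_type == "full":
        return base
    if cfg.load_type == "delta":
        if not cfg.watermark_column or cfg.last_watermark is None:
            raise ValueError("delta loads need a watermark column and value")
        # Sketch only: a real pipeline should pass the watermark as a parameter.
        return f"{base} WHERE {cfg.watermark_column} > '{cfg.last_watermark}'"
    raise ValueError(f"unknown load type: {cfg.load_type}")


def bronze_path(source_system: str, table: str, run_date: date) -> str:
    """Build a date-partitioned path for the Parquet file (assumed layout)."""
    return f"bronze/{source_system}/{table}/{run_date:%Y/%m/%d}/{table}.parquet"
```

Because the query and the path are both derived from metadata, onboarding another table in this sketch means adding a row rather than building a new pipeline.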

[b] File Ingestion Pipelines

This generic pipeline ingests data from files dropped in OneLake or any HDFS-compliant storage. This feature is planned for a future release.

[c] REST API Ingestion Pipelines

This generic pipeline ingests data from REST APIs, landing the data as JSON in the OneLake bronze layer. This feature is planned for a future release.


Transformation Pipelines

Transformation pipelines process data from the OneLake bronze layer by cleansing, enriching, and curating it and by applying business rules. Level 1 transformations make the data available in the OneLake silver layer, and Level 2 transformations make it available in the gold layer.

Level 1 transformation pipelines are reusable data factory pipelines that take data from the OneLake bronze layer, cleanse, enrich, and curate it, and land it in the OneLake silver layer while maintaining data granularity. Only one instance of these pipelines is needed in your data platform. Spark notebooks handle the heavy-lifting data transformations and can be switched in and out of this pipeline using metadata from the ELT Framework. This accelerator includes pre-built Spark notebooks that can be used in this pipeline with minimal configuration changes. You can also add new Spark notebooks and orchestrate them through the ELT Framework configuration.
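The "switch in and out" behaviour can be pictured as a lookup from a metadata value to the transformation to run. The sketch below uses plain Python functions as stand-ins for Spark notebooks; the registry keys, the `notebook_name` field, and both transformations are hypothetical and only illustrate the dispatch pattern.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Transform = Callable[[List[Row]], List[Row]]


def trim_strings(rows: List[Row]) -> List[Row]:
    """Strip surrounding whitespace from every string value."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def drop_duplicates(rows: List[Row]) -> List[Row]:
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Stand-in for the set of notebooks available to the pipeline.
NOTEBOOKS: Dict[str, Transform] = {
    "trim_strings": trim_strings,
    "drop_duplicates": drop_duplicates,
}


def run_level1(metadata: Dict[str, str], rows: List[Row]) -> List[Row]:
    """Pick the transformation named in the metadata and apply it."""
    transform = NOTEBOOKS[metadata["notebook_name"]]
    return transform(rows)
```

In this sketch, changing the `notebook_name` value changes which transformation runs without touching `run_level1`, which is why a single pipeline instance can serve every table.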

Level 2 transformation pipelines are reusable data factory pipelines that take data from the OneLake silver layer, and occasionally from the bronze layer, and apply custom business rules to land the data in the OneLake gold layer. The original data granularity is usually lost in this layer because of the business rules applied. Typical transformations include aggregation, consolidation, merging data from different source systems, snapshots, and star schemas. Only one instance of these pipelines is needed in your data platform. Either Spark notebooks or stored procedures handle the heavy-lifting business rule transformations and can be switched in and out of this pipeline using metadata from the ELT Framework.
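As a small illustration of why granularity is usually lost in the gold layer, the following Python sketch aggregates order lines into one total per customer. The column names (`customer_id`, `quantity`, `unit_price`) are assumptions, and a real Level 2 transformation would run in a Spark notebook or a stored procedure rather than in plain Python.

```python
from collections import defaultdict
from typing import Dict, List


def total_sales_by_customer(order_lines: List[dict]) -> Dict[int, float]:
    """Collapse line-level rows into one sales total per customer."""
    totals: Dict[int, float] = defaultdict(float)
    for line in order_lines:
        totals[line["customer_id"]] += line["quantity"] * line["unit_price"]
    return dict(totals)
```

The output has one entry per customer, so the individual order lines can no longer be recovered from it.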

Utility Pipelines

This utility pipeline serves as a useful reference for populating ELT Framework metadata. In this instance, it prepares metadata for all tables in the Wide World Importers (WWI) SQL database.
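Populating metadata for every table in a database amounts to generating one configuration row per table. The Python sketch below shows that idea only; the row fields (`ingest_id`, `schema_name`, `load_type`, and so on) are hypothetical and do not reflect the actual ELT Framework schema.

```python
from typing import Dict, List


def build_metadata_rows(
    source_system: str,
    tables: List[str],
    default_load_type: str = "full",
) -> List[Dict[str, object]]:
    """Create one hypothetical metadata row per schema-qualified table name."""
    rows = []
    for ingest_id, qualified_name in enumerate(tables, start=1):
        schema_name, table_name = qualified_name.split(".", 1)
        rows.append(
            {
                "ingest_id": ingest_id,
                "source_system": source_system,
                "schema_name": schema_name,
                "table_name": table_name,
                "load_type": default_load_type,
                "active": True,
            }
        )
    return rows
```

For example, `build_metadata_rows("WWI", ["Sales.Orders", "Warehouse.StockItems"])` yields two rows, one per table, each defaulting to a full load.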