
ETL_with_Pyspark_-_SparkSQL

A sample project designed to demonstrate ETL process using Pyspark & Spark SQL API in Apache Spark.

In this project I used Apache Spark's PySpark and Spark SQL APIs to implement the ETL process on the data and then load the transformed data to a destination.
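
The snippet below is a minimal sketch of that extract-transform-load flow with PySpark and Spark SQL. The file paths, column names, and the aggregation itself are illustrative assumptions, not the project's actual logic.

```python
# Minimal ETL sketch: extract with PySpark, transform with Spark SQL, load to a destination.
# Paths and column names are placeholders, not the repository's real data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

# Extract: read raw data from a hypothetical source path
raw_df = spark.read.option("header", "true").csv("/mnt/raw/orders.csv")

# Transform: register a temp view and apply a Spark SQL transformation
raw_df.createOrReplaceTempView("orders")
transformed_df = spark.sql("""
    SELECT customer_id,
           COUNT(*)                    AS order_count,
           SUM(CAST(amount AS DOUBLE)) AS total_amount
    FROM orders
    GROUP BY customer_id
""")

# Load: write the transformed data to a hypothetical destination path
transformed_df.write.mode("overwrite").parquet("/mnt/curated/orders_summary")
```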

I have used Azure Databricks to run the notebooks and to create jobs for them. To orchestrate the entire workflow, I have used Azure Data Factory to create the pipelines.

Note: Any resources deployed in Azure have an associated cost. Users are wholly responsible for creating and deploying resources to Azure, and for any charges that are incurred.

-------------------************************-------------------

main_latest branch:

This branch contains the updated code of the main project that lives in the main_old branch.

New implementations/changes:

Compared with the code in the main_old branch, the number of notebooks and the number of lines of code have been reduced. The goal is to automate the entire ETL process with a single generic notebook that performs the transformations on the data, as sketched below.
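
Here is a sketch of how such a generic notebook can be parameterized in Azure Databricks. The widget names, formats, and paths are assumptions for illustration only; the actual parameters used by this project may differ. `dbutils` is available only inside a Databricks notebook.

```python
# Sketch of a single generic transformation notebook driven by parameters.
# Widget names and paths are hypothetical; dbutils is provided by Databricks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parameters supplied by a Databricks job or an Azure Data Factory pipeline run
dbutils.widgets.text("source_path", "")
dbutils.widgets.text("destination_path", "")
dbutils.widgets.text("transform_sql", "SELECT * FROM source_view")

source_path = dbutils.widgets.get("source_path")
destination_path = dbutils.widgets.get("destination_path")
transform_sql = dbutils.widgets.get("transform_sql")

# Extract the source data, apply the supplied SQL transformation, and load the result
df = spark.read.format("delta").load(source_path)
df.createOrReplaceTempView("source_view")
spark.sql(transform_sql).write.mode("overwrite").format("delta").save(destination_path)
```

Because the source, destination, and transformation SQL all arrive as parameters, the same notebook can be reused for every dataset in the pipeline instead of maintaining one notebook per table.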

I will be updating this README soon with links to the Medium post and YouTube video where I explain the changes made to the old notebooks/code.
