Udacity-Data-Engineering-Projects

Project 1: Data Modeling with Postgres

A startup called Sparkify wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The analytics team is particularly interested in understanding which songs users are listening to. The aim is to create a Postgres database schema and an ETL pipeline that optimize queries for song play analysis.
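
As a rough illustration of the kind of schema described above, the sketch below uses Python's built-in `sqlite3` module as a stand-in for Postgres so that it runs without a database server. The table names (`songplays`, `songs`, `users`), the columns, and the sample rows are all assumptions made up for this example; they are not taken from the project itself.

```python
import sqlite3

# Hypothetical star schema: one fact table (songplays) and two dimension tables.
SCHEMA = """
CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT, level TEXT);
CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT, duration REAL);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    start_time  TEXT NOT NULL,
    user_id     INTEGER NOT NULL REFERENCES users(user_id),
    song_id     TEXT REFERENCES songs(song_id)
);
"""


def build_demo_db():
    """Create an in-memory database and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users VALUES (1, 'Ada', 'free')")
    conn.execute("INSERT INTO songs VALUES ('S1', 'Example Song', 210.0)")
    conn.executemany(
        "INSERT INTO songplays (start_time, user_id, song_id) VALUES (?, ?, ?)",
        [("2018-11-01T10:00:00", 1, "S1"), ("2018-11-01T10:05:00", 1, "S1")],
    )
    conn.commit()
    return conn


def top_songs(conn):
    """Song play analysis: count plays per song title, most played first."""
    return conn.execute(
        """SELECT s.title, COUNT(*) AS plays
           FROM songplays sp JOIN songs s ON s.song_id = sp.song_id
           GROUP BY s.title ORDER BY plays DESC"""
    ).fetchall()
```

The fact table holds one row per play and points at the dimensions by key, so an analysis query needs only a join and an aggregate.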

Link: Data_Modeling_with_Postgres

Project 2: Data Modeling with Apache Cassandra

In this project, I will apply data modeling with Apache Cassandra and complete an ETL pipeline using Python. I will build the data model around the queries I want answered. For my use case, I want answers to the following:

Get details of a song that was heard in the music app history during a particular session.
Get the songs played by a user during a particular session on the music app.
Get all users in the music app history who listened to a particular song.
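
One way to picture "modeling around the queries" is to give each of the three queries above its own table, with a primary key chosen to match the query's filter. The sketch below only builds CQL `CREATE TABLE` strings and does not connect to Cassandra. The table names, column names, and key choices are hypothetical and may differ from those used in the project.

```python
def create_table_cql(table, columns, partition_key, clustering=()):
    """Build a CQL CREATE TABLE statement.

    columns: dict of column name -> CQL type
    partition_key: columns the query filters on
    clustering: columns that order and distinguish rows within a partition
    """
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
    pk = f"({', '.join(partition_key)})"
    key = ", ".join([pk, *clustering])
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols}, PRIMARY KEY ({key}))"


# Query 1: song details for a given session (hypothetical names).
song_by_session = create_table_cql(
    "song_by_session",
    {"session_id": "int", "item_in_session": "int", "song": "text"},
    partition_key=("session_id",),
    clustering=("item_in_session",),
)

# Query 2: songs played by a user during a given session.
songs_by_user_session = create_table_cql(
    "songs_by_user_session",
    {"user_id": "int", "session_id": "int", "item_in_session": "int", "song": "text"},
    partition_key=("user_id", "session_id"),
    clustering=("item_in_session",),
)

# Query 3: all users who listened to a given song.
users_by_song = create_table_cql(
    "users_by_song",
    {"song": "text", "user_id": "int", "first_name": "text"},
    partition_key=("song",),
    clustering=("user_id",),
)
```

Because Cassandra queries must filter on the partition key, the same event data ends up written to three tables, one per access pattern.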

Link: Data_Modeling_with_Apache_Cassandra

Project 3: Data Warehouse with AWS

This project applies data warehousing on AWS to build an ETL pipeline for a database hosted on Redshift. I will load data from S3 into staging tables on Redshift and execute SQL statements that create fact and dimension tables from these staging tables for analytics.
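
The two steps described above (S3 to staging, then staging to analytics tables) can be sketched as SQL strings assembled in Python. This is a minimal sketch under stated assumptions: the bucket, IAM role ARN, region, and table and column names below are placeholders, not values from the project, and the statements are only built here, not executed.

```python
def copy_from_s3(table, s3_path, iam_role_arn, region, json_option="auto"):
    """Build a Redshift COPY statement that loads JSON files from S3."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"REGION '{region}' "
        f"FORMAT AS JSON '{json_option}';"
    )


# Step 1: load raw events into a staging table (all values are placeholders).
stage_events_sql = copy_from_s3(
    table="staging_events",
    s3_path="s3://example-bucket/log_data",
    iam_role_arn="arn:aws:iam::123456789012:role/exampleRedshiftRole",
    region="us-west-2",
)

# Step 2: populate a dimension table from the staging table (hypothetical columns).
insert_users_sql = (
    "INSERT INTO dim_users (user_id, level) "
    "SELECT DISTINCT user_id, level FROM staging_events "
    "WHERE user_id IS NOT NULL;"
)
```

In practice these strings would be run against the cluster through a Postgres-compatible client, with the connection details read from configuration rather than hard-coded.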

Use Redshift IaC script: Redshift_IaC_README

Link: Data_Warehouse_with_AWS

Project 4: Data Lake with Apache Spark

In this project, I will build a data lake on AWS using Spark and an AWS EMR cluster. The data lake will serve as a single source of truth for the analytics platform. I will write Spark jobs to perform ELT operations that pick up data from the landing zone on S3, transform it, and store it in the processed zone on S3.
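
A landing-zone to processed-zone job of the kind described above might look roughly like the following. The function takes an existing `pyspark.sql.SparkSession` as its `spark` argument, so nothing here starts Spark by itself. The bucket layout, dataset names, and column names are assumptions for illustration only.

```python
def zone_path(bucket, zone, table):
    """Build the S3 path for a table in a given zone (hypothetical layout)."""
    return f"s3a://{bucket}/{zone}/{table}/"


def run_songs_job(spark, bucket):
    """Read raw JSON from the landing zone and write Parquet to the processed zone.

    spark: an existing pyspark.sql.SparkSession
    bucket: name of the S3 bucket that holds both zones
    """
    raw = spark.read.json(zone_path(bucket, "landing", "song_data"))

    # Keep only the columns needed for the songs table; drop repeated songs.
    songs = raw.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])

    # Partitioning by year lets later queries skip files they do not need.
    songs.write.mode("overwrite").partitionBy("year").parquet(
        zone_path(bucket, "processed", "songs")
    )
```

On an EMR cluster such a function would typically be called from a script submitted with `spark-submit`, after creating the session with `SparkSession.builder.getOrCreate()`.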

Link: Data_Lake_with_Spark

Project 5: Data Pipelines with Airflow

In this project, I will orchestrate the data pipeline workflow using Apache Airflow, an open-source Apache project. I will schedule the ETL jobs in Airflow, create project-related custom plugins and operators, and automate the pipeline execution.
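
To give a feel for what orchestration means here, the sketch below models task dependencies with Python's standard `graphlib` module (Python 3.9+). It is a plain-Python stand-in for the dependency graph that Airflow would manage, not Airflow code, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks that must finish first.
# Task names are made up for illustration.
DEPENDENCIES = {
    "stage_events": {"begin"},
    "stage_songs": {"begin"},
    "load_fact": {"stage_events", "stage_songs"},
    "load_dimensions": {"load_fact"},
    "quality_checks": {"load_dimensions"},
}


def execution_order(deps):
    """Return one valid run order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(DEPENDENCIES)
```

A scheduler such as Airflow adds what this sketch leaves out: running tasks on a schedule, retrying failures, and running independent tasks (here, the two staging tasks) in parallel.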

Link: Data_Pipelines_with_Airflow

