Data Engineering Projects

Data engineering projects on data modelling, data warehousing, data lake development, orchestration and analysis.

Building a Data Lake using Amazon S3 and EMR

Problem Statement:

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Their user base and song database have grown large and want to move their data warehouse to a data lake.

Sparkify user base and song database has grown and want to move their processes and data onto the cloud.

The goal of this project is to develop a data lake and ETL process for song play analysis.

Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. ETL pipeline has to be built that extracts data from S3, processes them using Spark, and loads the data back into S3 as a set of dimensional tables in parquet files. This will allow their analytics team to continue finding insights in what songs their users are listening to.

ETL pipelines that transfers data from files in json format to Amazon S3 are to be developed using python and deploy on a cluster built using Amazon EMR.

Go to project

Building a Data Warehousing with Amazon Redshift

Problem Statement:

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Their analytics team is particularly interested in understanding what songs users are listening to.

Sparkify user base and song database has grown and want to move their processes and data onto the cloud.

Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. ETL pipeline has to be built that extracts data from S3, stages them in Redshift, and transforms data into a set of dimensional tables for their analytics team to finding insights.

Staging tables, facts and dimension tables are to be defined in star schema using Amazon Redshift.

ETL pipelines that transfers data from files in json format to Amazon Redshift database are to be developed using python.

Go to project

Data Modelling with Postgres

Problem Statement:

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Their analytics team is particularly interested in understanding what songs users are listening to.

The goal of this project is to develop a data model and ETL process for song play analysis

Data modelling is to be done based on the raw data available in json format. Facts and dimension tables are to be defined in star schema using Postgres database

ETL pipelines that transfers data from files in json format to Postgres database are to be developed using python

Go to project

Data Modelling with Apache Cassandra

Problem Statement:

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Their analytics team is particularly interested in understanding what songs users are listening to.

Currently, for Sparkify there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.

The goal of this project is to create an Apache Cassandra database which can create queries on song play data analysis.

Data modelling is to be done based on the raw data available in CSV format.

ETL pipelines that transfers data from CSV files to Cassandra database are to be developed using python.

Go to project

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
DataEngineeringPractice		DataEngineeringPractice
DataLakesWithAmazonS3andEMR		DataLakesWithAmazonS3andEMR
DataModelingWithPostgres		DataModelingWithPostgres
DataModellingWithCassandra		DataModellingWithCassandra
DataWarehousingWithAmazonRedshift		DataWarehousingWithAmazonRedshift
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Engineering Projects

Building a Data Lake using Amazon S3 and EMR

Building a Data Warehousing with Amazon Redshift

Data Modelling with Postgres

Data Modelling with Apache Cassandra

About

Releases

Packages

Languages

RangaAmirapu/DataEngineeringProjects

Folders and files

Latest commit

History

Repository files navigation

Data Engineering Projects

Building a Data Lake using Amazon S3 and EMR

Building a Data Warehousing with Amazon Redshift

Data Modelling with Postgres

Data Modelling with Apache Cassandra

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages