This repo contains a data pipeline project built to load CSV data from an AWS S3 bucket into a Redshift cluster. The project was done in two ways:
- Using only Python to build an ETL pipeline that takes the CSV files from S3 to Redshift.
- Using Airflow to create the ETL pipeline.
- I built the ETL pipeline with Airflow using Python. I defined a star schema data model that consists of four dimension tables and a fact table focused on a particular analytical need. The ETL loads data from an S3 bucket into a Redshift data warehouse for analytics (a sketch of the standalone Python load follows).
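
In both versions, the core of the load is a Redshift COPY command that pulls the CSV files straight from S3. The following is only a minimal sketch of the standalone Python approach; the section and key names in dwh.cfg, the table name, and the S3 key prefix are assumptions, not the repo's exact layout:

```python
import configparser
import psycopg2

# Read cluster, S3, and IAM role details from dwh.cfg
# (section and key names here are assumptions, not the repo's exact layout).
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["host"],
    dbname=config["CLUSTER"]["db_name"],
    user=config["CLUSTER"]["db_user"],
    password=config["CLUSTER"]["db_password"],
    port=config["CLUSTER"]["db_port"],
)
cur = conn.cursor()

# Redshift pulls the CSV directly from S3 using the cluster's IAM role,
# so the script only needs to issue one COPY per table.
copy_sql = """
    COPY {table}
    FROM 's3://{bucket}/{prefix}'
    IAM_ROLE '{role_arn}'
    CSV IGNOREHEADER 1;
"""

cur.execute(copy_sql.format(
    table="staging_sales",               # hypothetical table name
    bucket=config["S3"]["bucket"],
    prefix="pagila/sales/",              # hypothetical key prefix
    role_arn=config["IAM_ROLE"]["arn"],
))
conn.commit()
conn.close()
```

COPY is used instead of row-by-row INSERTs because Redshift parallelises the load across the cluster's slices.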
I used the popular pagila dataset for this project. The star schema consists of the following tables (a sketch of the fact table DDL follows the list):
- Dates
- Customer
- Movie
- Store
- Sales Fact
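
As an illustration of the model, here is a hedged sketch of what the Sales fact table DDL could look like, kept as a Python constant in the style of a table-creation script. The table and column names are assumptions based on the pagila source data, not the repo's actual schema:

```python
# Hypothetical DDL for the fact table; names are guesses based on pagila,
# not the repo's exact definitions.
SALES_FACT_CREATE = """
CREATE TABLE IF NOT EXISTS sales_fact (
    sales_key    BIGINT IDENTITY(0, 1) PRIMARY KEY,
    date_key     INT NOT NULL REFERENCES dim_date(date_key),
    customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
    movie_key    INT NOT NULL REFERENCES dim_movie(movie_key),
    store_key    INT NOT NULL REFERENCES dim_store(store_key),
    sales_amount NUMERIC(10, 2)
);
"""
```

Redshift treats primary and foreign key constraints as informational only; they are kept here to document the star schema relationships between the fact table and the dimensions.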
- The project contains 2 folders and 3 files:
- dags folder: contains the DAG file.
- plugins folder: contains the helper modules and custom operators used to build the Airflow data pipeline.
- create_tables.py: builds the tables in the Redshift data warehouse (a minimal sketch follows this list).
- create_tables.sql: contains the SQL queries that create the tables.
- dwh.cfg: contains the configuration details for the AWS infrastructure.
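
A minimal sketch of what create_tables.py can look like, assuming it reads the connection details from dwh.cfg and executes the statements in create_tables.sql; the config section and key names are assumptions:

```python
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

# Connect to the Redshift cluster described in dwh.cfg.
conn = psycopg2.connect(
    host=config["CLUSTER"]["host"],
    dbname=config["CLUSTER"]["db_name"],
    user=config["CLUSTER"]["db_user"],
    password=config["CLUSTER"]["db_password"],
    port=config["CLUSTER"]["db_port"],
)
cur = conn.cursor()

# Run every statement in create_tables.sql; statements are assumed to be
# separated by semicolons.
with open("create_tables.sql") as f:
    for statement in f.read().split(";"):
        if statement.strip():
            cur.execute(statement)

conn.commit()
conn.close()
```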
- Create an IAM user and take note of its secret key and access key.
- Create an AWS Redshift role: create a Redshift role under IAM.
- Create a Redshift cluster: create a cluster with your preferred configuration details and attach the Redshift IAM role to it.
- Airflow connection: create a new connection in Airflow containing the details of your Redshift connection, including the database info.
- Create a new Airflow variable containing your bucket name.
- Add your configuration details to the dwh.cfg file.
- Run create_tables.py to create the data warehouse tables.
- Go to your Airflow UI and run the DAG (a minimal DAG sketch is shown below).
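
The repo's DAG is built from the custom operators under the plugins folder, but its overall shape is roughly as follows. This sketch uses the standard Amazon provider's S3ToRedshiftOperator instead of the custom operators, and the DAG id, table name, key prefix, connection ids, and variable name are assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

default_args = {
    "owner": "etl",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="pagila_s3_to_redshift",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # One load task per table; only the fact table is shown here.
    load_sales_fact = S3ToRedshiftOperator(
        task_id="load_sales_fact",
        schema="public",
        table="sales_fact",                   # assumed table name
        s3_bucket=Variable.get("s3_bucket"),  # the Airflow variable created above
        s3_key="pagila/sales/",               # assumed key prefix
        redshift_conn_id="redshift",          # the Airflow connection created above
        aws_conn_id="aws_credentials",
        copy_options=["CSV", "IGNOREHEADER 1"],
    )
```

In the repo itself, the equivalent tasks are implemented with the helper modules and custom operators in the plugins folder.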