In this project, I build a real-time data pipeline: data is streamed into Snowflake and then transformed with dbt.
Due to hardware limitations on my machine, Kafka and Airflow are built together in the same Docker Compose stack. First, move to the "airflow" directory and run:
docker-compose up -d --build
Once this finishes, Kafka and Airflow are running in Docker. Next, set up the required dependencies in Snowflake:
- Move to the "scripts" folder and run: python run_scripts.py
Now let's run Kafka and Airflow to send data to Snowflake and transform it with dbt:
Run Apache Kafka:
- Move to the "airflow" directory and run: docker-compose up
- Move to the "spark streaming/connection" directory and run: python consumer_bank.py
- Move to the "kafka/connection" directory and run: python producer_bank.py
Run Apache Airflow:
- Move to the "airflow" directory and run: docker-compose up
(Figure: data pipeline for my project)
(Figure: dbt lineage graph)
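As a hedged illustration of how Airflow can hand off to dbt, the DAG below runs dbt run and dbt test with Airflow's BashOperator; the DAG id, schedule, and the /opt/airflow/dbt paths are placeholder assumptions, not the project's actual configuration:

```python
# Hypothetical sketch of an Airflow DAG that triggers dbt (not the project's actual DAG).
# Assumes Airflow 2.x and a dbt project mounted at /opt/airflow/dbt inside the container.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bank_pipeline",         # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",    # placeholder schedule
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/airflow/dbt --profiles-dir /opt/airflow/dbt",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/airflow/dbt --profiles-dir /opt/airflow/dbt",
    )

    dbt_run >> dbt_test  # run the models first, then test them
```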
Basically, in this project I want to focus mainly on using dbt for data transformation, because dbt is increasingly becoming a powerful tool for processing data with SQL.