
Spark-ETL-Data-Pipeline-using-SparkStreaming-HDFS-Kafka-Hive

OBJECTIVE

The objective of this project is to gain hands-on coding experience with:

  • Spark
  • Spark SQL
  • Spark Streaming
  • Kafka
  • Scala and functional programming

DATA SET

The data set is the STM GTFS data analyzed in Course 1.

PROBLEM STATEMENT

We receive STM data every day and need to run an ETL pipeline that enriches it in real time for reporting and analysis. The data is split into two parts:

  1. A set of tables that build the dimensions (batch style)
  2. Stop times that need to be enriched for analysis and reporting (streaming)
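The two parts above can be sketched as a single Spark job: the dimension tables are loaded as static DataFrames from Hive, and the stop times arrive as a stream from Kafka and are enriched via a stream-static join. This is only an illustrative sketch, not the project's actual code; the Hive table names (`gtfs.trips`, `gtfs.routes`), the Kafka topic (`stop_times`), the CSV field layout, and the HDFS paths are all assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StopTimesEnrichment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("STM GTFS Stop Times Enrichment")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Batch side: dimension tables read from Hive (table names assumed)
    // and joined once into a single dimension DataFrame.
    val trips  = spark.table("gtfs.trips")
    val routes = spark.table("gtfs.routes")
    val tripDim = trips.join(routes, "route_id")

    // Streaming side: stop_times records arriving on a Kafka topic
    // (topic name, broker address, and CSV layout are assumptions).
    val stopTimesRaw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "stop_times")
      .load()

    // Parse the Kafka value bytes as a CSV line into named columns.
    val stopTimes = stopTimesRaw
      .selectExpr("CAST(value AS STRING) AS csv")
      .select(split($"csv", ",").as("f"))
      .select(
        $"f".getItem(0).as("trip_id"),
        $"f".getItem(1).as("arrival_time"),
        $"f".getItem(2).as("departure_time"),
        $"f".getItem(3).as("stop_id"))

    // Stream-static join: each streaming stop_time row is enriched
    // with trip/route attributes from the batch dimension.
    val enriched = stopTimes.join(tripDim, "trip_id")

    // Write the enriched stream to HDFS as Parquet for reporting.
    enriched.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/enriched_stop_times")
      .option("checkpointLocation", "hdfs:///checkpoints/stop_times")
      .start()
      .awaitTermination()
  }
}
```

The stream-static join works here because Spark Structured Streaming allows joining a streaming DataFrame with a static one without watermarks; the dimension side is re-read per micro-batch, so slowly changing dimensions are picked up automatically.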
