
Spark-ETL-Data-Pipeline-using-SparkStreaming-HDFS-Kafka-Hive

OBJECTIVE

The objective of this project is to gain hands-on coding experience with:

  • Spark
  • Spark SQL
  • Spark Streaming
  • Kafka
  • Scala and functional programming

DATA SET

The data set is the STM GTFS data analyzed in Course 1.

PROBLEM STATEMENT

We receive STM data every day and need to run an ETL pipeline that enriches it in real time for reporting and analysis. The data is split into two parts:

  1. A set of tables that build the dimensions (batch style)
  2. Stop times that need to be enriched for analysis and reporting (streaming)
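The two parts above can be sketched as a single Spark job: the dimension tables are loaded as static DataFrames from Hive, and the stop times arrive as a stream from Kafka and are enriched via a stream-static join. This is only an illustrative sketch, not the project's actual code; the Hive table names (`gtfs.trips`, `gtfs.routes`), the Kafka topic (`stop_times`), the CSV field layout, and the HDFS paths are all assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StopTimesEnrichment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("STM GTFS Stop Times Enrichment")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Batch side: dimension tables read from Hive (table names assumed)
    // and joined once into a single dimension DataFrame.
    val trips  = spark.table("gtfs.trips")
    val routes = spark.table("gtfs.routes")
    val tripDim = trips.join(routes, "route_id")

    // Streaming side: stop_times records arriving on a Kafka topic
    // (topic name, broker address, and CSV layout are assumptions).
    val stopTimesRaw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "stop_times")
      .load()

    // Parse the Kafka value bytes as a CSV line into named columns.
    val stopTimes = stopTimesRaw
      .selectExpr("CAST(value AS STRING) AS csv")
      .select(split($"csv", ",").as("f"))
      .select(
        $"f".getItem(0).as("trip_id"),
        $"f".getItem(1).as("arrival_time"),
        $"f".getItem(2).as("departure_time"),
        $"f".getItem(3).as("stop_id"))

    // Stream-static join: each streaming stop_time row is enriched
    // with trip/route attributes from the batch dimension.
    val enriched = stopTimes.join(tripDim, "trip_id")

    // Write the enriched stream to HDFS as Parquet for reporting.
    enriched.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/enriched_stop_times")
      .option("checkpointLocation", "hdfs:///checkpoints/stop_times")
      .start()
      .awaitTermination()
  }
}
```

The stream-static join works here because Spark Structured Streaming allows joining a streaming DataFrame with a static one without watermarks; the dimension side is re-read per micro-batch, so slowly changing dimensions are picked up automatically.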
