- Data Location is a web URL (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The data needs to be downloaded to the local environment and saved as raw files in a cloud location. For the purpose of this project, only 2020 data for Yellow Taxis is relevant.
- The data has a lot of anomalies and needs cleanup in terms of datatype definition, duplicate purging, removal of irrelevant data, etc.
- The data cleaned in the above step needs to be saved in Parquet file format (see the ETL sketch after this list).
- By reading the data stored in Parquet file format, insights need to be derived (see the insights sketch after this list), which consist of:
- finding the DOLocation with the highest tip, in proportion to the cost of the ride, per quarter.
- finding the hour of the day at which taxi speed is highest.
- The above insights need to be stored in a relational database.
- A REST API would serve the insights to the end user from the relational database.
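Below is a minimal PySpark sketch of the cleanup-to-Parquet step. The DBFS paths, the Yellow Taxi column names (tpep_pickup_datetime, tip_amount, DOLocationID, etc.), and the specific filter rules are assumptions for illustration, not the exact code in the ETL notebook.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw 2020 Yellow Taxi CSVs downloaded into DBFS (path is an assumption).
raw = (spark.read
       .option("header", True)
       .csv("dbfs:/rawFiles/yellow_tripdata_2020-*.csv"))

# Fix datatypes: the CSVs load every column as a string.
typed = (raw
         .withColumn("tpep_pickup_datetime", F.to_timestamp("tpep_pickup_datetime"))
         .withColumn("tpep_dropoff_datetime", F.to_timestamp("tpep_dropoff_datetime"))
         .withColumn("trip_distance", F.col("trip_distance").cast("double"))
         .withColumn("fare_amount", F.col("fare_amount").cast("double"))
         .withColumn("tip_amount", F.col("tip_amount").cast("double"))
         .withColumn("total_amount", F.col("total_amount").cast("double"))
         .withColumn("DOLocationID", F.col("DOLocationID").cast("int")))

# Purge duplicates and rows that are clearly irrelevant for this analysis:
# trips outside 2020, zero/negative distances, and non-positive totals.
cleaned = (typed
           .dropDuplicates()
           .filter(F.year("tpep_pickup_datetime") == 2020)
           .filter(F.col("trip_distance") > 0)
           .filter(F.col("total_amount") > 0))

# Persist the cleaned data as Parquet for the insights notebook.
cleaned.write.mode("overwrite").parquet("dbfs:/cleanFiles/yellow_tripdata_2020")
```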
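And a sketch of the two insight aggregations, assuming the cleaned Parquet path and column names above. The exact metric definitions (average tip as a fraction of fare_amount, speed as trip_distance divided by trip duration in hours, grouping by pickup date and hour) are my assumptions, not necessarily what the Insights notebook does.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
trips = spark.read.parquet("dbfs:/cleanFiles/yellow_tripdata_2020")

# Insight 1: drop-off location with the highest tip, as a proportion of the
# ride cost, per quarter.
tip_ratio = (trips
             .filter(F.col("fare_amount") > 0)
             .withColumn("quarter", F.quarter("tpep_pickup_datetime"))
             .withColumn("tip_pct", F.col("tip_amount") / F.col("fare_amount"))
             .groupBy("quarter", "DOLocationID")
             .agg(F.avg("tip_pct").alias("avg_tip_pct")))

best_per_quarter = Window.partitionBy("quarter").orderBy(F.desc("avg_tip_pct"))
max_tip = (tip_ratio
           .withColumn("rn", F.row_number().over(best_per_quarter))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Insight 2: average speed per pickup date and hour. The hour of the day with
# the highest speed falls out of this table, and it also backs a per-date lookup.
duration_hrs = (F.unix_timestamp("tpep_dropoff_datetime")
                - F.unix_timestamp("tpep_pickup_datetime")) / 3600.0
trip_speed = (trips
              .withColumn("duration_hrs", duration_hrs)
              .filter(F.col("duration_hrs") > 0)
              .withColumn("speed_mph", F.col("trip_distance") / F.col("duration_hrs"))
              .withColumn("trip_date", F.to_date("tpep_pickup_datetime"))
              .withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
              .groupBy("trip_date", "pickup_hour")
              .agg(F.avg("speed_mph").alias("avg_speed_mph")))
```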
- Compute: I'm using Databricks infrastructure for ETL and insights, with PySpark, Scala, and Spark SQL.
- Raw Storage: For the raw layer I'm using the Databricks File System (DBFS). In a production scenario I'd have chosen ADLS, which has good connectivity with Azure services for ETL and for building a representation layer on top of the data.
- Relational Storage: I've chosen Microsoft SQL Server on Azure for this use case, where I'm hosting my two tables, trip_speed and max_tip (see the JDBC sketch after this list).
- End User Accessibility: I'm making the data available to the end user through a REST API endpoint hosted on my local machine.
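A hedged sketch of pushing the two insight dataframes (max_tip and trip_speed from the sketch above) into the Azure SQL Server tables over JDBC; the server, database, and credentials are placeholders, not the project's real configuration.

```python
# Write the two insight dataframes to Azure SQL Server over JDBC.
# Server, database, and credentials below are placeholders.
jdbc_url = (
    "jdbc:sqlserver://<server-name>.database.windows.net:1433;"
    "database=<database-name>;encrypt=true;loginTimeout=30;"
)
connection_properties = {
    "user": "<sql-user>",
    "password": "<sql-password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

max_tip.write.jdbc(url=jdbc_url, table="max_tip", mode="overwrite",
                   properties=connection_properties)
trip_speed.write.jdbc(url=jdbc_url, table="trip_speed", mode="overwrite",
                      properties=connection_properties)
```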
- Exact Assignment - ETL.ipynb: a notebook that can be imported directly into a Databricks cluster and run after installing the wget library.
- Once the data is written to the rawFiles folder in DBFS, the second notebook, Exact Assignment - Insights.ipynb, can be run to perform analytics on the data. Make sure you spin up a database and set up a connection to an external SQL Server database to store the query results.
- Once the data is ready in the SQL tables, the Code folder's api.py can be executed, which internally calls db_models.py and Config.py.
- On localhost, the endpoints below can be hit; in the backend they query the SQL Server database and fetch the results (see the API sketch after this list).
- /api/tip/2020/01/max
- /api/tips/2020/max
- /api/speed/2020/01/01/max
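For illustration, here is a self-contained sketch of what one of these endpoints could look like. The real api.py wires in db_models.py and Config.py, which aren't shown here; the connection string is a placeholder and the trip_speed schema (trip_date, pickup_hour, avg_speed_mph) is an assumption carried over from the insights sketch above.

```python
from flask import Flask, jsonify
import pyodbc

app = Flask(__name__)

# Placeholder connection string; the real project loads this from Config.py.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server-name>.database.windows.net;DATABASE=<database-name>;"
    "UID=<sql-user>;PWD=<sql-password>"
)

def query_one(sql, params):
    """Run a query and return the first row as a dict, or None."""
    with pyodbc.connect(CONN_STR) as conn:
        cur = conn.cursor()
        cur.execute(sql, params)
        row = cur.fetchone()
        if row is None:
            return None
        return dict(zip([c[0] for c in cur.description], row))

@app.route("/api/speed/<int:year>/<int:month>/<int:day>/max")
def max_speed(year, month, day):
    # Hour with the highest average speed for the given date.
    row = query_one(
        "SELECT TOP 1 pickup_hour, avg_speed_mph FROM trip_speed "
        "WHERE YEAR(trip_date) = ? AND MONTH(trip_date) = ? AND DAY(trip_date) = ? "
        "ORDER BY avg_speed_mph DESC",
        (year, month, day))
    return jsonify(row or {"error": "no data"})

if __name__ == "__main__":
    app.run(debug=True)
```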
- Fin!
I'm considering Microsoft Azure stack services for the different compute and storage needs. The solution has 4 technical pillars, described above.
I have used Databricks Community Edition for most of the scratch work on dataframe building and data cleansing.

Architecture Diagram