- Data Location is a web URL (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The data needs to be downloaded to the local environment and saved as raw files in a cloud location. For the purpose of this project, only 2020 data for Yellow Taxis is relevant.
- The data has a lot of anomalies and needs cleanup in terms of datatype definition, duplicate purging, removal of irrelevant data, etc.
- The data cleaned in the above step needs to be saved in Parquet file format (see the ETL sketch after this list).
- By reading the data stored in Parquet file format, insights need to be derived (see the insights sketch after this list), which consist of:
- finding the DOLocation with the highest tip, in proportion to the cost of the ride, per quarter.
- finding the hour of the day at which taxi speed is highest.
- The above insights need to be stored in a relational database.
- A REST API would serve the insights to the end user from the relational database.
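Below is a minimal PySpark sketch of the cleanup-to-Parquet step. The DBFS paths, the Yellow Taxi column names (tpep_pickup_datetime, tip_amount, DOLocationID, etc.), and the specific filter rules are assumptions for illustration, not the exact code in the ETL notebook.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw 2020 Yellow Taxi CSVs downloaded into DBFS (path is an assumption).
raw = (spark.read
       .option("header", True)
       .csv("dbfs:/rawFiles/yellow_tripdata_2020-*.csv"))

# Fix datatypes: the CSVs load every column as a string.
typed = (raw
         .withColumn("tpep_pickup_datetime", F.to_timestamp("tpep_pickup_datetime"))
         .withColumn("tpep_dropoff_datetime", F.to_timestamp("tpep_dropoff_datetime"))
         .withColumn("trip_distance", F.col("trip_distance").cast("double"))
         .withColumn("fare_amount", F.col("fare_amount").cast("double"))
         .withColumn("tip_amount", F.col("tip_amount").cast("double"))
         .withColumn("total_amount", F.col("total_amount").cast("double"))
         .withColumn("DOLocationID", F.col("DOLocationID").cast("int")))

# Purge duplicates and rows that are clearly irrelevant for this analysis:
# trips outside 2020, zero/negative distances, and non-positive totals.
cleaned = (typed
           .dropDuplicates()
           .filter(F.year("tpep_pickup_datetime") == 2020)
           .filter(F.col("trip_distance") > 0)
           .filter(F.col("total_amount") > 0))

# Persist the cleaned data as Parquet for the insights notebook.
cleaned.write.mode("overwrite").parquet("dbfs:/cleanFiles/yellow_tripdata_2020")
```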
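And a sketch of the two insight aggregations, assuming the cleaned Parquet path and column names above. The exact metric definitions (average tip as a fraction of fare_amount, speed as trip_distance divided by trip duration in hours, grouping by pickup date and hour) are my assumptions, not necessarily what the Insights notebook does.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
trips = spark.read.parquet("dbfs:/cleanFiles/yellow_tripdata_2020")

# Insight 1: drop-off location with the highest tip, as a proportion of the
# ride cost, per quarter.
tip_ratio = (trips
             .filter(F.col("fare_amount") > 0)
             .withColumn("quarter", F.quarter("tpep_pickup_datetime"))
             .withColumn("tip_pct", F.col("tip_amount") / F.col("fare_amount"))
             .groupBy("quarter", "DOLocationID")
             .agg(F.avg("tip_pct").alias("avg_tip_pct")))

best_per_quarter = Window.partitionBy("quarter").orderBy(F.desc("avg_tip_pct"))
max_tip = (tip_ratio
           .withColumn("rn", F.row_number().over(best_per_quarter))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Insight 2: average speed per pickup date and hour. The hour of the day with
# the highest speed falls out of this table, and it also backs a per-date lookup.
duration_hrs = (F.unix_timestamp("tpep_dropoff_datetime")
                - F.unix_timestamp("tpep_pickup_datetime")) / 3600.0
trip_speed = (trips
              .withColumn("duration_hrs", duration_hrs)
              .filter(F.col("duration_hrs") > 0)
              .withColumn("speed_mph", F.col("trip_distance") / F.col("duration_hrs"))
              .withColumn("trip_date", F.to_date("tpep_pickup_datetime"))
              .withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
              .groupBy("trip_date", "pickup_hour")
              .agg(F.avg("speed_mph").alias("avg_speed_mph")))
```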
- Compute: I'm using Databricks infrastructure for ETL and insights, with PySpark, Scala, and Spark SQL.
- Raw Storage: For the raw layer I'm using the Databricks File System (DBFS). In a production scenario I'd have chosen ADLS, which has good connectivity with Azure services for ETL and for building a representation layer on top of the data.
- Relational Storage: I've chosen Microsoft SQL Server on Azure for this use case, where I'm hosting my two tables, trip_speed and max_tip (see the JDBC sketch after this list).
- End User Accessibility: I'm making the data available to the end user through a REST API endpoint hosted on my local machine.
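A hedged sketch of pushing the two insight dataframes (max_tip and trip_speed from the sketch above) into the Azure SQL Server tables over JDBC; the server, database, and credentials are placeholders, not the project's real configuration.

```python
# Write the two insight dataframes to Azure SQL Server over JDBC.
# Server, database, and credentials below are placeholders.
jdbc_url = (
    "jdbc:sqlserver://<server-name>.database.windows.net:1433;"
    "database=<database-name>;encrypt=true;loginTimeout=30;"
)
connection_properties = {
    "user": "<sql-user>",
    "password": "<sql-password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

max_tip.write.jdbc(url=jdbc_url, table="max_tip", mode="overwrite",
                   properties=connection_properties)
trip_speed.write.jdbc(url=jdbc_url, table="trip_speed", mode="overwrite",
                      properties=connection_properties)
```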
- Exact Assignment - ETL.ipynb: a notebook that can be imported directly into a Databricks cluster and run after installing the wget library.
- Once the data is written to the rawFiles folder in DBFS, the second notebook, Exact Assignment - Insights.ipynb, can be run to perform analytics on the data. Make sure you spin up a database and set up a connection to an external SQL Server database to store the query results.
- Once the data is ready in the SQL tables, the Code folder's api.py can be executed, which internally calls db_models.py and Config.py.
- On localhost, the endpoints below can be hit; in the backend they query the SQL Server database and fetch the results (see the API sketch after this list).
- /api/tip/2020/01/max
- /api/tips/2020/max
- /api/speed/2020/01/01/max
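For illustration, here is a self-contained sketch of what one of these endpoints could look like. The real api.py wires in db_models.py and Config.py, which aren't shown here; the connection string is a placeholder and the trip_speed schema (trip_date, pickup_hour, avg_speed_mph) is an assumption carried over from the insights sketch above.

```python
from flask import Flask, jsonify
import pyodbc

app = Flask(__name__)

# Placeholder connection string; the real project loads this from Config.py.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server-name>.database.windows.net;DATABASE=<database-name>;"
    "UID=<sql-user>;PWD=<sql-password>"
)

def query_one(sql, params):
    """Run a query and return the first row as a dict, or None."""
    with pyodbc.connect(CONN_STR) as conn:
        cur = conn.cursor()
        cur.execute(sql, params)
        row = cur.fetchone()
        if row is None:
            return None
        return dict(zip([c[0] for c in cur.description], row))

@app.route("/api/speed/<int:year>/<int:month>/<int:day>/max")
def max_speed(year, month, day):
    # Hour with the highest average speed for the given date.
    row = query_one(
        "SELECT TOP 1 pickup_hour, avg_speed_mph FROM trip_speed "
        "WHERE YEAR(trip_date) = ? AND MONTH(trip_date) = ? AND DAY(trip_date) = ? "
        "ORDER BY avg_speed_mph DESC",
        (year, month, day))
    return jsonify(row or {"error": "no data"})

if __name__ == "__main__":
    app.run(debug=True)
```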
- Fin!
I'm considering Microsoft Azure stack services for the different compute and storage needs. The solution has 4 technical pillars, described above.
I have used Databricks Community Edition for most of the scratch work on dataframe building and data cleansing.

Architecture Diagram