# NYC Taxi Data Analytics

## Problem Statement

- The data lives at a web URL (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). It needs to be downloaded to the local environment and saved as raw files in a cloud location. For the purpose of this project, only 2020 data for Yellow Taxis is relevant.
- The data has a lot of anomalies that need cleanup: datatype definitions, duplicate purging, removal of irrelevant data, etc.
- The cleaned data from the step above needs to be saved in Parquet file format.
- From the data stored in Parquet, the following insights need to be derived (see the PySpark sketch after this list):
  - the DOLocation with the highest tip, in proportion to the cost of the ride, per quarter;
  - the hour of the day at which taxi speed is highest.
- These insights need to be stored in a relational database.
- A REST API serves the insights to the end user from the relational database.

## Solution Approach

I'm considering Microsoft Azure stack services for the different compute and storage needs. The solution has four technical pillars:

- Compute: I'm using Databricks infrastructure for ETL and insights with PySpark, Scala, and Spark SQL.
- Raw Storage: for the raw layer I'm using the Databricks File System (in a production scenario I'd have chosen ADLS, which has good connectivity with Azure services for ETL and for building a representational layer on top of the data).
- Relational Storage: I've chosen Microsoft SQL Server on Azure for this use case, where I'm hosting my two tables, trip_speed and max_tip (a JDBC write sketch follows this list).
- End User Accessibility: I'm making the data available to the end user through a REST API endpoint hosted on my local machine.
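
As a rough sketch of the Relational Storage pillar, the insight DataFrames could be written to Azure SQL Server through Spark's JDBC writer. The server, database, and credentials below are placeholders; the real connection details are configured in the Insights notebook.

```python
# Placeholder connection details; substitute your own Azure SQL server and credentials.
jdbc_url = (
    "jdbc:sqlserver://<your-server>.database.windows.net:1433;"
    "database=<your-database>;encrypt=true;loginTimeout=30;"
)
connection_properties = {
    "user": "<sql-user>",
    "password": "<sql-password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# max_tip and trip_speed are the DataFrames computed in the insights step above.
max_tip.write.jdbc(url=jdbc_url, table="max_tip", mode="overwrite",
                   properties=connection_properties)
trip_speed.write.jdbc(url=jdbc_url, table="trip_speed", mode="overwrite",
                      properties=connection_properties)
```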

I have used Databricks Community Edition for most of the scratch work on dataframe building and data cleansing.

### Architecture Diagram

*(architecture diagram image)*

## Execution Approach

- `Exact Assignment - ETL.ipynb` is a notebook that can be imported directly into a Databricks cluster and run after installing the wget library.
- Once the data is written to the rawFiles folder in DBFS, the second notebook, `Exact Assignment - Insights.ipynb`, can be run to perform analytics on the data. Make sure you spin up a database and set up a connection to an external SQL Server database to store the query results.
- Once the data is ready in the SQL tables, the Code folder's `api.py` can be executed, which internally calls `db_models.py` and `Config.py`.
- On localhost the endpoints below can be hit; in the backend they query the SQL Server database and fetch the results (an illustrative Flask sketch follows this list):
  - /api/tip/2020/01/max
  - /api/tips/2020/max
  - /api/speed/2020/01/01/max
- Fin!
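
The repository's `api.py` is the actual implementation. Purely as an illustration of the endpoint shape, a minimal Flask sketch along these lines could serve two of the routes, assuming hypothetical table and column names (`max_tip_ratio`, `avg_speed_mph`, `year`/`month`/`day` columns) and a placeholder connection string.

```python
# Illustrative sketch only (not the repo's api.py); table/column names and the
# connection string are assumptions made for this example.
import pyodbc
from flask import Flask, jsonify

app = Flask(__name__)

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=<your-database>;UID=<sql-user>;PWD=<sql-password>"
)

def query_one(sql, params):
    # One connection per request for simplicity; a pool would be used in production.
    conn = pyodbc.connect(CONN_STR)
    try:
        row = conn.cursor().execute(sql, params).fetchone()
        return dict(zip([c[0] for c in row.cursor_description], row)) if row else {}
    finally:
        conn.close()

@app.route("/api/tip/<int:year>/<int:month>/max")
def max_tip_for_month(year, month):
    # Drop-off location with the highest tip-to-cost ratio for the given month.
    return jsonify(query_one(
        "SELECT TOP 1 * FROM max_tip WHERE year = ? AND month = ? "
        "ORDER BY max_tip_ratio DESC", (year, month)))

@app.route("/api/speed/<int:year>/<int:month>/<int:day>/max")
def max_speed_for_day(year, month, day):
    # Hour of the day with the highest average speed for the given date.
    return jsonify(query_one(
        "SELECT TOP 1 * FROM trip_speed WHERE year = ? AND month = ? AND day = ? "
        "ORDER BY avg_speed_mph DESC", (year, month, day)))

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```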