A repository containing the notebook for my big data project: EDA and machine learning on the NY Taxi Fare dataset
-
Motivation: Building and deploying a data science project with Apache Spark and AWS. After gaining enough theoretical knowledge of the data science workflow and machine learning techniques, and implementing a few projects locally, I wanted to go further and work through a workflow typically used in industry for development and deployment (that is, cloud operations), so I searched for a large enough dataset to try it on.
-
Challenge: Getting up and running with Spark and its Python API, PySpark, as well as managing clusters on AWS EMR, none of which I had done before. This project also began my attempt at completing one data science project each month, starting with this one for October.
-
Accomplishment: Successfully performed EDA and feature engineering on the dataset, then used Spark MLlib to build random forest (RF) and decision tree (DT) models, achieving an RMSE of 4.28.
- Python 3
- pandas
- matplotlib
- seaborn
- PySpark
- AWS EMR
- AWS EC2
- Spark MLlib
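
For readers unfamiliar with the metric quoted under Accomplishment, here is a minimal sketch of how RMSE (root mean squared error) is computed. The `rmse` helper and the fare values are made up for illustration and are not taken from the notebook or the dataset. In PySpark the same quantity is typically obtained from `RegressionEvaluator` with `metricName="rmse"`.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each prediction error, average them, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical fares in dollars, made up for illustration (not from the dataset).
actual_fares = [10.0, 12.0, 8.0, 20.0]
predicted_fares = [13.0, 12.0, 4.0, 20.0]

print(rmse(actual_fares, predicted_fares))
```

Because RMSE is expressed in the same units as the target, a value of 4.28 can be read as a typical prediction error of roughly that many dollars of fare, with larger errors weighted more heavily than smaller ones.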