For data science students in Singapore, it is hard to find detailed yet publicly available local datasets for lessons or personal projects. I came across a multi-decade collection of weather data on the Singapore Met Service's website by chance, and decided to assemble it for future use and as a backup in case the data is taken offline.
I'm also using the dataset for a series of self-assigned data science projects, starting with visualisation. I will include time series and machine learning forecasts in future updates to this project.
There are 5 sections so far. The CSV files containing the daily and monthly weather data are in the raw folder. Those who want to assemble their own datasets should head there first.
What you'll find in the raw folder:
- 444 CSV files containing daily weather data for Singapore from January 1983 to December 2019
- A "monthly_data" sub-folder containing monthly average data for rainfall, maximum and mean temperatures.
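For those assembling their own dataset, stacking the daily files into one table could look roughly like the sketch below. It is not the code from the cleaning notebook: the function name `assemble_daily_csvs`, the folder path `raw` and the `utf-8` default encoding are assumptions, and the real files may need a different encoding or extra cleaning of column names.

```python
from pathlib import Path

import pandas as pd


def assemble_daily_csvs(raw_dir, pattern="*.csv", encoding="utf-8"):
    """Concatenate every CSV sitting directly inside raw_dir.

    Hypothetical helper for illustration. Path.glob with "*.csv" does not
    descend into sub-folders, so files in "monthly_data" are left out.
    """
    files = sorted(Path(raw_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {raw_dir}")
    # Read each file, then stack them row-wise with a fresh 0..n-1 index.
    frames = [pd.read_csv(f, encoding=encoding) for f in files]
    return pd.concat(frames, ignore_index=True)


# Example call (assumes the repo's raw folder is in the working directory):
# daily = assemble_daily_csvs("raw")
```

If the files do not all share the same column headers, `pd.concat` will still run but will fill the gaps with missing values, so it is worth checking `daily.columns` and `daily.isna().sum()` before going further.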
What you'll find in the data folder:
- 4 CSV files processed in the notebook 1.0_data_cleaning_cch
- 2 CSV files related to outlier detection, as processed in the notebook 3.0_outlier_detection_cch.ipynb
- 1 CSV file related to the Q3 2019 scorcher in Singapore
- 1 CSV file for the machine learning and deep learning notebooks, as processed in notebook 5.0, plus 1 validation dataset.
The lack of seasonal variation lulls many into thinking that Singapore's weather is predictable and unchanging. Nothing could be further from the truth, with climate change making the city state's weather even more unpredictable.
In notebook 2.0_visualisation_cch, I'll attempt to illustrate the changing weather patterns in Singapore using classic as well as new visualisation libraries/techniques like Plotly Express.
Medium post: Visualising Singapore’s Changing Weather Patterns: 1983–2019
Data visualisation provides an easy way to spot outliers. But when you have 36 years of weather data, relying solely on charts to pick out the outliers is neither accurate nor efficient.
In the third section of this project, I'll use Scikit-learn's Isolation Forest model as well as the PyOD library (Python Outlier Detection) to try to pinpoint anomalies in the dataset. This is also important pre-work for Part IV of the project - time series forecasting, where removal of the outliers would be key to more accurate predictions.
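As a rough idea of what the Isolation Forest step involves, here is a minimal sketch on synthetic data. The numbers are made up to stand in for daily mean temperature and rainfall; they are not taken from the dataset, and the `contamination` and `n_estimators` values are illustrative assumptions rather than the settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for (mean temperature in °C, daily rainfall in mm).
normal_days = np.column_stack(
    [rng.normal(27.5, 0.8, 500), rng.gamma(1.0, 5.0, 500)]
)
# Two deliberately extreme rows appended at positions 500 and 501.
extreme_days = np.array([[31.5, 0.0], [23.0, 200.0]])
X = np.vstack([normal_days, extreme_days])

# contamination is the assumed share of outliers; it sets the cut-off score.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
labels = model.fit_predict(X)        # -1 = outlier, 1 = inlier
scores = model.decision_function(X)  # lower score = more anomalous

outlier_idx = np.flatnonzero(labels == -1)
print(outlier_idx)
```

The choice of `contamination` matters: it fixes roughly how many rows get flagged, so it is better treated as a tuning knob to revisit than as a fact about the weather.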
Medium post: Detecting Abnormal Weather Patterns With Data Science Tools
This fourth notebook is a short follow-up of sorts to Part II, looking at how temperatures during the three months between July and September 2019 were among the warmest Singapore had experienced over the last 36 years, as global temperature records tumbled around the world.
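The comparison boils down to averaging each year's July to September temperatures and ranking the years. A sketch of that calculation is below, using randomly generated temperatures in place of the real data; the column names `date` and `mean_temp` are assumptions, and because the values are random the resulting ranking says nothing about the actual 2019 readings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily series covering 1983-2019; NOT real observations.
dates = pd.date_range("1983-01-01", "2019-12-31", freq="D")
daily = pd.DataFrame(
    {"date": dates, "mean_temp": rng.normal(27.5, 1.0, len(dates))}
)

# Keep only July, August and September, then average per calendar year.
q3 = daily[daily["date"].dt.month.isin([7, 8, 9])]
q3_mean = (
    q3.groupby(q3["date"].dt.year)["mean_temp"].mean().rename_axis("year")
)

# Rank 1 = warmest Q3 in the period.
ranking = q3_mean.rank(ascending=False, method="min").astype(int)
print(ranking.sort_values().head())
```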
Medium post: SCORCHER: As Global Records Tumbled, S’pore Baked Under One Of The Warmest Q3 Ever
You are ready to dip your toes into deep learning but are not sure where to start. One way is to build on what you've been doing in Scikit-learn, and apply useful features like pipelines and grid search via the Keras wrappers.
This fifth series of notebooks starts with a simple example of pipeline construction and grid search for a binary classification problem, using Logistic Regression and the XGBoost classifier.
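For readers new to the pattern, a pipeline plus grid search in Scikit-learn looks roughly like this. The sketch uses a synthetic dataset from `make_classification` and only Logistic Regression; the step names `scaler` and `clf` and the grid of `C` values are illustrative assumptions, not the settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the weather features.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is re-fitted on each CV fold.
pipe = Pipeline(
    [("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
)

# "<step name>__<parameter>" routes each value to the right pipeline step.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

Swapping in another Scikit-learn compatible estimator, such as `xgboost.XGBClassifier`, should only require changing the `clf` step and the keys in `param_grid`.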
In notebook 5.2, I tackled the same problem using the Keras Classifier, which introduces the concept of defining and building a Keras sequential model.
In notebook 5.3, I experimented with the relatively new Keras Tuner as an alternative to the Scikit-learn/grid search approach.
Data preparation for this section of the project is in notebook 5.1. The validation dataset is here.
Medium post: https://bit.ly/2QJdrpD