Python script to scrape historical PSI readings from www.haze.gov.sg into pandas DataFrame and CSV formats.
During the 2013 Southeast Asian haze, I was interning with the Solar Energy Research Institute of Singapore (SERIS). One of my tasks was to transcribe a year of PSI readings (one day of PSI readings -> one CSV file), which I chose to automate.
Although my 3-year NDA has expired, I will not publish the photovoltaic system data that was examined in connection with this dataset. Because I played no role in its collection, I continue to identify that photovoltaic system data as the property of SERIS.
Consequently, this dataset is of limited use in and of itself. It may, however, be useful when paired with other datasets, for example in searching for possible correlations with PSI.
As of April 2016, data.gov.sg publishes an API for obtaining PSI and PM2.5 readings. Because its documentation links were still broken as of December 2016, I have chosen to publish this script. For future data collection, however, I recommend using the data.gov.sg API directly.
Run this script from the command line with python psi_scraper.py. This script should continue to work as long as the HTML structure of the www.haze.gov.sg/haze-updates/historical-psi-readings/ page does not change.
There are three convenience datasets provided in the datasets folder:
- datasets/sg-nea-psi.pickle: one pickle containing a DataFrame with all the data
- datasets/sg-nea-psi.csv: one CSV containing all the data
- datasets/sg-nea-psi.zip: individual CSV files by day
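The per-day archive can be read without unpacking it to disk. The following is a minimal sketch using only the standard library; the helper name read_daily_csvs is hypothetical and not part of this repository, and the UTF-8 encoding is an assumption about how the CSV files were written. No assumption is made about the column layout, so rows are returned as raw lists of strings.

```python
import csv
import io
import zipfile


def read_daily_csvs(zip_path):
    """Yield (member_name, rows) for every CSV file inside the archive.

    rows is a list of lists of strings, exactly as stored in the file;
    no header handling or type conversion is attempted.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".csv"):
                continue  # skip directories and any non-CSV members
            with archive.open(name) as raw:
                # Assumption: the files are UTF-8 encoded.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield name, list(csv.reader(text))
```

For example, `for name, rows in read_daily_csvs("datasets/sg-nea-psi.zip"): print(name, len(rows))` would list each daily file with its row count. The single combined CSV can instead be loaded in one call with pandas.read_csv.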
The provided datasets range from the beginning of published data until the end of November 2016, i.e. 2010-01-01 to 2016-11-30. The data is as-is from the NEA website, i.e. you will still need to perform data munging.
Note also that before April 1, 2014, PM2.5 readings were reported separately. From that date onward (inclusive), PM2.5 readings were subsumed into PSI.
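When munging the data, it may help to encode this cutoff explicitly so that the inclusive boundary is handled consistently. The sketch below is illustrative only; the names PM25_MERGED_FROM and has_separate_pm25 are hypothetical and do not appear in the script.

```python
from datetime import date

# First day on which PM2.5 readings were subsumed into PSI (inclusive),
# as stated in the note above.
PM25_MERGED_FROM = date(2014, 4, 1)


def has_separate_pm25(day):
    """Return True if `day` falls before PM2.5 was subsumed into PSI."""
    return day < PM25_MERGED_FROM
```

With this helper, March 31, 2014 counts as a day with separate PM2.5 readings, while April 1, 2014 does not.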
This script has been ported to Python 3.5 and uses the following libraries:
Do whatever you want with the code; attribution is up to you. Note, however, that the dataset ultimately belongs to the National Environment Agency of Singapore. If there are any legal and/or licensing issues related to this repository, please send me a pull request or email.