Skip to content

๐Ÿ”ฎ๐ŸŒ ML model predicting groundwater levels for French piezometric stations to tackle water shortage issues, built for the Hi!ckathon 2024 ๐Ÿšฑ๐Ÿ’ง

License

Notifications You must be signed in to change notification settings

zhukovanadezhda/water-scarcity

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

34 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿšฑ Water Shortage Prediction at Hi!ckathon 2024

hi-paris

๐Ÿ” Overview

This repository contains the work developed by our team for the Hi!ckathon, a competition focused on AI and sustainability organized by Hi! PARIS - the Center on Data Analytics and Artificial Intelligence for Science, Business and Society created by Institut Polytechnique de Paris and HEC Paris and joined by Centre Inria de Saclay. The goal of our project was to build an AI model capable of predicting groundwater levels for French piezometric stations, with a special emphasis on the summer months. Our model uses a variety of data sources, including piezometric data, weather patterns, hydrology, water withdrawal, and economic data, to make accurate predictions.

In addition to model development, we were tasked with considering the real-world application of our solution and projecting how it could be used in the market to address water shortages ๐ŸŒ๐Ÿ’ง

๐Ÿš€ Objective

The primary objective of the project is to:

  • Build a predictive model for forecasting groundwater levels at French piezometric stations.
  • Focus specifically on the summer months, as they are crucial for water resource management.
  • Leverage multiple data sources, including weather, hydrology, water withdrawal, and economic data, to improve prediction accuracy.
  • Explore and design a real-world application of the model to address water shortage issues.

๐Ÿ‘ฅ Our Team

Team Picture

๐ŸŽฏ Our Approach

The target variable is categorical, with 5 balanced classes representing groundwater levels: very low, low, average, high, and very high. Since the data is balanced, no specific techniques for handling imbalanced data were necessary, and the models were trained to perform classification.

The data preprocessing steps included removing columns with over 80% missing values, followed by imputing the remaining missing values with either the median or mode. Feature engineering was then performed, as detailed below. All numeric features were scaled, and the target variable was encoded as integers from 0 to 4.

Subsequently, five models were trained and evaluated using 3-fold cross-validation, with results presented in the results section. The best-performing model was a random forest, which underwent grid search for hyperparameter tuning. The final F1 score of this model on the test set was 58.36%, placing the team 15th out of 60 teams.

Feature Engineering

Feature Description
day Extracted day from the meteo_date. Represents the day of the month.
month Extracted month from the meteo_date. Represents the month of the year.
quarter Extracted quarter from the meteo_date. Represents which quarter of the year (1 to 4).
year Extracted year from the meteo_date. Represents the year of the data point.
day_sin Sin transformation of the day feature. Converts the day of the month into a periodic value for modeling cyclical behavior.
day_cos Cos transformation of the day feature. This, alongside day_sin, captures the periodicity of the day of the month.
month_sin Sin transformation of the month feature. Converts the month into a periodic value to model cyclical patterns (seasons, etc.).
month_cos Cos transformation of the month feature. Works together with month_sin to capture the cyclical nature of months.
quarter_sin Sin transformation of the quarter feature. Captures the cyclic behavior of the four seasons in a year.
quarter_cos Cos transformation of the quarter feature. Works together with quarter_sin to capture the periodic nature of quarters.
meteo_temperature_avg_lag_1 Lag feature representing the average temperature from the previous year. This helps capture long-term temperature trends.
meteo_rain_height_lag_1 Lag feature representing the rainfall from the previous year. Similar to temperature lag, this captures long-term precipitation trends.
meteo_temperature_avg_rolling_mean_7 Rolling mean of the average temperature over a 7-day window. This smooths out short-term fluctuations and helps capture medium-term temperature trends.
meteo_rain_height_rolling_sum_7 Rolling sum of the rainfall over a 7-day window. Helps to capture cumulative rainfall over a short period.
temperature_wind_interaction Interaction feature between average temperature and wind speed. Helps to capture the joint effect of temperature and wind on environmental conditions.
humidity_rain_interaction Interaction feature between humidity and rainfall. Helps to understand how the two variables interact and affect the environment together.
temperature_range Difference between the maximum and minimum temperature. Captures the temperature variability within a day or over time.
evapotranspiration_to_rain_ratio Ratio of evapotranspiration to rainfall. Helps understand how the amount of water evaporated compares to the rainfall, influencing soil moisture.
altitude_difference Difference between the piezo station altitude and the meteorological station altitude. Helps to capture geographic effects on environmental conditions.
cumulative_rainfall_30_days Rolling sum of rainfall over a 30-day window. Captures long-term trends in precipitation.

๐Ÿ“Š Results

Model Accuracy F1 Score Precision Recall AUC-ROC
Random Forest 0.7149 ยฑ 0.0004 0.7212 ยฑ 0.0005 0.7231 ยฑ 0.0010 0.7199 ยฑ 0.0001 0.9222 ยฑ 0.0003
XGBoost 0.6261 ยฑ 0.0014 0.6349 ยฑ 0.0014 0.6352 ยฑ 0.0013 0.6349 ยฑ 0.0016 0.8821 ยฑ 0.0007
LightGBM 0.5851 ยฑ 0.0014 0.5928 ยฑ 0.0015 0.5925 ยฑ 0.0015 0.5940 ยฑ 0.0015 0.8592 ยฑ 0.0010
CNN 0.5223 ยฑ 0.0048 0.5227 ยฑ 0.0049 0.5241 ยฑ 0.0047 0.5223 ยฑ 0.0048 0.8342 ยฑ 0.0027
AdaBoost 0.3390 ยฑ 0.0020 0.3411 ยฑ 0.0025 0.3432 ยฑ 0.0032 0.3408 ยฑ 0.0018 0.6756 ยฑ 0.0007

๐Ÿ–ฅ๏ธ Run the code

Set up

First, clone the repository and navigate to the project folder:

git clone git@github.com:zhukovanadezhda/water-scarcity.git
cd water-scarcity

To set up the environment and install the required dependencies, use the following commands:

conda env create -f environment.yml
conda activate water-scarcity

Preprocessing

Download the data to the data folder (contact us to get the data). Then run this command to get the train and test datasets:

python scripts/preprocess_data.py --path <data_file_path> [--is_train]
    --path        Path to the CSV data file (training or test).
    --is_train    Flag to indicate training data (optional).

Models training and evaluation

After the preprocessing is completed, use one of two scripts train_cnn.py or train_models.py to train and evaluate corresponding models.

python scripts/train_cnn.py --X_path data/X_train.csv --y_path data/y_train.csv 
    --X_path      Path to the CSV file containing the training features.
    --y_path      Path to the CSV file containing the training labels.

๐Ÿค Acknowledgments

  • Hi! PARIS for organizing the Hi!ckathon and providing the opportunity to work on impactful sustainability challenges ๐ŸŽ‰
  • The participants, mentors, and organizers for their valuable feedback and support during the competition.

About

๐Ÿ”ฎ๐ŸŒ ML model predicting groundwater levels for French piezometric stations to tackle water shortage issues, built for the Hi!ckathon 2024 ๐Ÿšฑ๐Ÿ’ง

Topics

Resources

License

Stars

Watchers

Forks

Languages