Open Air Quality Data Pipeline Project

This project demonstrates how to build a beginner-friendly data pipeline using OpenAQ (Open Air Quality) data from a public S3 bucket. It aims to balance simplicity for newcomers with enough complexity to introduce important data engineering concepts. The pipeline extracts, transforms, and visualizes data in near real-time, so the dashboard reflects updates as the data evolves.


Project Goals

  1. Educational Focus:

    • Introduce data engineering concepts without overwhelming beginners.
    • Provide a hands-on learning experience with practical tools and workflows.
  2. Tools and Technologies:

    • Python: For scripting (CLI apps), orchestration, and dashboard creation using Plotly Dash.
    • DuckDB: Lightweight, in-process database engine serving as the data warehouse (a short sketch showing both tools together follows this list).
  3. End Result:

    • A functional data pipeline that extracts, transforms, and visualizes air quality data dynamically.
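
To give a feel for how these two tools fit together, here is a minimal sketch (not part of the repository) that reads a DuckDB file from Python and serves one chart with Plotly Dash. The file name and the table and column names (presentation.air_quality, datetime, location, parameter, value) are assumptions based on the schema described further down.

```python
# Minimal sketch: DuckDB as the warehouse, Plotly Dash as the front end.
# Table and column names are assumptions, not the project's actual schema.
import duckdb
import plotly.express as px
from dash import Dash, dcc, html

# Open the warehouse read-only so the dashboard cannot modify it.
con = duckdb.connect("air_quality.db", read_only=True)
df = con.execute(
    """
    SELECT datetime, location, value
    FROM presentation.air_quality
    WHERE parameter = 'pm25'
    ORDER BY datetime
    """
).df()  # fetch the query result as a pandas DataFrame

app = Dash(__name__)
app.layout = html.Div(
    [
        html.H1("PM2.5 by location"),
        dcc.Graph(figure=px.line(df, x="datetime", y="value", color="location")),
    ]
)

if __name__ == "__main__":
    app.run(debug=True)  # serves the dashboard on http://127.0.0.1:8050
```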

Project Structure

  • notebooks/: Scratchpads for experimenting with ideas and testing technologies.
  • sql/: SQL scripts for data extraction and transformation, written in DuckDB’s SQL dialect.
  • pipeline/: CLI applications for executing extraction, transformation, and database management tasks.
  • dashboard/: Plotly Dash code for creating the live air quality dashboard.
  • locations.json: Configuration file containing air quality sensor locations.
  • secrets-example.json: Example configuration for OpenAQ API keys (Note: do not commit actual secrets to version control; an illustrative loading snippet follows this list).
  • requirements.txt: List of Python libraries and dependencies.
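
The exact contents of the two configuration files are not reproduced here, so the snippet below only illustrates how they might be loaded from Python; the key names shown in the comments (sensor_id, openaq_api_key) are hypothetical.

```python
# Illustrative only: reading the two configuration files.
# The key names shown in the comments are assumptions, not the
# project's actual schema.
import json
from pathlib import Path

# locations.json might map readable names to OpenAQ sensor ids,
# e.g. {"locations": [{"name": "Bristol", "sensor_id": 1234}]}
locations = json.loads(Path("locations.json").read_text())

# secrets-example.json might hold a placeholder such as
# {"openaq_api_key": "YOUR-KEY-HERE"}; copy it, fill in a real key,
# and keep the copy out of version control.
secrets = json.loads(Path("secrets-example.json").read_text())

api_key = secrets.get("openaq_api_key")  # hypothetical key name
```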

Database Structure

The DuckDB database includes the following schemas and tables (a sketch of how the presentation objects might be derived follows the list):

  1. raw schema:

    • Contains a single table with all extracted data.
  2. presentation schema:

    • air_quality: The most recent version of each record per location.
    • daily_air_quality_stats: Daily averages for parameters at each location.
    • latest_param_values_per_location: Latest values for each parameter at each location.
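
The SQL that builds these objects lives in the sql/ directory; the sketch below only shows one way such views could be derived from the raw table using DuckDB from Python. The raw table name and every column name (location, parameter, datetime, value, ingestion_datetime) are assumptions.

```python
# Sketch: deriving presentation views from the raw table with DuckDB.
# Table and column names are assumptions; see sql/ for the real queries.
import duckdb

con = duckdb.connect("air_quality.db")
con.execute("CREATE SCHEMA IF NOT EXISTS presentation")

# Most recent version of each record: keep the latest row per
# (location, parameter, datetime) combination.
con.execute(
    """
    CREATE OR REPLACE VIEW presentation.air_quality AS
    SELECT *
    FROM raw.air_quality
    QUALIFY row_number() OVER (
        PARTITION BY location, parameter, datetime
        ORDER BY ingestion_datetime DESC
    ) = 1
    """
)

# Daily averages for each parameter at each location.
con.execute(
    """
    CREATE OR REPLACE VIEW presentation.daily_air_quality_stats AS
    SELECT location,
           parameter,
           date_trunc('day', datetime) AS measurement_date,
           avg(value) AS average_value
    FROM presentation.air_quality
    GROUP BY ALL
    """
)
```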

Running the Project

Follow these steps to set up and run the project:

  1. Set Up Python Environment:

    • Create a virtual environment:
      $ python -m venv .venv
    • Activate the environment:
      • Windows: $ .venv\Scripts\activate (Command Prompt/PowerShell) or $ . .venv/Scripts/activate (Git Bash)
      • Linux/Mac: $ . .venv/bin/activate
    • Install dependencies:
      $ pip install -r requirements.txt
  2. Initialize the Database:

    • Navigate to the pipeline directory:
      $ cd pipeline
    • Run the database manager CLI to create the database (a minimal sketch of such a CLI appears after these steps):
      $ python database_manager.py --create
  3. Extract Data:

    • Run the extraction CLI:
      $ python extraction.py [required arguments]
  4. Transform Data:

    • Run the transformation CLI to create views in the presentation schema:
      $ python transformation.py
  5. Set Up the Dashboard:

    • Navigate to the dashboard directory:
      $ cd dashboard
    • Start the dashboard application:
      $ python app.py
  6. Access the Results:

    • The database will be stored as a .db file.
    • The dashboard will be accessible in your web browser.
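
As a rough illustration of what the database manager CLI from step 2 might look like, here is a minimal sketch (not the project's actual pipeline/database_manager.py) that creates the DuckDB file and the two schemas described above; the flag names and the database file name are assumptions.

```python
# Minimal sketch of a "database manager" style CLI using argparse + duckdb.
# Flag names and the database file name are assumptions.
import argparse
import duckdb


def create_database(path: str) -> None:
    """Create the DuckDB file (if missing) along with its schemas."""
    con = duckdb.connect(path)  # creates the file when it does not exist
    con.execute("CREATE SCHEMA IF NOT EXISTS raw")
    con.execute("CREATE SCHEMA IF NOT EXISTS presentation")
    con.close()


def main() -> None:
    parser = argparse.ArgumentParser(description="Manage the air quality database")
    parser.add_argument("--create", action="store_true", help="create the database and schemas")
    parser.add_argument("--database", default="air_quality.db", help="path to the DuckDB file")
    args = parser.parse_args()

    if args.create:
        create_database(args.database)
        print(f"Created {args.database}")


if __name__ == "__main__":
    main()
```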

Additional Notes

  • Ensure Python 3.8+ is installed.
  • Copy secrets-example.json, replace the placeholder values (e.g., API keys) with your actual OpenAQ credentials, and keep the copy out of version control.
  • Regularly update dependencies by running pip install --upgrade -r requirements.txt.

This project is designed to give you practical experience in building and managing a data pipeline. Happy learning!