Authors: Sid Ahuja, Zackarya Hamza, Alexander Dawson
Demo of a data analysis project for DSCI 310 (Reproducible & Trustworthy workflows), a course in the Data Science faculty.
In this project, we build a prediction model using the k-nearest neighbours algorithm that attempts to categorize the quality of a wine based on its physicochemical properties. We classify wine quality into a binary category: whether it is good or bad. Our classifier performed moderately well on the test set, but further research must be done to improve the model before it is put into production.
The dataset that we used for this project covers white variants of the Portuguese "Vinho Verde" wine and was assembled by Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. The dataset was sourced from the UCI Machine Learning Repository (Dua and Graff 2017), located here. Each row in this dataset represents an observation of a white wine, specifically its physicochemical and sensory attributes.
Docker is a container solution used to manage the software dependencies for this project. The Docker image used for this project is based on the quay.io/jupyter/r-notebook:2024-03-14 image. Additional dependencies are specified in the Dockerfile.
Use the steps below to reproduce this analysis.
- Install and launch Docker on your computer.
- Clone this GitHub repository.
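Before continuing, it can help to confirm that the tools the steps above rely on are available. The following is a minimal sketch, not part of the project itself: `check_tool` is a hypothetical helper name, and the script only checks that `docker` and `git` are on the PATH, not that the Docker daemon is running.

```shell
# Preflight sketch: report whether a command-line tool is on the PATH.
# check_tool is a hypothetical helper, not something shipped with this project.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the two tools the steps above rely on; "|| true" keeps the script
# going so that both results are printed even if one tool is absent.
check_tool docker || true
check_tool git || true
```

If either tool is reported as missing, install it before moving on to the commands below.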
Navigate to the root of this project on your computer using the command line and enter the following command to reset the project to a clean state (i.e., remove all files generated by previous runs of the analysis):
docker-compose run --rm analysis-env make clean
To run the analysis in its entirety, enter the following command in the terminal in the project root:
docker-compose run --rm analysis-env make all
To work with the project and container in JupyterLab, use the terminal to navigate to the root of this project and enter:
docker compose up
Look in the terminal for a URL that starts with http://127.0.0.1:8888/lab?token= and copy/paste it into a browser. The JupyterLab IDE will load. Do not close the terminal while JupyterLab is in use, otherwise you will lose your current session.
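If the URL is hard to spot in the startup output, a small filter can pull it out. This is a sketch under assumptions: `extract_jupyter_url` is a hypothetical helper name, the token is assumed to consist of letters and digits only, and the URL prefix is the one quoted above.

```shell
# Print the first JupyterLab URL found on standard input.
# extract_jupyter_url is a hypothetical helper, not part of this project.
# In grep's basic regular expressions the "?" is matched literally.
extract_jupyter_url() {
  grep -o 'http://127\.0\.0\.1:8888/lab?token=[[:alnum:]]*' | head -n 1
}

# Example with a made-up log line and a made-up token:
printf '%s\n' 'analysis-env-1  |  http://127.0.0.1:8888/lab?token=abc123' \
  | extract_jupyter_url
```

With the container still running, the same function could be fed from a second terminal in the project root with `docker compose logs 2>&1 | extract_jupyter_url`.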
To clean up, press Ctrl + C in the terminal, then enter docker compose rm in the terminal to remove the container.
To work with the project using just the terminal, navigate to the root of this project and enter:
docker compose run --rm analysis-env bash
To exit the container and clean up, enter exit in the terminal.
To work in VS Code, open VS Code and launch the terminal from there. Navigate to the root of this project and enter:
docker compose run --rm analysis-env bash
To exit the container and clean up, enter exit in the terminal.
The final report can be found here.
Our report is licensed under the MIT License. See LICENSE for additional information.
Cortez, Paulo, Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009). Wine Quality. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T.
CVRVV. (2024). Vinho Verde. https://www.vinhoverde.pt/en/homepage
Tiffany Timbers, Trevor Campbell. “Data Science.” Data Science, 23 Dec. 2023, datasciencebook.ca/.
Chester Ismay and Albert Y. Kim. Foreword by Kelly S. McConville. “Statistical Inference via Data Science.” Statistical Inference via Data Science, 13 Feb. 2024, moderndive.com/index.html.