This is the repository for the DSCI 100 Research Project; view the report here (or here if the GitHub preview does not work).
The most common way of linearizing an image into a single-row feature vector is the zig-zag linearization, which simply concatenates the image matrix row by row. This method, however, loses proximity information: pixels that are originally close together, forming local features, will no longer be in each other's vicinity after linearization.

In practice, the loss of these local features may be only one of the reasons why K-NN classification falls short. The K-NN algorithm combines a distance function, which treats features (pixels) as independent values, with the majority principle to determine the class of unseen data; it makes no attempt to learn the underlying patterns among pixels and disregards the structure of the data (e.g. how pixels are arranged and collectively form macroscopic features). This makes the algorithm sensitive to noise: even small perturbations, imperceptible to human vision, can greatly change how it classifies an image, and two images of the same object that differ in almost every pixel (e.g. under different lighting or from different perspectives) become very difficult for the model to assign to the same class. We have already seen in practice that the performance of a K-NN classifier on the small MNIST dataset still has a long way to go before it becomes practicable.

Inspired by research on incorporating local-feature preservation into K-NN models (Amato & Falchi, 2010), we reckon that, without compromising the intuitiveness of the K-NN algorithm, choosing a linearization method that preserves local features could improve the model's performance on a Greek alphabet classification task. Among the space-filling curves, the Hilbert curve is chosen for its phenomenal locality-preserving capability (Moon et al., 2001): most, if not all, pairs of pixels that are close on the image will remain close after the image matrix is straightened into a one-dimensional feature vector along the curve.
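As a concrete illustration of the two linearization schemes, below is a minimal Python sketch (not the project's actual implementation) that flattens a square image both row by row and along a Hilbert curve, using the standard distance-to-coordinate conversion for the curve; the 16x16 size and the helper names are assumptions for illustration only.

```python
import numpy as np

def hilbert_d2xy(n, d):
    """Map a distance d along the Hilbert curve to (x, y) on an n x n grid.
    n must be a power of two; this is the classic iterative conversion."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def linearize_zigzag(img):
    """Classic row-by-row concatenation of the image matrix."""
    return img.flatten()

def linearize_hilbert(img):
    """Flatten img by visiting pixels in Hilbert-curve order
    (a 4th-order curve covers a 16x16 grid)."""
    n = img.shape[0]
    order = [hilbert_d2xy(n, d) for d in range(n * n)]
    return np.array([img[y, x] for x, y in order])

# Toy demonstration on a random 16x16 "image"
rng = np.random.default_rng(2023)
img = rng.integers(0, 256, size=(16, 16))
print(linearize_zigzag(img).shape, linearize_hilbert(img).shape)  # (256,) (256,)
```

Under the Hilbert ordering, pixels that are adjacent on the image tend to receive nearby indices in the flattened vector, which is exactly the locality property the zig-zag ordering sacrifices at row boundaries.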
Whether linearizing images with the Hilbert curve would improve the performance of a K-NN model on a Greek alphabet classification task.
This is a 24-class balanced handwritten Greek letters dataset. It consists of 24 classes (24 Greek letters) with 240 training images (10 per class) and 96 testing images (4 per class). The images are greyscale and clearly legible: black pen strokes on white backgrounds. The low-resolution images (14x14) will be used for the classification task. The last column of each of the two csv files holds the ground-truth labels. Since mapping with a Hilbert curve requires square images whose side length is a power of two (a 4th-order Hilbert curve covers a 16x16 grid), the 14x14 images will need to be brought up to 16x16 before the Hilbert mapping can be applied.
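A minimal sketch of how the csv files might be read and the 14x14 images brought up to 16x16; the file name, the one-pixel padding on each side, and the use of a corner pixel as the background value are assumptions for illustration, not the project's confirmed preprocessing.

```python
import numpy as np
import pandas as pd

# Hypothetical file name; the actual training/testing csv paths may differ.
train = pd.read_csv("train_14x14.csv")

X = train.iloc[:, :-1].to_numpy()   # pixel columns
y = train.iloc[:, -1].to_numpy()    # last column holds the ground-truth labels

def pad_to_16(flat_pixels):
    """Reshape a 14x14 image and pad it to 16x16 (the size covered by a
    4th-order Hilbert curve), using a corner pixel as the background value."""
    img = flat_pixels.reshape(14, 14)
    background = img[0, 0]
    return np.pad(img, pad_width=1, mode="constant", constant_values=background)

images_16 = np.stack([pad_to_16(row) for row in X])
print(images_16.shape)   # (240, 16, 16) for the training set
```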
We would like to see if training K-NN classification models with images linearized with the Hilbert curve would significantly and consistently improve the model's performance across all values of $K$ considered.
- Regular set (control): all images within the set are regularly linearized using the classic zig-zag mapping.
- Hilbert set (experimental): all images within the set are mapped and linearized using a 4th-order Hilbert curve.
To avoid redundancy, the two conditions/models will hereinafter be referred to as the Baseline Model/Condition (regular linearization) and the HC Model/Condition (Hilbert-curve linearization), respectively.
All feature vectors (images) in both data sets will be compressed such that every pair of adjacent features is averaged to form a new feature vector whose length is one less than that of the original, embedding the ordering information within the data.
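A minimal sketch of this adjacent-feature averaging; the function name is illustrative.

```python
import numpy as np

def compress_adjacent(v):
    """Average every pair of adjacent features: the i-th output is the mean of
    features i and i+1, so a length-256 vector becomes length 255. Because
    neighbours are blended, the result depends on the ordering of the features,
    which is how the linearization order is embedded into the data."""
    v = np.asarray(v, dtype=float)
    return (v[:-1] + v[1:]) / 2.0

vec = np.array([0, 10, 20, 30])
print(compress_adjacent(vec))   # [ 5. 15. 25.]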
With stratified 5-fold cross-validation, we will use these two data sets to train and validate two separate instances of the K-NN classification model respectively, all the while resampling with varying values of $K$.
All random processes, including but not limited to data augmentation and the cross-validation split, will be made deterministic and reproducible by setting an arbitrarily chosen global seed (2023) to eliminate variability in test results due to randomness and to ensure that we are always taking the same sample from the population.
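A sketch of how the stratified 5-fold cross-validation over varying $K$ could look with scikit-learn, with the global seed fixed at 2023; the grid of $K$ values and the variable names are illustrative assumptions rather than the project's exact configuration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

SEED = 2023                      # arbitrarily chosen global seed
np.random.seed(SEED)

def cv_accuracies(X, y, k_values, seed=SEED):
    """Stratified 5-fold cross-validation accuracies of K-NN for each K.
    Returns a dict mapping K to the five per-fold validation accuracies."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    results = {}
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        results[k] = cross_val_score(knn, X, y, cv=skf, scoring="accuracy")
    return results

# Illustrative usage, run once per condition (Baseline and HC):
# baseline_scores = cv_accuracies(X_baseline, y, k_values=range(1, 11))
# hc_scores       = cv_accuracies(X_hilbert,  y, k_values=range(1, 11))
```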
Some of these details are elaborated further below.
Parameter of Interest: the population mean difference in validation accuracies (denoted by $\mu_d$) between the HC Condition and the Baseline Condition.
Null Hypothesis:
There is no difference in the population mean validation accuracy between the Baseline Model and the HC Model, i.e. the population mean difference is zero.
Alternative Hypothesis:
The model's validation accuracy is higher under the HC Condition than under the Baseline Condition, i.e. the population mean difference is positive.
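Writing $\mu_d$ for the parameter of interest defined above (the population mean of the paired accuracy differences, HC minus Baseline), the hypotheses can be stated symbolically as

$$H_0\colon \mu_d = 0 \qquad \text{vs.} \qquad H_A\colon \mu_d > 0.$$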
The final accuracies will be recorded in pairs in a 50x4 table as follows:
Assuming that observations are randomly drawn from the population, dependent within each pair but independent from pair to pair, and that the sample size ($n = 50$ paired accuracies) is large enough for the sampling distribution of the mean difference to be approximately normal, a one-sided paired t-test will be conducted on the accuracy differences.
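Under these assumptions, a minimal sketch of the one-sided paired test using SciPy (the `alternative="greater"` option requires SciPy 1.6 or later; the significance level and variable names are placeholders):

```python
import numpy as np
from scipy.stats import ttest_rel

def paired_one_sided_test(hc_acc, baseline_acc, alpha=0.05):
    """One-sided paired t-test on the per-run accuracy differences
    (HC minus Baseline). Rejecting H0 supports the alternative that the
    HC Condition improves mean validation accuracy."""
    hc_acc = np.asarray(hc_acc)
    baseline_acc = np.asarray(baseline_acc)
    t_stat, p_value = ttest_rel(hc_acc, baseline_acc, alternative="greater")
    return t_stat, p_value, p_value < alpha

# Toy illustration with made-up numbers (not project results):
# t, p, reject = paired_one_sided_test([0.61, 0.58, 0.64], [0.59, 0.57, 0.60])
```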
- An accuracy vs. $K$ line plot with both conditions on the same axes, to show how accuracy changes with $K$ in each condition and to compare the performance of the two models at each $K$.
- A histogram with an overlaid density line (kernel density estimation) showing the distribution of the accuracy differences (both plots are sketched below).
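A minimal plotting sketch with matplotlib/seaborn, assuming the per-$K$ mean accuracies and the paired accuracy differences have already been computed; the axis labels, figure size, and variable names are illustrative.

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_results(k_values, baseline_mean_acc, hc_mean_acc, acc_diffs):
    """Left: mean validation accuracy vs. K for both conditions.
    Right: histogram + KDE of the paired accuracy differences (HC - Baseline)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

    ax1.plot(k_values, baseline_mean_acc, marker="o", label="Baseline (zig-zag)")
    ax1.plot(k_values, hc_mean_acc, marker="o", label="HC (Hilbert curve)")
    ax1.set_xlabel("K (number of neighbours)")
    ax1.set_ylabel("Mean validation accuracy")
    ax1.legend()

    sns.histplot(acc_diffs, kde=True, ax=ax2)
    ax2.set_xlabel("Accuracy difference (HC - Baseline)")

    plt.tight_layout()
    plt.show()
```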