Image Data Coreset Selector

Uses ideas from the Zero-Shot Coreset Selection paper (November 2024).

Most real-world applications of deep learning in computer vision generate a huge amount of image data. To improve model performance, some of this raw (unlabeled) data needs to be selected, labeled, and then used to re-train the model.

However, when there are millions of raw images to choose from, which ones do you pick to minimize labeling cost and time?

The high-level idea is to use existing foundation models to create an embedding for every image. These embeddings are then used to perform Monte Carlo-like sampling, so that the selected subset, also called a coreset, covers the embedding space well and evenly.
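To make the idea of covering the embedding space more concrete, here is a minimal sketch using randomized farthest-point selection. This is a swapped-in, generic coverage technique and not necessarily what this repository or the paper implements. The function name `select_coreset`, its parameters, and the synthetic embeddings are all assumptions made for illustration.

```python
import numpy as np


def select_coreset(embeddings, k, n_candidates=64, seed=0):
    """Return k distinct row indices that spread out over the embedding space.

    Hypothetical helper: each round draws a random batch of unselected points
    and keeps the one farthest from everything selected so far.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of embeddings")

    # Start from one randomly chosen point.
    first = int(rng.integers(n))
    selected = [first]

    # Distance from every point to its nearest selected point.
    # Selected points are marked with -inf so they are never drawn again.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    min_dist[first] = -np.inf

    while len(selected) < k:
        remaining = np.flatnonzero(min_dist > -np.inf)
        size = min(n_candidates, remaining.size)
        candidates = rng.choice(remaining, size=size, replace=False)

        # Keep the candidate that is farthest from the current selection.
        best = int(candidates[np.argmax(min_dist[candidates])])
        selected.append(best)

        new_dist = np.linalg.norm(embeddings - embeddings[best], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
        min_dist[best] = -np.inf

    return np.array(selected)


if __name__ == "__main__":
    # Synthetic stand-in for foundation-model embeddings (assumption).
    fake_embeddings = np.random.default_rng(1).normal(size=(1000, 32))
    print(select_coreset(fake_embeddings, k=20))
```

Sampling a random candidate batch per round keeps each step cheap on large datasets, at the cost of a less exact spread than checking every remaining point.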

Usage

Put the raw image data into a folder (it can contain sub-folders)
Run the main script:

python main.py --image_dir <path/to/your/image/dir>
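Because the image folder can contain sub-folders, the images have to be collected recursively. A rough sketch of how that could be done with the standard library follows. The function name `find_images` and the list of extensions are assumptions for illustration, not the actual behavior of `main.py`.

```python
from pathlib import Path

# Assumed set of image extensions; the real script may accept others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def find_images(image_dir):
    """Return a sorted list of image paths under image_dir, including sub-folders."""
    root = Path(image_dir)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Sorting the paths keeps the order stable between runs, which helps when the selected indices need to be mapped back to files.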

Install

python3 -m venv env
source env/bin/activate
pip install -e .

Future Extensions

OpenAI integration to use e.g. CLIP
Visualization of embedding and images
