Image Data Coreset Selector

Uses ideas from the Zero-Shot Coreset Selection paper (November 2024).

Most real-world applications of deep learning in computer vision generate a huge amount of image data. To improve model performance, some of this raw (unlabeled) data has to be selected, labeled, and then used to re-train the model.

However, when there are millions of raw images to choose from, which ones do you pick to minimize labeling cost and time?

The high-level idea is to use existing foundation models to create an embedding for every image. These embeddings are then used to perform Monte Carlo-like sampling, so that the selected subset, also called a coreset, covers the embedding space well and evenly.
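To make the idea of covering the embedding space concrete, here is a minimal sketch. It is not the code from this repository and not necessarily the procedure from the paper: it swaps in randomized farthest-point sampling as a stand-in for the Monte Carlo-like step, and the function name `select_coreset` and its parameters are hypothetical. The embeddings are assumed to be a NumPy array of shape (number of images, embedding dimension) that has already been computed by some foundation model.

```python
import numpy as np


def select_coreset(embeddings, k, n_candidates=64, seed=0):
    """Return k distinct row indices that spread out over the embedding space.

    Illustrative assumption: randomized farthest-point sampling, not
    necessarily the sampling scheme used by this project.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of embeddings")

    # Start from one randomly chosen image.
    first = int(rng.integers(n))
    selected = [first]

    # Distance from every embedding to its nearest selected embedding.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    min_dist[first] = -np.inf  # mark as taken so it is never picked again

    while len(selected) < k:
        # Draw a random batch of candidates and keep the one that is
        # currently farthest from everything selected so far.
        candidates = rng.choice(n, size=min(n_candidates, n), replace=False)
        best = int(candidates[np.argmax(min_dist[candidates])])
        if min_dist[best] == -np.inf:
            # Every candidate was already taken: fall back to a global search.
            best = int(np.argmax(min_dist))

        selected.append(best)
        new_dist = np.linalg.norm(embeddings - embeddings[best], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
        min_dist[best] = -np.inf

    return selected


if __name__ == "__main__":
    # Random vectors stand in for real image embeddings.
    fake_embeddings = np.random.default_rng(42).normal(size=(1000, 16))
    print(select_coreset(fake_embeddings, k=10))
```

Picking the worst-covered candidate at each step pushes the selection towards regions of the embedding space that are not yet represented, which is the coverage property described above. The random candidate batch keeps each step cheap when there are millions of images.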

Usage

Put the raw image data into a folder (it can contain sub-folders)
Run the main script, passing that folder as --image_dir

python main.py --image_dir <path/to/your/image/dir>
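Since the image folder can contain sub-folders, the images have to be collected recursively. The following is a rough sketch of how that could look, not the actual logic in main.py. The helper name `find_images` and the list of file extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; the project may accept others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def find_images(image_dir):
    """Return all image files below image_dir, including sub-folders."""
    root = Path(image_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{image_dir} is not a directory")
    # rglob("*") walks the whole tree; the suffix check is case-insensitive.
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Calling `find_images("<path/to/your/image/dir>")` would then return a sorted list of `Path` objects that can be passed on to the embedding step.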

Install

python3 -m venv env
source env/bin/activate
pip install -e .

Future Extensions

OpenAI integration, e.g. to use CLIP
Visualization of embeddings and images

jonasdieker/image-data-selector
