Understanding how training data influences model predictions ("data attribution") is an active area of machine learning research. In this tutorial, we will introduce a data attribution method (datamodels: https://gradientscience.org/datamodels-1/) and explore how it can be applied in the life sciences to identify meaningful subgroups in biomedical datasets, such as disease subtypes. We will begin with a simple example from image classification (CIFAR10), offering a step-by-step guide to demonstrate how the data attribution method works in practice. Since the approach involves training thousands of lightweight classifiers, we will focus on strategies for fast and efficient model training. Next, we will explore its applications in biomedical science, with a focus on single-cell and genetic datasets, highlighting the biological insights gained from applying this computational approach. The tutorial will conclude with an interactive, hands-on session using Google Colab, where participants can apply the techniques themselves and explore the approach further. This session is designed to be accessible to participants of all coding and machine learning experience levels—whether you're new to machine learning or curious about its intersection with biomedical applications.
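To make the pipeline concrete before diving in, here is a self-contained toy sketch of the datamodels idea: sample many random training subsets of fraction alpha, record each resulting model's output on a target example, and regress that output on subset membership. Everything below is synthetic; the "margins" come from a made-up linear ground truth, and plain least squares stands in for the sparse l1-regularized regression used in practice:

```python
import numpy as np

# Toy datamodels sketch: m models, each "trained" on a random
# alpha-fraction of N training points; regress the recorded output
# on subset membership to estimate per-example influence.
rng = np.random.default_rng(0)
N, m, alpha = 100, 5_000, 0.5

# masks[i, j] = 1 if training point j is in the subset for model i
masks = (rng.random((m, N)) < alpha).astype(np.float32)

# Synthetic stand-in for "train a model, record the target's margin":
# only a handful of training points truly influence the output.
true_w = np.zeros(N)
true_w[:5] = [2.0, -1.5, 1.0, -0.5, 0.25]
margins = masks @ true_w + 0.1 * rng.standard_normal(m)

# Fit the datamodel (least squares here; sparse l1 regression in practice)
w, *_ = np.linalg.lstsq(masks, margins, rcond=None)
print("Most influential training points:", np.argsort(-np.abs(w))[:5])
```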
This repository contains code to reproduce the example experiment given in the tutorial (i.e., `datamodels.pt`).
For the purposes of this tutorial, this repository adapts code by the Madry lab (here) and relies on the theory presented in Ilyas et al. (here).
```bash
conda env create -f environment.yml --name ffcv
conda activate ffcv
pip install tqdm ffcv pyyaml fastargs ray torchvision fast_l1 notebook matplotlib
pip install "ray[tune]"
# install fast_l1 from https://github.com/MadryLab/fast_l1
# optionally install ipykernel to use the notebook interface
conda install ipykernel
python -m ipykernel install --user --name=ffcv
```
```bash
conda activate ffcv
python write_datasets.py --data.train_dataset ./CIFAR10/cifar10_train_subset_binaryLabels.beton \
    --data.val_dataset ./CIFAR10/cifar10_val_subset_binaryLabels.beton \
    --data.binary_labels True \
    --data.subset_indices 25000  # subset the training set to 25k samples
```
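For orientation, the writing step boils down to something like the following sketch using FFCV's `DatasetWriter`. The wrapper class, the field names, and the binarization rule (animals vs. vehicles) are assumptions made for illustration; `write_datasets.py` is the authoritative version:

```python
import torchvision
from ffcv.writer import DatasetWriter
from ffcv.fields import IntField, RGBImageField

class BinaryCIFAR:
    """CIFAR-10 wrapper: keep the first `n` samples and binarize labels
    (hypothetical rule: animal classes -> 1, vehicle classes -> 0)."""
    ANIMALS = {2, 3, 4, 5, 6, 7}  # bird, cat, deer, dog, frog, horse

    def __init__(self, train=True, n=25_000):
        self.ds = torchvision.datasets.CIFAR10('./CIFAR10', train=train, download=True)
        self.n = min(n, len(self.ds))

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        img, label = self.ds[i]
        return img, int(label in self.ANIMALS)

writer = DatasetWriter('./CIFAR10/cifar10_train_subset_binaryLabels.beton',
                       {'image': RGBImageField(), 'label': IntField()})
writer.from_indexed_dataset(BinaryCIFAR(train=True))
```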
The following steps are optional: if you plan on using the same dataset and alpha as in the example, skip ahead to step 3.
### Inspect the Dataloader (Optional)

Before starting training, you can inspect the dataloader by running the notebook `inspect_dataloader.ipynb`.
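The notebook walks through this, but as a rough sketch, loading the `.beton` with FFCV looks like the following; the field names (`image`, `label`) and decoder choices are assumptions that must match how the dataset was written:

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import SimpleRGBImageDecoder, IntDecoder
from ffcv.transforms import ToTensor

loader = Loader('./CIFAR10/cifar10_train_subset_binaryLabels.beton',
                batch_size=512,
                num_workers=4,
                order=OrderOption.RANDOM,
                pipelines={'image': [SimpleRGBImageDecoder(), ToTensor()],
                           'label': [IntDecoder(), ToTensor()]})

# Pull one batch and eyeball the shapes and the binary label balance
images, labels = next(iter(loader))
print(images.shape, labels.shape, labels.float().mean())
```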
### Verify Training

Ensure that model training is functioning correctly by running the training notebook `train_a_good_model.ipynb`.
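If you just want a quick smoke test outside the notebook, a sketch like the one below (a hypothetical linear probe, not the notebook's actual model) should clearly beat chance (50%) on the binary task within a few epochs; it assumes the FFCV `loader` from the previous sketch:

```python
import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(3 * 32 * 32, 2).to(device)  # linear probe on raw pixels
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    correct = total = 0
    for images, labels in loader:
        x = images.to(device).float().flatten(1) / 255.0
        y = labels.to(device).flatten().long()
        logits = model(x)
        loss = loss_fn(logits, y)
        opt.zero_grad(); loss.backward(); opt.step()
        correct += (logits.argmax(1) == y).sum().item()
        total += len(y)
    print(f"epoch {epoch}: train acc {correct / total:.3f}")
```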
### Parameter Tuning for Alpha

To fine-tune your model parameters for a specific alpha value, use the notebook `train_a_better_model.ipynb`.
Make sure you have the wandb library installed (`pip install wandb`).
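A minimal, hypothetical logging loop for a small sweep at a fixed alpha might look like this; the project name and `train_one_model` (assumed to yield per-epoch accuracies) are placeholders for whatever the notebook defines:

```python
import wandb

alpha = 0.5  # subsampling fraction under study
for lr in [0.5, 0.1, 0.05, 0.01]:
    run = wandb.init(project="datamodels-tutorial",
                     config={"alpha": alpha, "lr": lr})
    # train_one_model is a placeholder for the notebook's training routine
    for epoch, (train_acc, val_acc) in enumerate(train_one_model(alpha=alpha, lr=lr)):
        wandb.log({"epoch": epoch, "train_acc": train_acc, "val_acc": val_acc})
    run.finish()
```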
```bash
conda activate ffcv
sbatch launch_headnode.sh
# once the headnode is running, copy its address from the launch_headnode
# job's .out file into train_cifar_with_ray.sh, then submit:
sbatch train_cifar_with_ray.sh
```
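To confirm from Python that a job can actually reach the cluster, you can initialize Ray against the headnode address (the value below is a placeholder; use the one from the `.out` file) and check what resources it sees:

```python
import ray

# Placeholder address: copy the real host:port printed in the
# launch_headnode job's .out file.
ray.init(address="10.0.0.1:6379")
print(ray.cluster_resources())  # should list the cluster's CPUs/GPUs
```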
```bash
conda activate ffcv
sbatch train_datamodels.sh
```
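This script fits the datamodels themselves. Conceptually, the regression stage reduces to the sketch below; the `.npy` file names are hypothetical, and scikit-learn's `Lasso` is a slow stand-in for the GPU-based `fast_l1` solver the repo actually uses:

```python
import numpy as np
from sklearn.linear_model import Lasso

# For each trained model: its training-subset mask and the margins it
# produced on the target examples (file names are hypothetical).
masks = np.load('masks.npy')      # (num_models, num_train), 0/1 entries
margins = np.load('margins.npy')  # (num_models, num_targets)

# One sparse linear datamodel per target example
weights = np.stack([Lasso(alpha=0.01).fit(masks, margins[:, t]).coef_
                    for t in range(margins.shape[1])])
np.save('datamodel_weights.npy', weights)  # (num_targets, num_train)
```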
- tqdm: 4.66.5
- ffcv: 1.0.2
- pyyaml: 6.0.2
- fastargs: 1.2.0
- ray: 2.37.0
- torchvision: 0.19.0+cu118
- GPU: NVIDIA Tesla V100-PCIE-32GB
- Memory: 32 GB
- CUDA capability: a CUDA-capable GPU is required for the GPU-accelerated tasks
- NVIDIA Driver Version: 535.183.01
- CUDA Version: 12.2
- Update the sbatch files according to your resource availability
- You can adjust the number of simultaneous Ray trials by modifying the `cpus_per_trial` and `gpus_per_trial` parameters in the `config` file
- Profile your GPU and CPU usage by running `nvidia-smi -l` or `htop`, respectively, on your compute node