ProtComposer: Compositional Protein Structure Generation with 3D Ellipsoids

Oral at ICLR 2025

Hannes Stark* · Bowen Jing* · Tomas Geffner · Jason Yim · Tommi Jaakkola · Arash Vahdat · Karsten Kreis

*equal contribution

Paper

Environment

We recommend installing Miniconda (https://docs.anaconda.com/miniconda/install/) and creating the following conda environment:

conda create -n nv python=3.9
conda activate nv
pip install jupyterlab
pip install numpy==1.21.2 pandas==1.5.3
pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install biopython==1.79 dm-tree==0.1.6 modelcif==0.7 ml-collections==0.1.0 scipy==1.7.1 absl-py einops
pip install pytorch_lightning==2.0.4 fair-esm
pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@5484c38'
pip install matplotlib==3.7.2
pip install pydssp biotite omegaconf wandb
pip install numpy==1.21.2
# Re-pinning numpy makes pip report a dependency error for contourpy; this can be ignored.
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.12.1+cu113
pip3 install -U scikit-learn
pip install gpustat
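Because several of the later install steps can pull in different versions of already pinned packages, it may be worth confirming which versions ended up in the environment. The following is a minimal sketch and not part of the repository: the helper name `check_pins` is hypothetical, and the pins are a subset copied from the commands above.

```python
from importlib import metadata

# Pins copied from the install commands above (subset; extend as needed).
PINS = {
    "numpy": "1.21.2",
    "pandas": "1.5.3",
    "torch": "1.12.1+cu113",
    "pytorch-lightning": "2.0.4",
    "matplotlib": "3.7.2",
}


def check_pins(pins):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every listed package matches its pin.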

Pretrained checkpoints

We provide two pretrained checkpoints in the directory model_weights. One is trained on data from the Protein Data Bank, the other on data from the AlphaFold Database.

Description	Checkpoint path
AlphaFold Database training data	`model_weights/trained_on_afdb.ckpt`
Protein Data Bank training data	`model_weights/trained_on_pdb.ckpt`

Sampling

To sample from ProtComposer, conditioned on ellipsoids drawn from our ellipsoid statistical model:

python sample.py --guidance 1.0 --num_prots 6 --nu 5 --sigma 6 --helix_frac 0.4 --seed 1 --outdir results --num_blobs 9 --ckpt "model_weights/trained_on_pdb.ckpt"
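To repeat the sampling command above for several seeds, a small wrapper can assemble the arguments. This is a sketch under assumptions: the function names `build_sample_command` and `run_seeds` are hypothetical, only flags from the command above are used, and whether sample.py creates nested output directories such as results/seed_1 by itself has not been verified.

```python
import subprocess
import sys


def build_sample_command(seed, outdir, ckpt="model_weights/trained_on_pdb.ckpt"):
    """Assemble the sample.py invocation shown above for one seed."""
    return [
        sys.executable, "sample.py",
        "--guidance", "1.0",
        "--num_prots", "6",
        "--nu", "5",
        "--sigma", "6",
        "--helix_frac", "0.4",
        "--seed", str(seed),
        "--outdir", outdir,
        "--num_blobs", "9",
        "--ckpt", ckpt,
    ]


def run_seeds(seeds, dry_run=True):
    """Print (dry_run=True) or execute one sampling job per seed."""
    commands = [build_sample_command(s, f"results/seed_{s}") for s in seeds]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if sample.py fails
    return commands


if __name__ == "__main__":
    run_seeds([1, 2, 3])
```

With `dry_run=True` the commands are only printed, which makes it easy to inspect them before starting GPU jobs.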

Evaluation

We use these scripts to compute metrics for the outputs (evaluate_alignment computes the ellipsoid adherence metrics):

python -m scripts.evaluate_designability --dir results
python -m scripts.evaluate_alignment --dir results
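When more than one output directory needs to be evaluated, both scripts can be run over each directory in turn. The sketch below is illustrative only: `evaluation_commands` and `evaluate` are hypothetical helper names, and the module names and the --dir flag are taken from the commands above.

```python
import subprocess
import sys

# Module names taken from the evaluation commands above.
EVAL_MODULES = ["scripts.evaluate_designability", "scripts.evaluate_alignment"]


def evaluation_commands(result_dir):
    """Return one `python -m <module> --dir <result_dir>` command per script."""
    return [
        [sys.executable, "-m", module, "--dir", result_dir]
        for module in EVAL_MODULES
    ]


def evaluate(result_dirs, dry_run=True):
    """Print (dry_run=True) or run both evaluation scripts for each directory."""
    for result_dir in result_dirs:
        for cmd in evaluation_commands(result_dir):
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.run(cmd, check=True)
```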

Training

Data preparations

We use the data from MultiFlow: https://github.com/jasonkyuyim/multiflow

The MultiFlow authors host the datasets on Zenodo (see the link in the MultiFlow repository). Download the following files and place them in the directory data:

real_train_set.tar.gz (2.5 GB)
synthetic_train_set.tar.gz (220 MB)
test_set.tar.gz (347 MB)

Next, untar the files from inside the data directory:

# Uncompress training data
mkdir train_set
tar -xzvf real_train_set.tar.gz -C train_set/
tar -xzvf synthetic_train_set.tar.gz -C train_set/

# Uncompress test data
mkdir test_set
tar -xzvf test_set.tar.gz -C test_set/
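The same extraction can be done from Python, which is convenient on systems without a tar binary. This is a minimal sketch, assuming the three archives sit directly inside the data directory; the helper name `extract_all` is hypothetical. As with tar itself, only extract archives from a source you trust.

```python
import tarfile
from pathlib import Path

# Archive name -> target directory, mirroring the tar commands above.
ARCHIVES = {
    "real_train_set.tar.gz": "train_set",
    "synthetic_train_set.tar.gz": "train_set",
    "test_set.tar.gz": "test_set",
}


def extract_all(data_dir="data", archives=ARCHIVES):
    """Extract every archive found in data_dir; return the names that were missing."""
    data_dir = Path(data_dir)
    missing = []
    for archive_name, target in archives.items():
        archive_path = data_dir / archive_name
        if not archive_path.is_file():
            missing.append(archive_name)
            continue
        target_dir = data_dir / target
        target_dir.mkdir(parents=True, exist_ok=True)  # same role as `mkdir` above
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(target_dir)
    return missing
```

Calling `extract_all("data")` returns an empty list once all three archives have been downloaded and unpacked.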

After extraction, the directory data should have the following file structure (the --pkl_dir argument is data by default):

data
├── train_set
│   ├── processed_pdb
│   │   ├── <subdir>
│   │   │   └── <protein_id>.pkl
│   ├── processed_synthetic
│   │   └── <protein_id>.pkl
├── test_set
│   └── processed
│       ├── <subdir>
│       │   └── <protein_id>.pkl
...
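A quick way to confirm that the layout matches the tree above is to count the .pkl files under each expected sub-directory. The following is a sketch only: the helper name `count_pickles` is hypothetical, and the directory names are taken from the tree above.

```python
from pathlib import Path

# Sub-directories expected under the data root, per the tree above.
EXPECTED_DIRS = [
    "train_set/processed_pdb",
    "train_set/processed_synthetic",
    "test_set/processed",
]


def count_pickles(pkl_dir="data"):
    """Map each expected sub-directory to its number of .pkl files (recursive).

    A missing directory is reported as -1 so it can be told apart from an
    empty one.
    """
    root = Path(pkl_dir)
    counts = {}
    for rel in EXPECTED_DIRS:
        directory = root / rel
        if directory.is_dir():
            counts[rel] = sum(1 for _ in directory.rglob("*.pkl"))
        else:
            counts[rel] = -1
    return counts


if __name__ == "__main__":
    for rel, n in count_pickles().items():
        print(f"{rel}: {'missing' if n < 0 else n}")
```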

Launch training run

By default, we train on 8 GPUs.

python train.py --batch_size 8 --designability --designability_freq 5 --num_designability_prots 50 --accumulate_grad 8 --inf_batches 5 --val_batches 5 --finetune --dataset multiflow --self_condition --num_workers 10 --save_val --epochs 1000 --run_name my_run_name --wandb
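When training on a different number of GPUs, it can be useful to keep track of how many samples contribute to each optimizer step. The arithmetic below is a sketch resting on two unverified assumptions about train.py: that --batch_size is the per-GPU batch size, and that --accumulate_grad is the number of batches accumulated before each optimizer step. The function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(num_gpus, batch_size, accumulate_grad):
    """Samples per optimizer step under data-parallel training.

    Assumes --batch_size is per GPU and --accumulate_grad counts accumulated
    batches per step; both are assumptions about train.py, not confirmed.
    """
    return num_gpus * batch_size * accumulate_grad


# Default setup from the command above: 8 GPUs, batch size 8, accumulation 8.
print(effective_batch_size(num_gpus=8, batch_size=8, accumulate_grad=8))  # 512
```

Under the same assumptions, a single-GPU run would need --accumulate_grad 64 to keep the product at 512.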

License

Code and model weights are released under an NVIDIA license for non-commercial or research purposes only. Please see the LICENSE.txt file.

Citation

@inproceedings{stark2025protcomposer,
  title={ProtComposer: Compositional Protein Structure Generation with 3D Ellipsoids},
  author={Hannes Stark and Bowen Jing and Tomas Geffner and Jason Yim and Tommi Jaakkola and Arash Vahdat and Karsten Kreis},
  booktitle={The Thirteenth International Conference on Learning Representations (ICLR)},
  year={2025},
  url={https://openreview.net/forum?id=0ctvBgKFgc}
}
