Skip to content

NVlabs/protcomposer

Repository files navigation

ProtComposer: Compositional Protein Structure Generation with 3D Ellipsoids

Oral at ICLR 2025

Hannes Stark*·Bowen Jing*·Tomas Geffner·Jason Yim·Tommi Jaakkola·Arash Vahdat·Karsten Kreis

*equal contribution

Paper


Animation Animation

Environment

We recommend installing Miniconda (https://docs.anaconda.com/miniconda/install/) and creating the following conda environment:

conda create -n nv python=3.9
pip install jupyterlab
pip install numpy==1.21.2 pandas==1.5.3
pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install biopython==1.79 dm-tree==0.1.6 modelcif==0.7 ml-collections==0.1.0 scipy==1.7.1 absl-py einops
pip install pytorch_lightning==2.0.4 fair-esm
pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@5484c38'
pip install matplotlib==3.7.2
pip install pydssp biotite omegaconf wandb
pip install numpy==1.21.2
# pip install numpy throws an errror for contourpy but that is fine
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.12.1+cu113
pip3 install -U scikit-learn
pip install gpustat

Pretrained checkpoints

We provide two pretrained checkpoints in the directory model_weights. One is trained on data from the Protein Data Bank, the other on AlphaFold Database.

Description Checkpoint path
AlphaFold Database training data model_weights/trained_on_afdb.ckpt
Protein Data Bank training data model_weights/trained_on_pdb.ckpt

Sampling

To sample ProtComposer conditioned on ellipsoids from our ellipsoid statistical model:

python sample.py --guidance 1.0 --num_prots 6 --nu 5 --sigma 6 --helix_frac 0.4  --seed 1 --outdir results --num_blobs 9 --ckpt "model_weights/trained_on_pdb.ckpt"

Evaluation

We use these scripts to compute metrics for the outputs (evaluate_alignment computes the ellipsoid adherence metrics):

python -m scripts.evaluate_designability --dir results
python -m scripts.evaluate_alignment --dir results

Training

Data preparations

We use the data from MultiFlow: https://github.com/jasonkyuyim/multiflow

They host the datasets on Zenodo here. Download the following files, and place them into the directory data

  • real_train_set.tar.gz (2.5 GB)
  • synthetic_train_set.tar.gz (220 MB)
  • test_set.tar.gz (347 MB) Next, untar the files
# Uncompress training data
mkdir train_set
tar -xzvf real_train_set.tar.gz -C train_set/
tar -xzvf synthetic_train_set.tar.gz -C train_set/

# Uncompress test data
mkdir test_set
tar -xzvf test_set.tar.gz -C test_set/

Download the files and put them into the directory data such that you obtain the following file structure (the --pkl_dir argument is data by default):

data
├── train_set
│   ├── processed_pdb
|   |   ├── <subdir>
|   |   |   └── <protein_id>.pkl
│   ├── processed_synthetic
|   |   └── <protein_id>.pkl
├── test_set
|   └── processed
|   |   ├── <subdir>
|   |   |   └── <protein_id>.pkl
...

Launch training run

By default, we train on 8 GPUs.

python train.py --batch_size 8 --designability --designability_freq 5 --num_designability_prots 50 --accumulate_grad 8 --inf_batches 5 --val_batches 5 --finetune --dataset multiflow --self_condition --num_workers 10 --save_val --epochs 1000 --run_name my_run_name --wandb

License

Code and model weights are released under an NVIDIA license for non-commercial or research purposes only. Please see the LICENSE.txt file.

Citation

@inproceedings{stark2025protcomposer,
  title={ProtComposer: Compositional Protein Structure Generation with 3D Ellipsoids},
  author={Hannes Stark and Bowen Jing and Tomas Geffner and Jason Yim and Tommi Jaakkola and Arash Vahdat and Karsten Kreis},
  booktitle={The Thirteenth International Conference on Learning Representations (ICLR)},
  year={2025},
  url={https://openreview.net/forum?id=0ctvBgKFgc}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages