New and improved embedding models combining sequence and structure training are now available at https://github.com/tbepler/prose!
This repository contains the source code and links to the data and pretrained embedding models accompanying the ICLR 2019 paper "Learning protein sequence embeddings using information from structure".
@inproceedings{
bepler2018learning,
title={Learning protein sequence embeddings using information from structure},
author={Tristan Bepler and Bonnie Berger},
booktitle={International Conference on Learning Representations},
year={2019},
}
Dependencies:
- python 3
- pytorch >= 0.4
- numpy
- scipy
- pandas
- scikit-learn (imported as sklearn)
- cython
- h5py (for embedding script)
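Before compiling, it can help to confirm that the dependencies above are importable in the active environment. The following is a minimal sketch and not part of this repository: the helper name `missing_modules` is hypothetical, and the import names (`torch` for pytorch, `sklearn` for scikit-learn, `Cython` for cython) are the usual ones for these packages. It does not check version numbers such as pytorch >= 0.4.

```python
import importlib.util

# Import names for the dependencies listed above (assumed, not taken from
# this repository): pytorch is imported as "torch", scikit-learn as
# "sklearn", and cython as "Cython".
REQUIRED = ["torch", "numpy", "scipy", "pandas", "sklearn", "Cython", "h5py"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

h5py is only needed for the embedding script, so a report that lists only h5py does not block training or evaluation.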
Run setup.py to compile the cython files:
python setup.py build_ext --inplace
The data sets with train/dev/test splits are provided as .tar.gz files from the links below.
The training and evaluation scripts assume that these data sets have been extracted into a directory called 'data'.
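Extracting a downloaded archive into 'data' could look like the sketch below. It is an illustration under assumptions: the helper `extract_archive` and the archive file name in the usage note are hypothetical, and whether the archives already contain a top-level directory is not stated here, so check the resulting layout after extraction.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest="data"):
    """Extract a .tar.gz archive into dest, creating the directory if needed."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Only extract archives obtained from a source you trust.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

For example, `extract_archive("some_dataset.tar.gz")` would unpack a placeholder archive of that name into ./data relative to the current working directory.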
Our trained versions of the structure-based embedding models and the bidirectional language model can be downloaded here.
Tristan Bepler (tbepler@mit.edu)
Please cite the above paper if you use this code or pretrained models in your work.
The source code and trained models are provided free for non-commercial use under the terms of the CC BY-NC 4.0 license. See the LICENSE file and/or https://creativecommons.org/licenses/by-nc/4.0/legalcode for more information.
If you have any questions or comments, or would like to report a bug, please file a GitHub issue or contact me at tbepler@mit.edu.