- Remove PhyloPandas dependency
- Update dependencies in main branch and fix version issue in do_embedding.
- If you would like to try homolog retrieval benchmarks, please switch to v1 branch. v2 is built for large scale searches.
- Due to numerous reports on PhyloPandas issue (for quick fasta IO), it has currently been removed and currently only tsv file format is supported in main branch.
- Clone the repo
git clone https://github.com/heathcliff233/Dense-Homolog-Retrieval.git
- Go to the directory
cd Dense-Homolog-Retrieval
- Build using environment.yml
conda create --name fastMSA --file environment.yml -c pytorch -c conda-forge -c bioconda
- Activate the environment
conda activate fastMSA
Please download the checkpoints here and unzip. We will denote the absolute path to the checkpoint as $MODEL_PATH
If you would like a quick test with pre-built index or want to use esm1, please switch to v1 branch.
- Get the path to sequence database as
$SEQDB_PATH
(require tsv format) and path to output as $OUTPUT_PATH (The sequence database should be in tsv format) - Use
python3 do_embedding.py trainer.ur90_path=$SEQDB_PATH model.ckpt_path=$MODEL_PATH hydra.run.dir=$OUTPUT_PATH
to do embedding. Please note that$SEQDB_PATH
needs to be an absolute path. - Aggregate all the result using
python3 do_agg.py -s $SEQDB_PATH -e $OUTPUT_PATH/ebd -o $OUTPUT_PATH/agg
- For power users, please modify the settings in configuration to allow parallel embedding.
python3 do_retrieval.py
usage: do_retrieval.py [-h] [-i INPUT_PATH] [-d DATABASE_PATH] [-o OUTPUT_PATH] [-n NUM] [-r ITERS]
fastMSA do homolog retrieval.
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input_path INPUT_PATH
path of the tsv file containing query sequences
-d DATABASE_PATH, --database_path DATABASE_PATH
path of dir containing database embedding and db converted to DataFrame
-o OUTPUT_PATH, --output_path OUTPUT_PATH
path to output msas
-n NUM, --num NUM retrieve num
-r ITERS, --iters ITERS
num of iters by QJackHMMER
- input_path: put all query seqs into one tsv file
- output_path: output dir -- seq/db/res, seq subdir contain all queries, db contain retrieved db, res contain all results
- database_path: directory containing database in DataFrame and embedding saved in faiss index. All results produced in Offline Embedding section.
Install ColabFold
pip install -q --no-warn-conflicts 'colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold
Run batch prediction
colabfold_batch $MSA_DIR $PREDICTION_RES
If you find it useful, please cite our paper.
@article{Hong2024Aug,
author = {Hong, Liang and Hu, Zhihang and Sun, Siqi and Tang, Xiangru and Wang, Jiuming and Tan, Qingxiong and Zheng, Liangzhen and Wang, Sheng and Xu, Sheng and King, Irwin and Gerstein, Mark and Li, Yu},
title = {{Fast, sensitive detection of protein homologs using deep dense retrieval}},
journal = {Nat. Biotechnol.},
pages = {1--13},
year = {2024},
month = aug,
issn = {1546-1696},
publisher = {Nature Publishing Group},
doi = {10.1038/s41587-024-02353-6}
}