Update (March 2024): We provide a demo of USPNet-fast, which takes raw amino acid sequences as input. A tutorial video (in Chinese) is also available.
This repository contains the code for the paper "Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model", published in Nature Computational Science.
The full text of the paper can also be accessed via the view-only link.
You can use either USPNet or USPNet-fast to predict the signal peptide of a protein sequence.
First, download the repository and create the environment.
Requirements
git clone https://github.com/ml4bio/USPNet.git
cd ./USPNet
conda env create -f ./environment.yml
All the data mentioned above can also be obtained from our OSF project.
USPNet prediction head (without organism group information).
USPNet-fast prediction head (without organism group information).
Specialized model trained for higher accuracy on the major class (Sec/SPI). The objective function assigns an increased weight to this class.
USPNet-fast prediction head (focused on Sec/SPI, requires organism group information).
Put all the downloaded files into the same folder.
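The class weighting described for the Sec/SPI-focused model can be illustrated with a weighted cross-entropy. This is only a minimal sketch and not the authors' objective function: the label set, the weight values, and the function `weighted_cross_entropy` are hypothetical, chosen to show how a larger weight on one class increases its contribution to the loss.

```python
import math

# Hypothetical label set and weights, for illustration only; the classes
# and weights actually used to train USPNet may differ.
CLASSES = ["NO_SP", "Sec/SPI", "Sec/SPII", "Tat/SPI"]
CLASS_WEIGHTS = {"NO_SP": 1.0, "Sec/SPI": 2.0, "Sec/SPII": 1.0, "Tat/SPI": 1.0}


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, true_class, weights=CLASS_WEIGHTS):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    probs = softmax(logits)
    index = CLASSES.index(true_class)
    return -weights[true_class] * math.log(probs[index])


# With uniform logits every class has probability 0.25, so the unweighted
# loss is the same for any label; only the class weight differs.
uniform_logits = [0.0, 0.0, 0.0, 0.0]
loss_major = weighted_cross_entropy(uniform_logits, "Sec/SPI")
loss_other = weighted_cross_entropy(uniform_logits, "Tat/SPI")
print(loss_major, loss_other)
```

In this toy setting the Sec/SPI example contributes exactly twice the loss of the Tat/SPI example, so errors on the major class are penalized more heavily during training.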
If you want to use USPNet on our benchmark set, please run:
# Data processing; the data_processed/ folder is created by default
python data_processing.py
# Place the MSA embeddings in the data_processed/ folder before running prediction
python predict.py
# categorical benchmark data
unzip test_data.zip
python test.py
Demo of USPNet on benchmark data without organism group information:
python predict.py --group_info no_group_info
python test.py no_group_info
Demo of USPNet-fast on benchmark data:
python predict_fast.py
python test_fast.py
Demo of USPNet-fast on benchmark data without organism group information:
python predict_fast.py --group_info no_group_info
python test_fast.py no_group_info
To generate MSA embeddings on your own protein sequences and use USPNet to perform signal peptide prediction, please run:
# MSA embedding generation. <data_directory_path>: Directory where the processed data will be saved. <msa_directory_path>: Directory for storing MSA files (.a3m).
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path> --msa_dir <msa_directory_path>
# Prediction. Use 'python predict.py --data_dir <data_directory_path> --group_info no_group_info' if organism group information is unavailable.
python predict.py --data_dir <data_directory_path>
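The two steps above can also be assembled from a script. The following is a minimal sketch that only builds the command lines using the flags shown in this README; the helper `build_uspnet_commands` and the file and directory names are hypothetical placeholders, not part of the repository.

```python
import shlex


def build_uspnet_commands(fasta_file, data_dir, msa_dir, has_group_info=True):
    """Return the two USPNet command lines as argument lists."""
    process_cmd = [
        "python", "data_processing.py",
        "--fasta_file", fasta_file,
        "--data_processed_dir", data_dir,
        "--msa_dir", msa_dir,
    ]
    predict_cmd = ["python", "predict.py", "--data_dir", data_dir]
    if not has_group_info:
        # Flag taken from the README for sequences without organism group labels.
        predict_cmd += ["--group_info", "no_group_info"]
    return [process_cmd, predict_cmd]


# Placeholder paths; replace them with your own files and directories.
commands = build_uspnet_commands(
    "my_proteins.fasta", "my_processed", "my_msa", has_group_info=False
)
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the steps rather than print them, each list could be passed to `subprocess.run(cmd, check=True)` from inside the USPNet directory, so that a failure in data processing stops the script before prediction.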
If you want to use USPNet-fast to perform signal peptide prediction on your own protein sequences, please run:
# Data processing. Processed data is saved in data_processed/ by default.
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path>
# Prediction. Use 'python predict_fast.py --data_dir <data_directory_path> --group_info no_group_info' if organism group information is unavailable.
python predict_fast.py --data_dir <data_directory_path>
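If your sequences are held in memory rather than in a file, a FASTA file for `--fasta_file` can be written first. The sketch below is an assumption-laden illustration: the helper `write_fasta`, the allowed-residue check, and the example records are hypothetical, and the README does not state which header format or residue symbols `data_processing.py` expects.

```python
import tempfile
from pathlib import Path

# The 20 standard amino acids plus X for unknown residues. Whether the USPNet
# scripts accept other symbols is not stated in the README, so this check is
# a conservative assumption.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def write_fasta(records, path):
    """Write (identifier, sequence) pairs to a FASTA file, one sequence per line."""
    lines = []
    for identifier, sequence in records:
        sequence = sequence.strip().upper()
        invalid = set(sequence) - ALLOWED_RESIDUES
        if not sequence or invalid:
            raise ValueError(
                f"{identifier}: invalid sequence (unexpected symbols: {sorted(invalid)})"
            )
        lines.append(f">{identifier}")
        lines.append(sequence)
    path = Path(path)
    path.write_text("\n".join(lines) + "\n")
    return path


# Made-up example sequences, used only to show the file layout.
records = [
    ("example_1", "MKKLLPTAAAGLLLLAAQPAMA"),
    ("example_2", "MSTNPKPQRKTKRNTNRRPQDV"),
]
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = write_fasta(records, Path(tmp) / "my_proteins.fasta")
    content = fasta_path.read_text()
print(content)
```

The resulting file path can then be passed as `<fasta_file_path>` to `data_processing.py`.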
If you find the models useful in your research, please cite our paper:
@article{shen2024unbiased,
  title={Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model},
  author={Shen, Junbo and Yu, Qinze and Chen, Shenyang and Tan, Qingxiong and Li, Jingchen and Li, Yu},
  journal={Nature Computational Science},
  volume={4},
  number={1},
  pages={29--42},
  year={2024},
  publisher={Nature Publishing Group US New York}
}