
Nature Computational Science: Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model


USPNet

Update (March 2024): We provide a demo of USPNet-fast, which takes raw amino acid sequences as input. A tutorial video (in Chinese) is also available.

This repository contains the code for the paper Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model, published in Nature Computational Science.

Overview

The full text of the paper can also be accessed via the view-only link.

You can use either USPNet or USPNet-fast to predict the signal peptide of a protein sequence.

Local Environment Setup for running the test

First, download the repository and create the environment.

Create an environment with conda

requirement

git clone https://github.com/ml4bio/USPNet.git
cd ./USPNet
conda env create -f ./environment.yml

Download the benchmark set

Test set.

Download categorical benchmark data

Categorical test data.

Download embeddings for the benchmark set

MSA embedding for test set.

All the data mentioned above can also be obtained from our OSF project.

Download trained predictive models

USPNet prediction head

USPNet prediction head (without organism group information).

USPNet-fast prediction head.

USPNet-fast prediction head (without organism group information).

Trained predictive model targeting the major class (Sec/SPI)

A specialized model optimized for higher accuracy on the major class (Sec/SPI). It was trained with an increased weight on the Sec/SPI class in the objective function.

USPNet-fast prediction head (focused on Sec/SPI; requires organism group information).
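To illustrate what an increased class weight in the objective function can mean, here is a minimal sketch of a class-weighted cross-entropy for a single example. It is not the training code of USPNet: the function name, the three-class setup, and the weight values are all hypothetical, and the actual loss used in the paper may differ.

```python
import math

def weighted_cross_entropy(logits, target, class_weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    # Numerically stable log-softmax: subtract the max logit before exponentiating.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    log_prob_target = logits[target] - log_norm
    return -class_weights[target] * log_prob_target

# Hypothetical setup: index 1 stands in for the major class and is up-weighted.
class_weights = [1.0, 2.0, 1.0]
logits = [0.2, 1.5, -0.3]

loss_weighted = weighted_cross_entropy(logits, 1, class_weights)
loss_unweighted = weighted_cross_entropy(logits, 1, [1.0, 1.0, 1.0])
```

With a weight of 2.0 on the true class, the loss for that example is exactly twice the unweighted loss, so errors on the up-weighted class contribute more to the gradient during training.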

Usage

Put all the downloaded files into the same folder.

If you want to use USPNet on our benchmark set, please run:

# Data processing. The data_processed/ folder is created by default.
python data_processing.py
# Place the MSA embeddings in the data_processed/ folder before running prediction.
python predict.py

# categorical benchmark data
unzip test_data.zip
python test.py

Demo of USPNet on benchmark data without organism group information:

python predict.py --group_info no_group_info

python test.py no_group_info

Demo of USPNet-fast on benchmark data:

python predict_fast.py

python test_fast.py

Demo of USPNet-fast on benchmark data without organism group information:

python predict_fast.py --group_info no_group_info

python test_fast.py no_group_info

To generate MSA embeddings on your own protein sequences and use USPNet to perform signal peptide prediction, please run:

# MSA embedding generation. <data_directory_path>: Directory where the processed data will be saved. <msa_directory_path>: Directory for storing MSA files (.a3m).
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path> --msa_dir <msa_directory_path>

# Prediction. Use 'python predict.py --data_dir <data_directory_path> --group_info no_group_info' if organism group information is unavailable.
python predict.py --data_dir <data_directory_path>
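Before running data_processing.py on your own sequences, it can help to sanity-check the FASTA file. The sketch below parses FASTA lines and reports characters outside the 20 standard amino acids. It is only an illustrative pre-check: the helper names are hypothetical, and this README does not state which header format or non-standard residue letters USPNet accepts, so treating such letters as problems is an assumption.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def invalid_residues(sequence):
    """Return the sorted list of characters that are not standard amino acids."""
    return sorted(set(sequence) - VALID_RESIDUES)

# Made-up example input; a real file would be read with open(path).
example = [">seq1", "MKKTAIAIAV", "ALAGFATVAQA", ">seq2", "MXB"]
records = read_fasta(example)
problems = {h: invalid_residues(s) for h, s in records if invalid_residues(s)}
```

Here `problems` maps each header to the unexpected characters found in its sequence, so an empty dictionary means every sequence uses only standard residue letters.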

If you want to use USPNet-fast to perform signal peptide prediction on your own protein sequences, please run:

# Data processing. Processed data is saved in data_processed/ by default.
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path>

# Prediction. Use 'python predict_fast.py --data_dir <data_directory_path> --group_info no_group_info' if organism group information is unavailable.
python predict_fast.py --data_dir <data_directory_path>
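The two workflows for custom sequences differ only in the prediction script and in whether an MSA directory is passed. The following sketch assembles the command lines shown above so they can be scripted. The helper `build_commands` and its parameters are hypothetical; only the script names and flags come from this README, and the sketch assumes `--msa_dir` is passed only for USPNet.

```python
def build_commands(fasta_file, data_dir, msa_dir=None, fast=False, group_info=True):
    """Return [processing_cmd, prediction_cmd] as argument lists."""
    process = [
        "python", "data_processing.py",
        "--fasta_file", fasta_file,
        "--data_processed_dir", data_dir,
    ]
    # Assumption: the MSA directory is only relevant for USPNet, not USPNet-fast.
    if not fast and msa_dir is not None:
        process += ["--msa_dir", msa_dir]

    script = "predict_fast.py" if fast else "predict.py"
    predict = ["python", script, "--data_dir", data_dir]
    if not group_info:
        predict += ["--group_info", "no_group_info"]
    return [process, predict]

# Example with placeholder paths (not files shipped with the repository).
commands = build_commands("my_proteins.fasta", "data_processed",
                          fast=True, group_info=False)
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which stops the pipeline if data processing fails before prediction starts.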

Citations

If you find the models useful in your research, please kindly cite our paper:

@article{shen2024unbiased,
  title={Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model},
  author={Shen, Junbo and Yu, Qinze and Chen, Shenyang and Tan, Qingxiong and Li, Jingchen and Li, Yu},
  journal={Nature Computational Science},
  volume={4},
  number={1},
  pages={29--42},
  year={2024},
  publisher={Nature Publishing Group US New York}
}
