This project focuses on protein structure prediction using the threading (fold recognition) method. Threading is a computational approach that aligns a protein sequence with known template structures to predict its 3D structure, even when sequence similarity is low. This method identifies candidate templates by considering structural similarities, such as predicted secondary structures and solvent accessibility. By mapping the sequence onto structurally similar templates, threading can accurately predict protein folds, making it a valuable tool in bioinformatics, medicine, and biotechnology. The algorithm is inspired by David Jones's THREADER (Computational Methods in Molecular Biology, Chapter 13, 1998).
Fig. 1: Protein fold recognition summary.
git clone git@github.com:zhukovanadezhda/protein-threading.git
cd protein-threading
Install miniconda. Create the protein-threading
conda environment:
conda env create -f environment.yml
conda activate protein-threading
💡Note: To deactivate an active environment, use:
conda deactivate
💡Note: Before running the program, ensure the
src/config.py
file is properly configured to set up your working directories.
- To run an example: Select one of the provided example directories (e.g.,
data/example1
).- To use custom data: Create your own directory (e.g.,
data/your_dir
) with subdirectories forsequences
andstructures
. Place your data in these folders and update the paths insrc/config.py
.Example
src/config.py
modification:# Directory paths TEMPLATES_DIR = 'data/your_dir/structures/' SEQUENCES_DIR = 'data/your_dir/sequences/'
To run the program, use the following command:
python src/main.py [-h] [--sequences SEQUENCES] [--templates TEMPLATES] [--output_file OUTPUT_FILE] \
[--jobs JOBS] [--dry_run] [--verbose]
Argument | Description | Default |
---|---|---|
-h |
Show a help message and exit. | |
--sequences |
Comma-separated list of sequence filenames (.fasta format). |
All files from SEQUENCES_DIR from src/config.py . |
--templates |
Comma-separated list of template filenames (.pdb format). |
All files from TEMPLATES_DIR from src/config.py . |
--gap_score |
The gap penalty. | 0 |
--output_file |
Name of the output CSV file. | results/energy_scores.csv |
--jobs |
Number of parallel jobs to run. | All cores |
--dry_run |
If set, only log actions without processing. | False (not set) |
--verbose |
If set, verbose output is enabled. | False (not set) |
--print_alignments |
If set, the alignments are printed. | False (not set) |
Table 1: Program parameters.
Estimated execution time: Less than 35 minutes
src/config.py
:# Directory paths TEMPLATES_DIR = 'data/example1/structures/' SEQUENCES_DIR = 'data/example1/sequences/'
python src/main.py --sequences 1CRN.fasta,1L2Y.fasta,1VII.fasta,5AWL.fasta --gap_score 0.2 \
--output_file results/example1_result.csv --print_alignments
2024-09-17 08:10:16,722 - INFO - Loading DOPE score data...
2024-09-17 08:11:47,306 - INFO - Processing sequences and templates...
2024-09-17 08:11:47,306 - INFO - Processing sequence 5AWL.fasta, length: 10
2024-09-17 08:11:50,621 - INFO - Processing template 5awl.pdb with 10 residues.
2024-09-17 08:11:50,444 - INFO - Processing template 1le0.pdb with 12 residues.
0 1 2 3 4 5 6 7 8 9
| | | | | | | | | |
Y Y Y D P E T G T W
2024-09-17 08:11:57,940 - INFO - Processed template 5awl.pdb. Energy score: -32.68
0 1 2 3 4 5 6 7 8 9 10 11
| | | | | | | | | |
Y Y Y D P E T G - T W -
2024-09-17 08:12:00,712 - INFO - Processed template 1le0.pdb. Energy score: -28.12
...
Table 2: Summary of results from the first example.
"aa" stands for amino acids; bold indicates the best score, italics indicate the correct match.
Estimated execution time: Over 16 hours
src/config.py
:# Directory paths TEMPLATES_DIR = 'data/example2/structures/' SEQUENCES_DIR = 'data/example2/sequences/'
python src/main.py --sequences 1E68.fasta,3E8V.fasta --gap_score 0.1 --output_file results/example2_result.csv
Table 3: Summary of results from the second example.
"aa" stands for amino acids; bold indicates the best score, italics indicate the correct match.
To determine whether the results obtained are significantly different from those expected by chance, a statistical approach based on the calculation of z-scores is used. The energy scores observed for each alignment can be compared to a distribution generated by a random shuffle of the sequences. The z-score quantifies the deviation from the mean of the random scores: a z-score lower than -1.96
indicates, with 95% confidence, a significant structural match, while a z-score higher than 1.96
signals an alignment significantly worse than chance. A z-score close to zero suggests the absence of a significant deviation.
💡Note: Before running the program, ensure the
src/config.py
file is properly configured with the correct paths for your working directories.
The working directories should correspond to those where the sequence and template files referenced in the energy scores file are stored.
python src/evaluate_significance.py --input_csv <path_to_input_csv> --output_file <path_to_output_csv> \
--gap_score <gap_score> --n_shuffles <number_of_shuffles>
input_csv
- the input csv file with calculated energy scores, the file should be in the format produced by the main script described above.
src/config.py
:# Directory paths TEMPLATES_DIR = 'data/example1/structures/' SEQUENCES_DIR = 'data/example1/sequences/'
python src/evaluate_significance.py --input_csv results/example1_result.csv \
--output_file results/shuffle_example1.csv \
--n_shuffle 10 --gap_score 0.2
WARNING - The number of shuffles is less than 30. Shapiro-Wilk test will be performed to check for the normality of the shuffled scores.
WARNING - The distribution of shuffled energy scores for sequence 5AWL.fasta is not normally distributed (p-value = 0.0000).
...
|
|
|
|
|
|
|
---|---|---|---|---|---|---|
|
|
0.68 |
0.28 |
-0.91 |
1.22 |
-0.39 |
|
1.12 |
|
2.96 |
2.62 |
1.69 |
1.66 |
|
-0.16 |
0.18 |
-1.05 |
-1.40 |
-1.48 |
|
|
-0.97 |
-0.16 |
-1.32 |
-1.53 |
-1.19 |
-1.56 |
Table 4: Summary of z-scores for alignments between different structures in example 1.
⚠️ Warning: The z-scores may not be interpretable as the shuffled energy scores are not normally distributed. This is merely an example. To perform a meaningful significance evaluation, increase the number of shuffles.
To explore the impact of gap penalties on sequence-structure matching performance, you can use the test_gaps.py
script. This script evaluates several gap penalty values to determine the optimal setting for the algorithm. Performance is assessed based on the following scoring system:
- Correct Structure Prediction: If the algorithm accurately predicts the structure of the sequence, it earns 2 points.
- Similar Structure Bonus: If a similar structure appears among the top two predictions, the algorithm earns an additional 1 point.
- Thus, the maximum score for each sequence is 3 points, and the final score is normalized by dividing it by the maximum possible score.
💡 Note: Before running the program, ensure that
src/config.py
is properly configured. Set the paths for your working directories, the gap penalties to test, and similar structures to evaluate performance.Example modification to
src/config.py
:# Gap penalties to test gap_scores = [0, 0.1, 0.2] # Proteins with similar structures homolog_pairs = { 'A.fasta': ['b.pdb'], 'B.fasta': ['a.pdb', 'c.pdb'], 'C.fasta': ['b.pdb'] }
python src/test_gaps.py [--program_path PROGRAM_PATH] [--output_dir OUTPUT_DIR] [--result_file RESULT_FILE]
src/config.py
:# Gap scores to test gap_scores = [0, 0.1, 0.2, 0.3, 0.5, 1, 2, 5] # Proteins with similar structures homolog_pairs = { '5AWL.fasta': ['1l2y.pdb', '1vii.pdb', '1crn.pdb'], '1VII.fasta': ['1l2y.pdb', '1crn.pdb'], '1L2Y.fasta': ['1vii.pdb', '1crn.pdb'], '1CRN.fasta': ['1l2y.pdb','1vii.pdb', '1crn.pdb', '1le0.pdb', '1le1.pdb'] }
python src/test_gaps.py --output_dir results/gaps_test --result_file results/performance.csv
|
|
|
|
---|---|---|---|
0.0 |
0.5 |
1 |
4 |
0.1 |
0.58 |
2 |
3 |
0.2 |
0.92 |
4 |
3 |
0.3 |
0.92 |
4 |
3 |
0.5 |
0.83 |
4 |
2 |
1.0 |
0.75 |
4 |
1 |
2.0 |
0.75 |
4 |
1 |
5.0 |
0.75 |
4 |
1 |
Table 5: Summary of algorithm performance based on gap penalty.
JONES, D. Threader: protein sequence threading by double dynamic programming. Computational Methods in Molecular Biology. Elsevier, 1996. v. 32, cap. 13, p. 312–338
For questions or issues, please open an issue on GitHub or contact nadiajuckova@gmail.com.