The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.
The Smith–Waterman algorithm has several steps:
-
Determine the substitution matrix and the gap penalty scheme. A substitution matrix assigns each pair of bases or amino acids a score for match or mismatch. A gap penalty function determines the score cost for opening or extending gaps.
-
Initialize the scoring matrix. The dimensions of the scoring matrix are 1+length of each sequence respectively. All the elements of the first row and the first column are set to 0. The extra first row and first column make it possible to align one sequence to another at any position, and setting them to 0 makes the terminal gap free from penalty.
-
Scoring. Score each element from left to right, top to bottom in the matrix, considering the outcomes of substitutions (diagonal scores) or adding gaps (horizontal and vertical scores). If none of the scores are positive, this element gets a 0. Otherwise the highest score is used and the source of that score is recorded.
-
Traceback. Starting at the element with the highest score, traceback based on the source of each score recursively, until 0 is encountered. The segments that have the highest similarity score based on the given scoring system is generated in this process.
To clone the repository:
git clone git@github.com:zhukovanadezhda/smith-waterman.git
To test the algorithm:
gcc -Wall projet.c -o pro ; ./pro <filename1> <filename2>
Where <filename1>
and <filename2>
are the .fasta files with sequences. Each file should include only one sequence.
In this project the Smith–Waterman algorithm was performed in C language.
The following functions were used:
int sequence_len(char *filename)
- calculates the length of the sequence
char *read_sequence(char *filename)
- reads a sequence from the file
int **read_substitution_matrix(char *filename)
- reads substitution matrix from the file
int **initialize_matrix(int n, int m, int value)
- initializes a score matrix
int compute_score(char *a, char *b, int **substitution_matrix)
- counts the score from the substitution matrix
int **fill_matrix(char *seq1, char *seq2, int **substitution_matrix, int gap_penalty)
- fills the score matrix
int calculate_alignment(char *seq1, char *seq2, int **substitution_matrix, int gap_penalty, char **aligned_seq1, char **aligned_seq2)
- calculates th alignment score and the aligned sequences
void print_seq(char* seq1, char* seq2)
- prints alignment with no more than 60 charecters per line
int main(int argc, char *argv[])
- takes filenames as arguments and executes some of functions mentioned above
To test the program two web resources were used. One for the short examples with visualisation of the score matrix and another one is a real Smith–Waterman algorithm from EMBOSS to align real sequences.
gcc -Wall projet.c -o pro ; ./pro examples/seq0.fasta examples/seq0.fasta
Alignment score: 20
Start position: 0, 0
End position: 5, 5
Alignment:
AAAAA
AAAAA
gcc -Wall projet.c -o pro ; ./pro examples/example1.fasta examples/example2.fasta
Alignment score: 23
Start position: 4, 1
End position: 9, 5
Alignment:
AWGHE
AW-HE
gcc -Wall projet.c -o pro ; ./pro examples/1AKE_1.fasta examples/1BTA_1.fasta
Alignment score: 48
Start position: 45, 6
End position: 109, 74
Alignment:
GKQAKDIMDAGKLVTDELVI-ALVKERI-AQEDCRNGFLLDGFPRTIP--QADAMKEAGI
GEQIRSISDLHQTLKKELALPEYYGENLDALWDCLTGW-VE-YPLVLEWRQFEQSKQLTE
N-VDYVLE-F
NGAESVLQVF
gcc -Wall projet.c -o pro ; ./pro examples/1K5D_3.fasta examples/1K5C_1.fasta
Alignment score: 46
Start position: 69, 201
End position: 193, 323
Alignment:
IFTGR-VKDEIPEALRLLLQALLKCP-KLHTVRLSDNAFGPTAQEPLIDFLSKHTPL-EH
IATGKHVSNVVIKG-NTVTRSMYGVRIKAQRTATSASVSGVTYDANTISGIAKYGVLISQ
LYLHNNGLGPQAGAKIARA-LQELAVNKKAKNAPPLRSIICGRNRLENGSMKEWAK-TFQ
SYPDDVG-NPGTGAPFSDVNFTGGATTIKVNNAATRVTVECG-N-C-SGNW-NWSQLTVT
SHRLLHTVK
GGK-AGTIK