Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate one third of microbial protein sequences, which hampers our ability to exploit sequences collected from diverse organisms. In this code, I explore an alternative methodology based on deep learning that learns the relationship between unaligned amino acid sequences and their functional annotations in the Pfam database, which contains 17929 families.
My study covers only 600 of those families, not the full dataset.
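
How the 600 families were chosen is not stated here. As a minimal sketch, assuming they are the families with the most sequences and that the data is available as `(sequence, family_accession)` pairs, the subset and its integer class labels could be built as follows. The selection rule, the record layout, and the names `select_top_families` and `records` are all hypothetical.

```python
from collections import Counter


def select_top_families(records, n_families=600):
    """Keep only records whose family is among the n_families most frequent.

    records: iterable of (sequence, family_accession) pairs (assumed layout).
    Returns (filtered_records, label_index), where label_index maps each kept
    family accession to an integer class id.
    """
    records = list(records)
    counts = Counter(family for _, family in records)
    # most_common orders ties by first occurrence, so the result is deterministic.
    kept = [family for family, _ in counts.most_common(n_families)]
    label_index = {family: i for i, family in enumerate(kept)}
    filtered = [(seq, fam) for seq, fam in records if fam in label_index]
    return filtered, label_index


# Toy data: the sequences and accessions are illustrative placeholders only.
toy = [
    ("MKVLA", "PF00001"),
    ("MKVLG", "PF00001"),
    ("GAVLI", "PF00002"),
    ("TTSSY", "PF00003"),
    ("GAVLL", "PF00002"),
    ("MKVIA", "PF00001"),
]
subset, labels = select_top_families(toy, n_families=2)
```

With `n_families=2`, the rarest toy family is dropped and the two remaining families receive class ids 0 and 1 in order of frequency.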
# Architecture

[Figure: model architecture]

[Figure: training accuracy vs. validation accuracy (left); training loss vs. validation loss (right)]

- Pre-trained model: https://drive.google.com/file/d/12ZsTkRlEPG8DL50Wb_tdDmHINv9pKTbj/view?usp=share_link
- Pre-trained model weights: https://drive.google.com/file/d/1bj4uJBu7rbO6OaIZg--IkOC5yke_WiLn/view?usp=share_link
- Tokenizer: https://drive.google.com/file/d/1-01g2VBsa6hMSCRB-DGylfffJDrCRXu4/view?usp=share_link
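
The format of the tokenizer file is not described here. As a rough illustration of what character-level tokenization of an unaligned protein sequence involves, the sketch below maps each of the 20 standard amino acids to an integer, reserves 0 for padding and 1 for any other symbol, and pads or truncates to a fixed length. The vocabulary order, the reserved ids, and `max_len` are assumptions and may not match the linked tokenizer.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
PAD_ID = 0  # assumed padding id
UNK_ID = 1  # assumed id for non-standard symbols such as X, B, Z
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, max_len=100):
    """Convert a protein sequence to a fixed-length list of integer ids."""
    ids = [VOCAB.get(residue, UNK_ID) for residue in sequence.upper()]
    ids = ids[:max_len]                     # truncate sequences longer than max_len
    ids += [PAD_ID] * (max_len - len(ids))  # right-pad shorter sequences
    return ids


encoded = encode("MKXA", max_len=6)
```

Right-padding to a fixed length is what lets sequences of different lengths be batched together without aligning them first.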