RAATK: A Python-based Reduce Amino Acid ToolKit of machine learning for protein-dependent inference.


lihaicheng7003/raatk

 
 


RAATK

RAATK: a Python-based reduced amino acid toolkit for machine learning-based inference at the protein sequence level.

Installation

Installing from GitHub with pip is recommended:

$ pip install git+https://github.com/huang-sh/raatk.git@master -U

or

$ pip install raatk

After installing RAATK, all commands used in the paper can be tested by running demo.sh in the demo directory:

$ ./demo.sh

Function


Command

view the built-in reduction alphabets of a given type (-t) at the requested cluster sizes (-s).

$ raatk view -t 9 -s 2 4 6 10 12 14 16 --visual

Output:

type9  2  IMVLFWY-GPCASTNHQEDRK                   BLOSUM50 matrix
type9  4  IMVLFWY-G-PCAST-NHQEDRK                 BLOSUM50 matrix
type9  6  IMVL-FWY-G-P-CAST-NHQEDRK               BLOSUM50 matrix
type9  10 IMV-L-FWY-G-P-C-A-STNH-QERK-D           BLOSUM50 matrix
type9  12 IMV-L-FWY-G-P-C-A-ST-N-HQRK-E-D         BLOSUM50 matrix
type9  14 IMV-L-F-WY-G-P-C-A-S-T-N-HQRK-E-D       BLOSUM50 matrix
type9  16 IMV-L-F-W-Y-G-P-C-A-S-T-N-H-QRK-E-D     BLOSUM50 matrix

[Figure: view]

reduce sequences according to the built-in reduction alphabets. The output is stored in directories.

$ raatk reduce positive.txt negative.txt -t 1-8 -s 2-19 -o pos neg

reduce sequences according to a specific amino acid clustering. The output is written to a single file.

$ raatk reduce positive.txt -c IMV-L-FWY-G-P-C-A-STNH-QERK-D -o reduce_positive.txt
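As an illustration of what reduction by cluster means, here is a minimal Python sketch. It is not RAATK's own code. It assumes that every residue is replaced by the first letter of its cluster, which is consistent with file names such as 4-IGPN.txt for the type9 size-4 alphabet IMVLFWY-G-PCAST-NHQEDRK. The function names and the example sequence are hypothetical.

```python
def build_mapping(clusters: str) -> dict:
    """Map every amino acid to the first letter of its cluster.

    `clusters` uses the dash-separated notation shown by `raatk view`,
    e.g. "IMVLFWY-G-PCAST-NHQEDRK".
    """
    mapping = {}
    for group in clusters.split("-"):
        representative = group[0]  # assumption: first letter represents the cluster
        for residue in group:
            mapping[residue] = representative
    return mapping


def reduce_sequence(sequence: str, mapping: dict) -> str:
    """Replace each residue by its cluster representative.

    Residues that are not in the mapping (e.g. X) are kept unchanged.
    """
    return "".join(mapping.get(residue, residue) for residue in sequence)


# type9, size 4 alphabet taken from the `raatk view` output above
mapping = build_mapping("IMVLFWY-G-PCAST-NHQEDRK")

# hypothetical input sequence, for illustration only
print(reduce_sequence("MKTAYIAKQR", mapping))  # INPPIIPNNN
```

The reduced sequence uses only the four letters I, G, P and N, which is why the later extract examples pass -raa IGPN.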

extract sequence features from directories. The output is also stored in directories.

$ raatk extract pos neg -k 3 -d -o k3 -m

extract sequence features from files. The output is also stored in files.

$ raatk extract pos/type9/4-IGPN.txt neg/type9/4-IGPN.txt -k 1 -o t9s4-k1.csv -m -raa IGPN

Output:

label,I,G,P,N
0.000000,0.125000,0.062500,0.562500,0.250000
0.000000,0.291667,0.166667,0.416667,0.125000
0.000000,0.277778,0.083333,0.416667,0.222222
                  ......
1.000000,0.177778,0.133333,0.377778,0.311111
1.000000,0.166667,0.000000,0.583333,0.250000
1.000000,0.387097,0.161290,0.322581,0.129032

extract a feature file without labels, using counts as feature values.

$ raatk extract pos/type9/4-IGPN.txt -k 1 -o t9s4-k1p.csv -raa IGPN --count --label-f

Output:

I,G,P,N
2.000000,1.000000,9.000000,4.000000
7.000000,4.000000,10.000000,3.000000
10.000000,3.000000,15.000000,8.000000
                  ......
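The two outputs above are related: the first count row (2, 1, 9, 4) divided by its total of 16 gives the first frequency row (0.125, 0.0625, 0.5625, 0.25). The sketch below shows one way such k-mer features could be computed over a reduced alphabet. It is an illustration under assumptions, not RAATK's implementation: the function name is hypothetical, the example sequence is made up to match the first row, and normalizing by the total number of k-mers is only confirmed by the k = 1 output shown here.

```python
from itertools import product


def kmer_features(sequence, alphabet, k=1, count=False):
    """Return (k-mer names, values) for a reduced sequence.

    With count=True the values are raw k-mer counts; otherwise they are
    counts divided by the total number of counted k-mers.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip k-mers containing letters outside the alphabet
            counts[kmer] += 1
    total = sum(counts.values())
    if count or total == 0:
        return kmers, [float(counts[m]) for m in kmers]
    return kmers, [counts[m] / total for m in kmers]


# hypothetical reduced sequence with 2 I, 1 G, 9 P and 4 N
reduced = "II" + "G" + "P" * 9 + "N" * 4

names, freqs = kmer_features(reduced, "IGPN", k=1)
_, counts = kmer_features(reduced, "IGPN", k=1, count=True)
print(names)   # ['I', 'G', 'P', 'N']
print(freqs)   # [0.125, 0.0625, 0.5625, 0.25]
print(counts)  # [2.0, 1.0, 9.0, 4.0]
```

With k = 3 and a 4-letter alphabet the same function would produce 4^3 = 64 feature columns.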

evaluate the performance of different alphabet clusters using machine learning. The output is a JSON file.

$ raatk eval k3 -d -o k3-eval -clf svm -c 2 -g 0.5 -p 3

evaluate a single file.

$ raatk eval k3/type2/10-ARNCQHIFPW.csv -cv -1 -c 2 -g 0.5 -o k3-t2s10.txt

Output:

                        0                         
0   38  7
1   7  36

      tp   fn   fp   tn   recall  precision  f1-score  
  0   38    7    7   36    0.84     0.84       0.84    
  1   36    7    7   38    0.84     0.84       0.84    
acc                                            0.84
mcc                                            0.68
-------------------------------------------------------
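The per-class numbers in the report can be reproduced from the tp, fn, fp and tn columns with the standard definitions of recall, precision, F1, accuracy and the Matthews correlation coefficient (MCC). The sketch below is only a check of the arithmetic; the function name is hypothetical and it does not handle the zero-denominator cases a real tool would need to.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics for one class.

    Assumes all denominators are non-zero.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    acc = (tp + tn) / (tp + fn + fp + tn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"recall": recall, "precision": precision,
            "f1": f1, "acc": acc, "mcc": mcc}


# values from the row for class 0 in the report above
class0 = binary_metrics(tp=38, fn=7, fp=7, tn=36)
# values from the row for class 1
class1 = binary_metrics(tp=36, fn=7, fp=7, tn=38)

for name, value in class0.items():
    print(f"{name}: {value:.2f}")
```

For class 0 this gives recall 38/45, accuracy 74/88 and MCC 1319/1935, which round to the 0.84, 0.84 and 0.68 shown in the report.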

visualize the JSON result.

$ raatk plot k3-eval.json -o k3p

Output: [Figure: plot]

ROC evaluation

$ raatk roc k3/type2/10-ARNCQHIFPW.csv -clf svm -cv 5 -c 2 -g 0.5 -o roc

Output: [Figure: roc]

incremental feature selection

$ raatk ifs k3/type2/10-ARNCQHIFPW.csv -s 2 -clf svm -cv 5 -c 2 -g 0.5 -o ifs

Output: [Figure: roc]
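For readers unfamiliar with incremental feature selection (IFS), the general idea is to rank the features, grow the feature subset in rank order, score each subset, and keep the best one. The sketch below shows that loop in plain Python. It is a generic illustration, not RAATK's implementation: how RAATK ranks features is not stated here, treating -s as the step size is an assumption, and the toy score function stands in for what would normally be a cross-validated classifier score.

```python
def incremental_feature_selection(ranked_features, score_fn, step=1):
    """Grow the feature subset in rank order and keep the best-scoring one.

    ranked_features: feature names, most important first.
    score_fn: callable taking a list of feature names and returning a score
              (in practice, e.g. cross-validated accuracy).
    step: number of features added per iteration; trailing features that do
          not fill a whole step are not evaluated.
    """
    best_score, best_subset = float("-inf"), []
    history = []
    for n in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:n]
        score = score_fn(subset)
        history.append((n, score))
        if score > best_score:
            best_score, best_subset = score, list(subset)
    return best_subset, best_score, history


# hypothetical per-feature contributions, used only to make the example run
toy_gain = {"II": 30, "PN": 20, "GP": 10, "NN": -5, "IG": -10, "NP": -20}
ranked = ["II", "PN", "GP", "NN", "IG", "NP"]


def toy_score(subset):
    return sum(toy_gain[name] for name in subset)


best_subset, best_score, history = incremental_feature_selection(
    ranked, toy_score, step=2
)
print(best_subset)  # ['II', 'PN', 'GP', 'NN']
print(history)      # [(2, 50), (4, 55), (6, 25)]
```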

train a classifier for prediction

$ raatk train ifs_best.csv -clf svm -c 2 -g 0.5 -o svm.model -prob

predict new data using a trained model. The new data must be a feature file without labels, and the feature extraction parameters must be the same as those used for the training features.

$ raatk predict new_data.csv -m svm.model -o 'test-result.csv'

split feature data into train and test subsets

$ raatk split ifs_best.csv -ts 0.3 -o test_split.csv
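Conceptually, a train/test split shuffles the rows and sets a fraction of them aside. The following sketch shows a plain random split, assuming that -ts 0.3 means a test fraction of 30%. The function name and seed parameter are hypothetical, and whether RAATK stratifies by label is not stated here, so this sketch does not.

```python
import random


def split_rows(rows, test_size=0.3, seed=0):
    """Randomly split rows into (train, test) without stratification."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# ten hypothetical feature rows, represented here by their indices
rows = list(range(10))
train, test = split_rows(rows, test_size=0.3)
print(len(train), len(test))  # 7 3
```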

convert a CSV file to ARFF format for Weka.

$ raatk transfer ifs_best.csv -fmt arff
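An ARFF file is a plain-text header (@relation, one @attribute line per column) followed by the comma-separated rows under @data. The sketch below converts a small labelled feature table in the CSV layout shown earlier. It is a minimal illustration and the exact ARFF layout that RAATK writes may differ; the function name, relation name and the choice to declare the label as a nominal {0,1} attribute are assumptions.

```python
import csv
import io


def csv_to_arff(csv_text, relation="raatk_features", label_column="label"):
    """Convert CSV text with a header row into ARFF text.

    The label column becomes a nominal attribute; all others are numeric.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # ignore blank lines
    lines = [f"@relation {relation}", ""]
    for index, name in enumerate(header):
        if name == label_column:
            for row in rows:  # "0.000000" -> "0", "1.000000" -> "1"
                row[index] = str(int(float(row[index])))
            classes = sorted({row[index] for row in rows})
            lines.append(f"@attribute {name} {{{','.join(classes)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# two rows copied from the extract output above
csv_text = (
    "label,I,G,P,N\n"
    "0.000000,0.125000,0.062500,0.562500,0.250000\n"
    "1.000000,0.177778,0.133333,0.377778,0.311111\n"
)
arff = csv_to_arff(csv_text)
print(arff)
```

Because the label is the first column here, the class attribute would need to be selected explicitly in Weka, which by default treats the last attribute as the class.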

Contact

If you have any problems, contact me at hsh-me@outlook.com.

