ACTINN (Automated Cell Type Identification using Neural Networks) is a bioinformatic tool to quickly and accurately identify cell types in scRNA-Seq. For details, please read the paper: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz592/5540320 All datasets used in the paper are available here: https://figshare.com/articles/ACTINN/8967116
python 3.6
python packages: tensorflow 1.10+, numpy 1.14+, pandas 0.23+, argparse 1.1+, scipy 1.1+
We use HDF5 format for the scRNA-Seq expressiong matrix, which stores the compressed matrix and is fast to load. To convert the format, we first read the expression matrix as a pandas dataframe, then we use the to_hdf function to save the file as HDF5 format. For the to_hdf function, we use "dge", which stands for digital gene expression, for the key parameter.
python actinn_format.py -i input_file -o output_prefix -f format
- -i Path to the input file or the 10X directory
- -o Prefix of the output file
- -f Format of the input file (10X_V2, 10X_V3, txt, csv)
The output will be an HDF5 formated file named after the output prefix with ".h5" extension
python actinn_format.py -i ./test_data/train_set_10x -o train_set -f 10X_V2
python actinn_format.py -i ./test_data/train_set_10x -o train_set -f 10X_V3
python actinn_format.py -i ./test_data/train_set.txt.gz -o train_set -f txt
python actinn_format.py -i ./test_data/train_set.csv.gz -o train_set -f csv
We train a 4 layer (3 hidden layers) neural network on scRNA-Seq datasets with predifined cell types, then we use the trained parameters to predict cell types for other datasets.
python actinn_predict.py -trs training_set -trl training_label -ts test_set -lr learning_rat -ne num_epoch -ms minibatch_size -pc print_cost -op output_probability
- -trs Path to the training set, must be HDF5 format with key "dge".
- -trl Path to the training label (the cell types for the training set), must be tab separated text file with no column and row names.
- -ts Path to test sets, must be HDF5 format with key "dge".
- -lr Learning rate (default: 0.0001). We can increase the learning rate if the cost drops too slow, or decrease the learning rate if the cost drops super fast in the beginning and starts to fluctuate in later epochs.
- -ne Number of epochs (default: 50). The number of epochs can be determined by looking at the cost after each epoch. If the cost starts to decrease very slowly after ceartain epoch, then the "ne" parameter should be set to that epoch number.
- -ms Minibatch size (default: 128). This parameter can be set larger when training a large dataset.
- -pc Print cost (default: True). Whether to print cost after each 5 epochs.
- -op Output probabilities for each cell being the cell types in the training data (default: False).
The output will be a file named "predicted_label.txt". In the file, the first column will be the cell name, the second column will be the predicted cell type. If the "op" parameter is set to True, there will be another output file named "predicted_probablities.txt", where columns are cells and rows are cell types. The number in row i and column j will be the probablity that cell j being cell type i.
python actinn_predict.py -trs ./test_data/train_set.h5 -trl ./test_data/train_label.txt.gz -ts ./test_data/test_set.h5 -lr 0.0001 -ne 50 -ms 128 -pc True -op False
We show an example on how to create a tSNE plot with the predicted cell types. The R command can be found in the "tSNE_Example" folder.