This repository contains the code for the paper "PorTis: A biological phenotype classification dataset for benchmarking nanopore signal-based protein analysis". It includes code for loading, analyzing, and preprocessing the PorTis dataset, as well as for training the tissue classification models and running inference with them.
Authors: Cailin Winston, Marc Expòsit, Jeff Nivala
/src
contains core functionality for loading/analyzing data and for running benchmarks
constants.py: constant definitions to load and save data files
kmedoids_mod.py: modified version of scikit-learn-extra's k-medoids function to cluster signals and return centroids
knn_classifier.py: functions required to train and test the kNN classifier
knn_load_data.py: functions to load and save signals, labels, and the distance matrix for the kNN classifier
metrics.py: functions to calculate metrics of the CNN models
models.py: model architecture of the CNN classifiers
signal_processing.py: functions used to process signals before use in the classifiers
utils_data.py: functions used to load data for the CNN classifiers
utils_models.py: helper functions to train the CNN classifiers
/notebooks
contains interactive notebooks for the empirical analysis and clustering of the dataset
1.0.DataAnalysis.ipynb: Exploratory data analysis of signal number and length by replicate and tissue type
2.0.DataPrepkNN.ipynb: Balancing data across replicates and tissues to use in kNN classification
2.1.DTWmatrixCalc.ipynb: Calculating the distance matrix between signals using Dynamic Time Warping (DTW)
3.0.ClustAllSignals.ipynb: t-SNE representation of all signals
4.0.kNNinform.ipynb: kNN classification and identification of signals unique to each tissue
5.0.InformativeSignalClustering.ipynb: Analysis and clustering of informative signals
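As a rough illustration of what a DTW distance matrix is, the following is a minimal sketch, not the code used in 2.1.DTWmatrixCalc.ipynb. The function names `dtw_distance` and `dtw_matrix`, the absolute-difference local cost, and the absence of any warping window are all assumptions made for this example; the notebook may use a different library or different settings.

```python
import numpy as np


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW with an absolute-difference local cost.

    Hypothetical helper for illustration; not taken from this repository.
    """
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # step in a only
                                 cost[i, j - 1],      # step in b only
                                 cost[i - 1, j - 1])  # step in both
    return float(cost[n, m])


def dtw_matrix(signals):
    """Symmetric pairwise DTW distance matrix for a list of 1D signals."""
    k = len(signals)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(signals[i], signals[j])
    return dist
```

A matrix of this shape is the kind of precomputed input that a distance-based kNN classifier or k-medoids clustering can consume, which is why it is calculated once and saved.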
/scripts
contains python scripts to regenerate test indices and to run the CNN classification benchmarks
gen_test_indices.py: generate the indices to divide the data between training and testing sets for use in the CNN classifier
1d_cnn.py: training and inference of the 1D CNN classifier
2d_cnn.py: training and inference of the 2D CNN classifier
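To make the idea of train/test indices concrete, here is a minimal sketch of a reproducible random split. It is not the logic of gen_test_indices.py: the function name `split_indices`, the `test_fraction` and `seed` parameters, and the plain random permutation are assumptions for illustration only, and the actual script may stratify by tissue or replicate.

```python
import numpy as np


def split_indices(n_signals, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) as sorted, disjoint index arrays.

    Hypothetical helper; the parameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)   # fixed seed gives a repeatable split
    order = rng.permutation(n_signals)  # shuffled indices 0..n_signals-1
    n_test = int(round(n_signals * test_fraction))
    test_idx = np.sort(order[:n_test])
    train_idx = np.sort(order[n_test:])
    return train_idx, test_idx
```

Saving the indices rather than the split data keeps every classifier evaluated on exactly the same held-out signals.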
First clone the repository:
git clone git@github.com:uwmisl/PorTis.git
Then, set up the Python environment with the packages/dependencies needed to run the code in this repository:
cd PorTis
conda env create -f env.yml
Then, activate the conda environment:
conda activate portis
Ensure that Jupyter Notebook is installed so that you can run the notebooks in /notebooks.
To run any of the notebooks, open them in Jupyter notebook and follow the instructions/comments in the cells.
To run the benchmarks for the signal event classification (Task 1) and "real-time" sample classification (Task 2), run either 1d_cnn.py or 2d_cnn.py from the /scripts directory.
Any combination of the flags --to_train, --to_eval, and --to_sim can be specified, as long as at least one is specified. Running with the --to_train flag will train a new model. Running with the --to_eval flag will evaluate the model on the specified test set; either --to_train must also be specified to train a new model to evaluate, or a --model_path to an already trained model must be provided. Running with the --to_sim flag will run the "real-time" sample classification simulation.
python 1d_cnn.py [-h] [--base_dir BASE_DIR] [--train_test_split_id TRAIN_TEST_SPLIT_ID]
[--length_thresh LENGTH_THRESH] [--to_train] [--to_eval] [--to_sim] [--batch_size BATCH_SIZE]
[--lr LR] [--epochs EPOCHS] [--model_path MODEL_PATH]
python 2d_cnn.py [-h] [--base_dir BASE_DIR] [--train_test_split_id TRAIN_TEST_SPLIT_ID]
[--length_thresh LENGTH_THRESH] [--to_train] [--to_eval] [--to_sim] [--batch_size BATCH_SIZE]
[--lr LR] [--epochs EPOCHS] [--rescale_dim RESCALE_DIM] [--stack_depth STACK_DEPTH]
[--model_path MODEL_PATH]
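The rules for combining --to_train, --to_eval, --to_sim, and --model_path can be sketched with argparse as follows. This is an illustrative sketch, not the scripts' actual argument handling: the helper names `build_parser` and `check_flags` are hypothetical, only four of the flags from the usage above are modeled, and the real scripts may report invalid combinations differently.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the mode flags and --model_path."""
    parser = argparse.ArgumentParser(description="Sketch of the benchmark mode flags")
    parser.add_argument("--to_train", action="store_true")
    parser.add_argument("--to_eval", action="store_true")
    parser.add_argument("--to_sim", action="store_true")
    parser.add_argument("--model_path", default=None)
    return parser


def check_flags(args):
    """Enforce the flag combinations described in the text above."""
    # At least one of the three mode flags must be given.
    if not (args.to_train or args.to_eval or args.to_sim):
        raise ValueError("specify at least one of --to_train, --to_eval, --to_sim")
    # Evaluation needs a model: either one trained in this run or a saved one.
    if args.to_eval and not args.to_train and args.model_path is None:
        raise ValueError("--to_eval requires --to_train or --model_path")
    return args
```

For example, `check_flags(build_parser().parse_args(["--to_train", "--to_eval"]))` passes, while an empty argument list or `["--to_eval"]` alone raises a `ValueError`.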