Development information
Date Created: October 30 2014
Last Update: Mar 4, 2016 by dgrewal
Developer: Diljot Grewal <dgrewal@bccrc.ca>
Input: bam
Output: params.txt, .RData, seg, segs.txt, segs.txt.pygenes, titan.txt
Version: 5.3
The TITAN pipeline accepts a list of tumour-normal pairs of BAM files as input and infers the clonal clusters of events along with estimates of cellular prevalence, normal contamination, and tumour ploidy. The pipeline follows these steps:
1. Identify germline heterozygous SNP positions in the matched normal BAM file. This step is represented by run_mutationseq_TASK_1 in the workflow.
2. Extract the tumour allele read counts from the tumour BAM file at each of the germline heterozygous SNPs from Step 1 (generates input file #1). This step is represented by run_mutationseq_TASK_1 and convert_museq_vcf2counts_TASK_2 in the workflow.
3. Extract the tumour read depth from the tumour BAM file using the HMMcopy suite. Correct GC content and mappability biases using the HMMcopy R package (generates input file #2). This step is represented by the following tasks in the workflow:
   - run_readcounter_TASK_3
   - run_readcounter_TASK_4
   - calc_correctreads_wig_TASK_5
4. Run TitanCNA, including generating figures for chromosome plots. This step is represented by the following tasks in the workflow:
   - run_titan_TASK_6
   - plot_titan_TASK_7
   - calc_cnsegments_titan_TASK_8
   - annot_pygenes_titan_TASK_9
The documentation for Kronos can be found at http://kronos.readthedocs.org/en/latest/.
The pipeline takes a tab-delimited file as input. The header of the file defines the keys, and each row provides the values for these keys for one tumour-normal pair.
An input file for the pipeline should resemble the following:
#sample_id tumour_id tumour_library_id tumour normal_id normal_library_id normal
SA123_A01234_SA123N_A01235 SA123 A01234 /path/to/SA123.bam SA123N A01235 /path/to/SA123N.bam
SA223_A01234_SA223N_A01235 SA223 A01234 /path/to/SA223.bam SA223N A01235 /path/to/SA223N.bam
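A minimal sketch of how such a file could be read and sanity-checked before launching a run. The function name `read_samples` and the checks it performs are illustrative assumptions and are not part of the pipeline; the column names are taken from the example header above.

```python
import csv

# Column names as they appear in the example header above.
EXPECTED_COLUMNS = [
    "sample_id", "tumour_id", "tumour_library_id", "tumour",
    "normal_id", "normal_library_id", "normal",
]


def read_samples(lines):
    """Parse tab-delimited lines into a list of dicts keyed by the header."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    # The header line starts with '#'; strip it before using the keys.
    header[0] = header[0].lstrip("#")
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    samples = []
    for line_no, row in enumerate(reader, start=2):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(
                "line %d has %d fields, expected %d"
                % (line_no, len(row), len(header))
            )
        samples.append(dict(zip(header, row)))
    return samples


# The example rows from above, written with explicit tab separators.
example = (
    "#sample_id\ttumour_id\ttumour_library_id\ttumour\t"
    "normal_id\tnormal_library_id\tnormal\n"
    "SA123_A01234_SA123N_A01235\tSA123\tA01234\t/path/to/SA123.bam\t"
    "SA123N\tA01235\t/path/to/SA123N.bam\n"
    "SA223_A01234_SA223N_A01235\tSA223\tA01234\t/path/to/SA223.bam\t"
    "SA223N\tA01235\t/path/to/SA223N.bam\n"
)
samples = read_samples(example.splitlines())
for sample in samples:
    print(sample["sample_id"], sample["tumour"], sample["normal"])
```

To check a real file, pass an open file handle instead of the example lines, for instance `read_samples(open("/path/to/input.txt"))`.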
The pipeline requires the following:
Software
Package/Program | Version * |
---|---|
python | 2.7.x |
mutationseq | 4.3.7 |
R | 3.1.x or higher |
Python should have the following packages installed:
- sklearn 0.14.1 (Other versions are not supported)
- IntervalTree
- numpy (tested for version 1.7.1 and highly recommended to link against BLAS)
- scipy (tested for version 0.12.0)
- scikits-learn (tested for version 0.14.1)
- matplotlib (tested for version 1.2.1)
- bamtools (tested for version 2.3.0, modified slightly to meet our needs; included with mutationseq)
- boost (version 1.51.0 or higher)
R should have the following packages installed:
Installing mutationSeq:
Mutationseq relies on the pybam library, which must be compiled before you can start running the pipeline. To check whether the library is compatible with your python, run the following:
cd /path/to/pipeline/components/run_mutationseq/component_seed/
python
>>> import pybam
An incompatible pybam library should generate an exception similar to the following:
ImportError: ./pybam.so: undefined symbol: PyUnicodeUCS4_FromEncodedObject
To recompile the pybam library, run the following:
cd /path/to/pipeline/components/run_mutationseq/component_seed/
rm -rf pybam.so
rm -rf build/
make BOOSTPATH=/path/to/boost PYTHON=python
The make command requires python to compile the library and uses the system's default python. Please ensure that the path to your python installation is included in the PATH variable. You can check that your python install is set up properly by running:
which python
The command should point to the python installation that will be used to run the pipeline. Mutationseq documentation can be found here
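The interactive import check above can also be scripted, which makes it easy to confirm which interpreter was actually used. The following is a minimal sketch; the helper name `check_import` is an assumption made for illustration and is not part of mutationseq. For the pybam case it has to be run from the component_seed directory so that pybam.so is found.

```python
import importlib
import sys


def check_import(module_name="pybam"):
    """Try to import a module and return (ok, message)."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # An incompatible pybam.so surfaces here as an ImportError,
        # for example with an "undefined symbol" message.
        return False, "%s failed to import under %s: %s" % (
            module_name, sys.executable, exc)
    return True, "%s imported successfully under %s" % (
        module_name, sys.executable)


ok, message = check_import("pybam")
print(message)
```

If the message reports a failure, recompile the library with the same python that the message names.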
Mutationseq Models:
mutationseq uses different models for the paired and the single mode; both are included with the mutationseq package. The models are pickled with python 2.7.* and sklearn 0.14.1 and should be loaded on a similar setup. Model compatibility can be checked in the python interpreter by running:
python
>>> from sklearn.externals import joblib
>>> _ = joblib.load('/path/to/model.npz')
An incompatible model file will generate an exception similar to one of the following:
TypeError: __cinit__() takes exactly 3 positional arguments (8 given)
AttributeError: 'module' object has no attribute 'BestSplitter'
ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'int'
while an IOError exception would indicate an incorrect path.
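The distinction between the two failure modes can be sketched as a small helper. This is an illustration only: the name `check_model`, its return values, and the optional `loader` argument are assumptions introduced here, and the default loader is the same `joblib.load` call shown above, which is only importable from `sklearn.externals` in the older sklearn releases this pipeline targets.

```python
def check_model(path, loader=None):
    """Classify the outcome of loading a model file.

    Returns 'ok', 'bad_path' or 'incompatible'.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without sklearn.
        from sklearn.externals import joblib
        loader = joblib.load
    try:
        loader(path)
    except IOError:
        # An IOError indicates an incorrect path.
        return "bad_path"
    except (TypeError, AttributeError, ValueError):
        # The exception types listed above for incompatible models.
        return "incompatible"
    return "ok"


# Usage on a machine with the pipeline's sklearn version installed:
#     print(check_model('/path/to/model.npz'))
```

The `loader` argument exists only so the classification logic can be exercised without sklearn installed.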
Reference files and flags
To run the pipeline, you will need to set the following paths and flags in the setup file:
- python: path to the python executable
- mutationseq: path to the mutationseq executable
- R: path to the R executable
- reference: path to the reference genome fasta file
- ld_library_path: specify ld_library_path for the python (set to None if the path is set properly)
- pythonpath: specify path to python's site-packages (set to None if the path is set properly)
- positions_file: path to the positions_file file
- map: path to the map file
- gc: path to the gc file
- gene_sets_gtf: path to the gene_sets_gtf file
- interval_file: path to the interval file (included with the pipeline)
- r_libs: specify R_LIBS for loading the R packages (set to None if set properly or if packages are installed globally)
- genome_type: specify the reference genome type (NCBI or UCSC)
- model: path to the mutationseq model file (model_single_v4.0.2.npz file, included with mutationseq)
- museq_interval_file: set to None if using the NCBI genome, specify path to the interval file included with the pipeline if running on UCSC aligned bam files
- y_threshold: threshold on the required number of calls in y-chromosome to consider it when running TITAN
- target_list: path to the target_list file (required if running on exomes)
- chromosomes: specify the target chromosomes for TITAN.
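The rules stated in the list above can be checked mechanically before a run. The sketch below assumes the settings have already been loaded into a plain dictionary; it does not read the Kronos setup file itself, and the function name `validate_setup` as well as the split into required and optional keys are assumptions based on the descriptions above.

```python
# Keys the list above describes without a "set to None" escape hatch.
REQUIRED_KEYS = [
    "python", "mutationseq", "R", "reference", "positions_file", "map",
    "gc", "gene_sets_gtf", "interval_file", "genome_type", "model",
    "y_threshold", "chromosomes",
]

# Values treated as "not set".
UNSET = (None, "", "None")


def validate_setup(settings):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for key in REQUIRED_KEYS:
        if settings.get(key) in UNSET:
            problems.append("%s is not set" % key)
    genome_type = settings.get("genome_type")
    if genome_type not in UNSET and genome_type not in ("NCBI", "UCSC"):
        problems.append("genome_type must be NCBI or UCSC, got %r" % genome_type)
    # museq_interval_file is only needed for UCSC-aligned BAM files.
    if genome_type == "UCSC" and settings.get("museq_interval_file") in UNSET:
        problems.append("museq_interval_file must be set for UCSC-aligned BAM files")
    return problems


example_settings = {key: "/path/to/" + key for key in REQUIRED_KEYS}
example_settings.update({"genome_type": "UCSC", "museq_interval_file": None})
for problem in validate_setup(example_settings):
    print(problem)
```

The check for target_list is left out because the list above does not say how an exome run is flagged.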
The output files will be saved in:
/path/to/output/directory/{run_id}/{sample_id}/outputs/
The Titan Pipeline generates the following output files:
- {sample_id}_outigv_[0-n].seg.pygenes * : Pygenes annotated IGV compatible segments
- titan_plots/ : Each data point in each track represents a germline heterozygous SNP locus in the TITAN analysis. Three tracks are generated for each plot:
- Copy number alterations (log ratio)
- Loss of heterozygosity (allelic ratio)
- Cellular prevalence and clonal clusters
* n depends on interval file
All final results are stored in the outputs/results/ directory.
- v5.3 fixed a bug in calc_optimal_clusters, updated titan parameter names
- v5.2 switched from pipeline factory to kronos
- v4.6 pipeline suggests an optimal cluster.
- v4.8 added support for new shahlab cluster
- v5.0 performance improvements
For more information, see http://kronos.readthedocs.org/en/latest/ or contact dgrewal@bccrc.ca.