
Commit

doc
Li Tai Fang committed Jul 2, 2018
1 parent f46da8a commit aea6750
Showing 3 changed files with 184 additions and 39 deletions.
Binary file modified docs/Manual.pdf
37 changes: 18 additions & 19 deletions docs/Manual.tex
@@ -67,7 +67,7 @@



-\section{Introduction}
+\section{Introduction} \label{Introduction}

SomaticSeq is a flexible post-somatic-mutation-calling workflow for improved accuracy. It combines the call sets from multiple somatic mutation callers, and then uses machine learning to distinguish true mutations from false positives in that combined call set. We have incorporated the following somatic mutation callers: MuTect/Indelocator/MuTect2, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, Scalpel, Strelka, and TNscope. You may incorporate some or all of those callers into your own pipeline with SomaticSeq.

@@ -126,14 +126,14 @@ \subsection{Docker repos}
\end{itemize}


-\section{To use SomaticSeq.Wrapper.sh}
+\section{To use SomaticSeq.Wrapper.sh} \label{Wrapper_script}

-The SomaticSeq.Wrapper.sh is a wrapper script that calls a series of programs and procedures \textbf{after} you have run your individual somatic mutation callers. In the next section, we will describe the workflow in more detail, so you may not be dependent on this wrapper script. You can either modify this wrapper script or create your own workflow.
+The SomaticSeq.Wrapper.sh is a wrapper script that calls a series of programs and procedures \textbf{after} you have run your individual somatic mutation callers. Section \ref{dockerized_callers} describes how to run the callers that we have dockerized, plus ways to create semi-simulated training data. In the next section, we describe the workflow in this wrapper script in detail, so you need not depend on this wrapper script. You can either modify this wrapper script or create your own workflow in whatever workflow language you want.


\subsection{To train data sets into a classifier}

-To create a trained classifier, ground truth files are required for the data sets. There is also an option to include a list of regions to include and/or exclude. The exclusion or inclusion regions can be VCF or BED files. An inclusion region may be subset of the call sets where you have validated their true/false mutation status, so that only those regions will be used for training. An exclusion region can be regions where the ``truth'' is ambigious.
+To create SomaticSeq classifiers, ground truth VCF files are required for the data sets. There is also an option to supply regions to include and/or exclude, as VCF or BED files. An inclusion region may be a subset of the call sets where you have validated the true/false mutation status, so that only those regions are used for training. An exclusion region can cover regions where the ``truth'' is ambiguous.

All the output VCF files from individual callers are optional. Those VCF files can be bgzipped if they have .vcf.gz extensions. It is imperative that the parameters used for the training and prediction are identical.
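To make the inclusion/exclusion logic concrete, here is a minimal Python sketch (hypothetical helper names, not SomaticSeq's actual code, which operates on VCF/BED files on disk): calls are kept only if they fall inside an inclusion region (when one is given) and outside every exclusion region. Note that BED intervals are 0-based half-open while VCF positions are 1-based, hence the `start < pos <= end` test.

```python
def in_regions(chrom, pos, regions):
    """True if a 1-based VCF position falls inside any 0-based
    half-open BED interval (chrom, start, end)."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)

def filter_calls(calls, include=None, exclude=()):
    """Keep (chrom, pos) calls inside the inclusion regions (if given)
    and outside all exclusion regions."""
    kept = []
    for chrom, pos in calls:
        if include is not None and not in_regions(chrom, pos, include):
            continue
        if in_regions(chrom, pos, exclude):
            continue
        kept.append((chrom, pos))
    return kept
```

For example, with an inclusion region covering chr1:1-1000 and an exclusion region chr1:401-600, a call at chr1:500 is dropped as ambiguous while chr1:100 survives.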

@@ -225,15 +225,16 @@ \subsection{To classify based on simple majority vote}



-\section{The step-by-step SomaticSeq Workflow}
+\section{The step-by-step SomaticSeq Workflow} \label{step_by_step_procedures}

We describe the workflow step by step here, so you may modify it and/or create your own workflow instead of using the wrapper script described previously.


\subsection{Combine the call sets}
We use utilities/getUniqueVcfPositions.py and vcfsorter.pl to combine the VCF files from different callers. The intermediate VCF files were modified to separate SNVs and INDELs.

-VCF modifications were previously done to primarily make them compatible with GATK CombineVariants. We no longer use CombineVariants, so some of them are legacy, but this step also serves to remove some REJECT calls from some callers, in order to reduce their sizes.
+VCF modifications were previously done primarily to make them compatible with GATK CombineVariants. We no longer depend on CombineVariants, so some of the steps are legacy.
\begin{enumerate}

% MuTect and Indelocator
@@ -354,13 +355,11 @@ \subsection{Combine the call sets}

% GATK CombineVariants
\item
-Finally, with the VCF files modified, you may combine them with GATK CombineVariants: one for SNV and one for indel separately. There is no particular reason to use GATK CombineVariants. Other combiners should also work. The only useful thing here is to combine the calls and GATK CombineVariants does it well and pretty fast.
+Finally, with the VCF files modified, you need to combine them: one for SNVs and one for indels separately.

\begin{lstlisting}
-# Combine the VCF files for SNV. Any or all of the VCF files may be present.
-# -nt 6 means to use 6 threads in parallel
-java -jar $PATH/TO/GenomeAnalysisTK.jar -T CombineVariants -R genome.GRCh37.fa -nt 6 --setKey null --genotypemergeoption UNSORTED -V mutect.vcf -V varscan.snp.vcf -V jointsnvmix.vcf -V snp.vardict.vcf -V muse.vcf --out CombineVariants.snp.vcf
-java -jar $PATH/TO/GenomeAnalysisTK.jar -T CombineVariants -R genome.GRCh37.fa -nt 6 --setKey null --genotypemergeoption UNSORTED -V indelocator.vcf -V varscan.snp.vcf -V indel.vardict.vcf --out CombineVariants.indel.vcf
+utilities/getUniqueVcfPositions.py -vcfs mutect.vcf varscan.snp.vcf jointsnvmix.vcf snp.vardict.vcf muse.vcf -out CombineVariants.snp.vcf
\end{lstlisting}
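Conceptually, getUniqueVcfPositions.py plus vcfsorter.pl de-duplicate variant records across callers and sort them in reference contig order. The following Python sketch illustrates that core idea only (hypothetical function name and simplified record handling; it is not the actual implementation):

```python
def combine_unique_positions(vcf_lines_list, contig_order):
    """Merge records from several VCFs, de-duplicated by
    (CHROM, POS, REF, ALT) and sorted in reference contig order."""
    seen = {}
    for lines in vcf_lines_list:
        for line in lines:
            if line.startswith("#"):          # skip header lines
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            # keep the first record observed for each unique variant
            seen.setdefault((chrom, int(pos), ref, alt), line)
    rank = {c: i for i, c in enumerate(contig_order)}   # contig order from the .fasta reference
    keys = sorted(seen, key=lambda k: (rank[k[0]], k[1], k[2], k[3]))
    return [seen[k] for k in keys]
```

Sorting by the reference's contig order (rather than lexicographically) matters because, as noted in the version notes below, SomaticSeq expects VCF files sorted according to the .fasta reference.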


@@ -458,11 +457,11 @@ \subsection{Model Training or Mutation Prediction}
Model training:
\begin{lstlisting}
# Training:
-r_script/ada_model_builder.R Ensemble.sSNV.tsv
-r_script/ada_model_builder.R Ensemble.sINDEL.tsv
+r_scripts/ada_model_builder_ntChange.R Ensemble.sSNV.tsv Consistent_Mates Inconsistent_Mates
+r_scripts/ada_model_builder_ntChange.R Ensemble.sINDEL.tsv Strelka_QSS Strelka_TQSS Consistent_Mates Inconsistent_Mates
\end{lstlisting}

-Ensemble.sSNV.tsv.Classifier.RData and Ensemble.sINDEL.tsv.Classifier.RData will be created from model training.
+Ensemble.sSNV.tsv.Classifier.RData and Ensemble.sINDEL.tsv.Classifier.RData will be created from model training. The arguments after Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv tell the builder script to ignore those features in training. These features did not improve accuracy in our data sets (mostly WGS data), but they may help in other data sets.
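In effect, "ignoring a feature" just means dropping those columns from the feature table before the model is fit. A minimal Python sketch of that preprocessing step (hypothetical function; the actual builder is an R script):

```python
import csv
import io

def load_features(tsv_text, ignore=()):
    """Parse an Ensemble TSV and drop the feature columns named in
    `ignore`, analogous to listing feature names after the TSV file
    on the builder's command line."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{k: v for k, v in row.items() if k not in ignore} for row in rows]
```

Trailing feature names on the command line thus shrink the feature set without editing the TSV itself, so the same file can be reused with different feature subsets.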


Mutation prediction:
@@ -474,22 +473,22 @@ \subsection{Model Training or Mutation Prediction}
\end{lstlisting}


-After mutation prediction, if you feel like it, you may convert Trained.sSNV.tsv and Trained.sINDEL.tsv into VCF files. Use -tools to list ONLY the individual tools used to have appropriately annotated VCF files. Accepted tools are CGA (for MuTect/Indelocator), VarScan2, JointSNVMix2, SomaticSniper, VarDict, MuSE, LoFreq, and/or Scalpel. To list a tool without having run it, the VCF will be annotated as if the tool was run but did not identify that position as a somatic variant.
+After mutation prediction, you may optionally convert Trained.sSNV.tsv and Trained.sINDEL.tsv into VCF files. Use -tools to list ONLY the individual tools you actually ran, so the VCF files are annotated appropriately. Accepted tools are MuTect2/MuTect/Indelocator, VarScan2, JointSNVMix2, SomaticSniper, VarDict, MuSE, LoFreq, Scalpel, Strelka, and/or TNscope. If you list a tool you did not run, the VCF will be annotated as if the tool was run but did not identify that position as a somatic variant, which is probably undesirable.


\begin{lstlisting}
# Probability above 0.7 labeled PASS (-pass 0.7), and between 0.1 and 0.7 labeled LowQual (-low 0.1):
# Use -all to include REJECT calls in the VCF file
# Use -phred to convert probability values (between 0 to 1) into Phred scale in the QUAL column in the VCF file

-SSeq_tsv2vcf.py -tsv Trained.sSNV.tsv -vcf Trained.sSNV.vcf -pass 0.7 -low 0.1 -tools CGA VarScan2 JointSNVMix2 SomaticSniper VarDict MuSE LoFreq Strelka -all -phred
+SSeq_tsv2vcf.py -tsv Trained.sSNV.tsv -vcf Trained.sSNV.vcf -pass 0.7 -low 0.1 -tools MuTect2 VarScan2 JointSNVMix2 SomaticSniper VarDict MuSE LoFreq Strelka -all -phred

-SSeq_tsv2vcf.py -tsv Trained.sINDEL.tsv -vcf Trained.sINDEL.vcf -pass 0.7 -low 0.1 -tools CGA VarScan2 VarDict LoFreq Scalpel Strelka -all -phred
+SSeq_tsv2vcf.py -tsv Trained.sINDEL.tsv -vcf Trained.sINDEL.vcf -pass 0.7 -low 0.1 -tools MuTect2 VarScan2 VarDict LoFreq Scalpel Strelka -all -phred
\end{lstlisting}
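The -pass/-low thresholds and the -phred option can be sketched as follows in Python. This is an illustration, not SSeq_tsv2vcf.py's actual code; in particular, the exact boundary behavior (strict vs. non-strict comparison) and any cap on the QUAL value are assumptions here.

```python
import math

def phred_scaled(p, cap=255):
    """Convert a somatic probability into a Phred-scale QUAL:
    -10 * log10(1 - p), capped to avoid infinity as p -> 1."""
    if p >= 1:
        return cap
    return min(cap, -10 * math.log10(1 - p))

def filter_label(p, pass_cut=0.7, low_cut=0.1):
    """Assign the VCF FILTER label from the classifier probability,
    mirroring -pass 0.7 and -low 0.1."""
    if p > pass_cut:
        return "PASS"
    if p > low_cut:
        return "LowQual"
    return "REJECT"
```

With -all, even REJECT calls are written to the VCF; with -phred, the probability goes into the QUAL column on this log scale instead of as a raw value between 0 and 1.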



-\section{To run the dockerized somatic mutation callers}
+\section{To run the dockerized somatic mutation callers} \label{dockerized_callers}

For your convenience, we have created a couple of scripts that can generate run scripts for the dockerized somatic mutation callers.

@@ -1077,7 +1076,7 @@ \subsection{Version 2.8.0}
\begin{itemize}

\item
-The program will now throw an exception and crash if the VCF file(s) are not sorted according to the .fasta reference file.
+The program is now designed to crash if the VCF file(s) are not sorted according to the .fasta reference file.

\end{itemize}

