
Commit

doc
Li Tai Fang committed Jul 2, 2018
1 parent f46da8a commit aea6750
Showing 3 changed files with 184 additions and 39 deletions.
Binary file modified docs/Manual.pdf
37 changes: 18 additions & 19 deletions docs/Manual.tex
@@ -67,7 +67,7 @@



-\section{Introduction}
+\section{Introduction} \label{Introduction}

SomaticSeq is a flexible post-somatic-mutation-calling workflow for improved accuracy. It combines the call sets from multiple somatic mutation callers, and then uses machine learning to distinguish true mutations from false positives in that combined call set. We have incorporated the following somatic mutation callers: MuTect/Indelocator/MuTect2, VarScan2, JointSNVMix, SomaticSniper, VarDict, MuSE, LoFreq, Scalpel, Strelka, and TNscope. You may incorporate some or all of those callers into your own pipeline with SomaticSeq.

@@ -126,14 +126,14 @@ \subsection{Docker repos}
\end{itemize}


-\section{To use SomaticSeq.Wrapper.sh}
+\section{To use SomaticSeq.Wrapper.sh} \label{Wrapper_script}

-The SomaticSeq.Wrapper.sh is a wrapper script that calls a series of programs and procedures \textbf{after} you have run your individual somatic mutation callers. In the next section, we will describe the workflow in more detail, so you may not be dependent on this wrapper script. You can either modify this wrapper script or create your own workflow.
+The SomaticSeq.Wrapper.sh is a wrapper script that calls a series of programs and procedures \textbf{after} you have run your individual somatic mutation callers. Section \ref{dockerized_callers} describes how to run the callers that we have dockerized, plus ways to create semi-simulated training data. In the next section, we describe the workflow in this wrapper script in detail, so you need not depend on this wrapper script. You can either modify this wrapper script or create your own workflow in whatever workflow language you want.


\subsection{To train data sets into a classifier}

-To create a trained classifier, ground truth files are required for the data sets. There is also an option to include a list of regions to include and/or exclude. The exclusion or inclusion regions can be VCF or BED files. An inclusion region may be subset of the call sets where you have validated their true/false mutation status, so that only those regions will be used for training. An exclusion region can be regions where the ``truth'' is ambigious.
+To create SomaticSeq classifiers, ground truth VCF files are required for the data sets. There is also an option to supply regions to include and/or exclude, as VCF or BED files. An inclusion region may be a subset of the call sets where you have validated the true/false mutation status, so that only those regions are used for training. An exclusion region can cover regions where the ``truth'' is ambiguous.

All the output VCF files from individual callers are optional. Those VCF files can be bgzipped if they have .vcf.gz extensions. It is imperative that the parameters used for the training and prediction are identical.
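To make the inclusion/exclusion logic concrete, here is a minimal Python sketch (hypothetical helper names, not SomaticSeq's actual code, which operates on VCF/BED files on disk): calls are kept only if they fall inside an inclusion region (when one is given) and outside every exclusion region. Note that BED intervals are 0-based half-open while VCF positions are 1-based, hence the `start < pos <= end` test.

```python
def in_regions(chrom, pos, regions):
    """True if a 1-based VCF position falls inside any 0-based
    half-open BED interval (chrom, start, end)."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)

def filter_calls(calls, include=None, exclude=()):
    """Keep (chrom, pos) calls inside the inclusion regions (if given)
    and outside all exclusion regions."""
    kept = []
    for chrom, pos in calls:
        if include is not None and not in_regions(chrom, pos, include):
            continue
        if in_regions(chrom, pos, exclude):
            continue
        kept.append((chrom, pos))
    return kept
```

For example, with an inclusion region covering chr1:1-1000 and an exclusion region chr1:401-600, a call at chr1:500 is dropped as ambiguous while chr1:100 survives.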

@@ -225,15 +225,16 @@ \subsection{To classify based on simple majority vote}



-\section{The step-by-step SomaticSeq Workflow}
+\section{The step-by-step SomaticSeq Workflow} \label{step_by_step_procedures}

We describe the workflow step by step here, so you may modify it and/or create your own workflow instead of using the wrapper script described previously.


\subsection{Combine the call sets}
We use utilities/getUniqueVcfPositions.py and vcfsorter.pl to combine the VCF files from different callers. The intermediate VCF files were modified to separate SNVs and INDELs.

-VCF modifications were previously done to primarily make them compatible with GATK CombineVariants. We no longer use CombineVariants, so some of them are legacy, but this step also serves to remove some REJECT calls from some callers, in order to reduce their sizes.
+VCF modifications were previously done primarily to make them compatible with GATK CombineVariants. We no longer depend on CombineVariants, so some of the steps are legacy.
\begin{enumerate}

% MuTect and Indelocator
@@ -354,13 +355,11 @@ \subsection{Combine the call sets}

% GATK CombineVariants
\item
-Finally, with the VCF files modified, you may combine them with GATK CombineVariants: one for SNV and one for indel separately. There is no particular reason to use GATK CombineVariants. Other combiners should also work. The only useful thing here is to combine the calls and GATK CombineVariants does it well and pretty fast.
+Finally, with the VCF files modified, you need to combine them: one for SNVs and one for indels separately.

\begin{lstlisting}
-# Combine the VCF files for SNV. Any or all of the VCF files may be present.
-# -nt 6 means to use 6 threads in parallel
-java -jar $PATH/TO/GenomeAnalysisTK.jar -T CombineVariants -R genome.GRCh37.fa -nt 6 --setKey null --genotypemergeoption UNSORTED -V mutect.vcf -V varscan.snp.vcf -V jointsnvmix.vcf -V snp.vardict.vcf -V muse.vcf --out CombineVariants.snp.vcf
-java -jar $PATH/TO/GenomeAnalysisTK.jar -T CombineVariants -R genome.GRCh37.fa -nt 6 --setKey null --genotypemergeoption UNSORTED -V indelocator.vcf -V varscan.snp.vcf -V indel.vardict.vcf --out CombineVariants.indel.vcf
+utilities/getUniqueVcfPositions.py -vcfs mutect.vcf varscan.snp.vcf jointsnvmix.vcf snp.vardict.vcf muse.vcf -out CombineVariants.snp.vcf
\end{lstlisting}
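Conceptually, getUniqueVcfPositions.py plus vcfsorter.pl de-duplicate variant records across callers and sort them in reference contig order. The following Python sketch illustrates that core idea only (hypothetical function name and simplified record handling; it is not the actual implementation):

```python
def combine_unique_positions(vcf_lines_list, contig_order):
    """Merge records from several VCFs, de-duplicated by
    (CHROM, POS, REF, ALT) and sorted in reference contig order."""
    seen = {}
    for lines in vcf_lines_list:
        for line in lines:
            if line.startswith("#"):          # skip header lines
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            # keep the first record observed for each unique variant
            seen.setdefault((chrom, int(pos), ref, alt), line)
    rank = {c: i for i, c in enumerate(contig_order)}   # contig order from the .fasta reference
    keys = sorted(seen, key=lambda k: (rank[k[0]], k[1], k[2], k[3]))
    return [seen[k] for k in keys]
```

Sorting by the reference's contig order (rather than lexicographically) matters because, as noted in the version notes below, SomaticSeq expects VCF files sorted according to the .fasta reference.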


@@ -458,11 +457,11 @@ \subsection{Model Training or Mutation Prediction}
Model training:
\begin{lstlisting}
# Training:
-r_script/ada_model_builder.R Ensemble.sSNV.tsv
-r_script/ada_model_builder.R Ensemble.sINDEL.tsv
+r_scripts/ada_model_builder_ntChange.R Ensemble.sSNV.tsv Consistent_Mates Inconsistent_Mates
+r_scripts/ada_model_builder_ntChange.R Ensemble.sINDEL.tsv Strelka_QSS Strelka_TQSS Consistent_Mates Inconsistent_Mates
\end{lstlisting}

-Ensemble.sSNV.tsv.Classifier.RData and Ensemble.sINDEL.tsv.Classifier.RData will be created from model training.
+Ensemble.sSNV.tsv.Classifier.RData and Ensemble.sINDEL.tsv.Classifier.RData will be created from model training. The arguments after Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv tell the builder script to ignore those features in training. These features did not improve accuracy in our data sets (mostly WGS data), but they may help in other data sets.
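In effect, "ignoring a feature" just means dropping those columns from the feature table before the model is fit. A minimal Python sketch of that preprocessing step (hypothetical function; the actual builder is an R script):

```python
import csv
import io

def load_features(tsv_text, ignore=()):
    """Parse an Ensemble TSV and drop the feature columns named in
    `ignore`, analogous to listing feature names after the TSV file
    on the builder's command line."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{k: v for k, v in row.items() if k not in ignore} for row in rows]
```

Trailing feature names on the command line thus shrink the feature set without editing the TSV itself, so the same file can be reused with different feature subsets.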


Mutation prediction:
@@ -474,22 +473,22 @@ \subsection{Model Training or Mutation Prediction}
\end{lstlisting}


-After mutation prediction, if you feel like it, you may convert Trained.sSNV.tsv and Trained.sINDEL.tsv into VCF files. Use -tools to list ONLY the individual tools used to have appropriately annotated VCF files. Accepted tools are CGA (for MuTect/Indelocator), VarScan2, JointSNVMix2, SomaticSniper, VarDict, MuSE, LoFreq, and/or Scalpel. To list a tool without having run it, the VCF will be annotated as if the tool was run but did not identify that position as a somatic variant.
+After mutation prediction, you may optionally convert Trained.sSNV.tsv and Trained.sINDEL.tsv into VCF files. Use -tools to list ONLY the individual tools you actually ran, so the VCF files are annotated appropriately. Accepted tools are MuTect2/MuTect/Indelocator, VarScan2, JointSNVMix2, SomaticSniper, VarDict, MuSE, LoFreq, Scalpel, Strelka, and/or TNscope. If you list a tool you did not run, the VCF will be annotated as if the tool was run but did not identify that position as a somatic variant, which is probably undesirable.


\begin{lstlisting}
# Probability above 0.7 labeled PASS (-pass 0.7), and between 0.1 and 0.7 labeled LowQual (-low 0.1):
# Use -all to include REJECT calls in the VCF file
# Use -phred to convert probability values (between 0 to 1) into Phred scale in the QUAL column in the VCF file

-SSeq_tsv2vcf.py -tsv Trained.sSNV.tsv -vcf Trained.sSNV.vcf -pass 0.7 -low 0.1 -tools CGA VarScan2 JointSNVMix2 SomaticSniper VarDict MuSE LoFreq Strelka -all -phred
+SSeq_tsv2vcf.py -tsv Trained.sSNV.tsv -vcf Trained.sSNV.vcf -pass 0.7 -low 0.1 -tools MuTect2 VarScan2 JointSNVMix2 SomaticSniper VarDict MuSE LoFreq Strelka -all -phred

-SSeq_tsv2vcf.py -tsv Trained.sINDEL.tsv -vcf Trained.sINDEL.vcf -pass 0.7 -low 0.1 -tools CGA VarScan2 VarDict LoFreq Scalpel Strelka -all -phred
+SSeq_tsv2vcf.py -tsv Trained.sINDEL.tsv -vcf Trained.sINDEL.vcf -pass 0.7 -low 0.1 -tools MuTect2 VarScan2 VarDict LoFreq Scalpel Strelka -all -phred
\end{lstlisting}
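The -pass/-low thresholds and the -phred option can be sketched as follows in Python. This is an illustration, not SSeq_tsv2vcf.py's actual code; in particular, the exact boundary behavior (strict vs. non-strict comparison) and any cap on the QUAL value are assumptions here.

```python
import math

def phred_scaled(p, cap=255):
    """Convert a somatic probability into a Phred-scale QUAL:
    -10 * log10(1 - p), capped to avoid infinity as p -> 1."""
    if p >= 1:
        return cap
    return min(cap, -10 * math.log10(1 - p))

def filter_label(p, pass_cut=0.7, low_cut=0.1):
    """Assign the VCF FILTER label from the classifier probability,
    mirroring -pass 0.7 and -low 0.1."""
    if p > pass_cut:
        return "PASS"
    if p > low_cut:
        return "LowQual"
    return "REJECT"
```

With -all, even REJECT calls are written to the VCF; with -phred, the probability goes into the QUAL column on this log scale instead of as a raw value between 0 and 1.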



-\section{To run the dockerized somatic mutation callers}
+\section{To run the dockerized somatic mutation callers} \label{dockerized_callers}

For your convenience, we have created a couple of scripts that can generate run scripts for the dockerized somatic mutation callers.

@@ -1077,7 +1076,7 @@ \subsection{Version 2.8.0}
\begin{itemize}

\item
-The program will now throw an exception and crash if the VCF file(s) are not sorted according to the .fasta reference file.
+The program is now designed to crash if the VCF file(s) are not sorted according to the .fasta reference file.

\end{itemize}

