This repository has been archived by the owner on Jan 27, 2020. It is now read-only.

Sarek - output delivery

This document describes the output delivery directory structure.

The results are split into four top-level directories: Annotation, Preprocessing, Reports and VariantCalling. Each of the four can have sub-directories containing results from different software.
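As a quick check of a delivery, the layout described above can be listed programmatically. The following is only a minimal sketch: the function name `summarize_delivery` and the idea of passing the delivery root as an argument are illustrative assumptions, not part of Sarek.

```python
from pathlib import Path

# The four top-level sections named in this document.
SECTIONS = ("Annotation", "Preprocessing", "Reports", "VariantCalling")


def summarize_delivery(root):
    """Map each section to the sorted names of its sub-directories.

    Sections missing from `root` map to None, so an incomplete
    delivery is easy to spot.
    """
    root = Path(root)
    summary = {}
    for section in SECTIONS:
        section_dir = root / section
        if not section_dir.is_dir():
            summary[section] = None
            continue
        summary[section] = sorted(
            child.name for child in section_dir.iterdir() if child.is_dir()
        )
    return summary
```

Calling `summarize_delivery("path/to/delivery")` returns a dictionary with one key per section, which can be printed or compared against the sub-directories you expect.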

Annotation:

This directory contains the results of the final annotation steps. Two tools are used for annotation: VEP and SnpEff. Only a subset of the VCF files is annotated, and only variants that have a PASS filter. FreeBayes results are not annotated yet, as we currently lack a decent somatic filter. For HaplotypeCaller, the germline variants are annotated for both the tumour and the normal sample.

Every annotated VCF has an ann.vcf extension and an associated summary HTML file.

SnpEff

SnpEff can annotate many sorts of variants, not only SNPs, and uses multiple databases for its annotations. Besides the annotated VCF files, SnpEff writes a summary in HTML and CSV, as well as a text file listing the affected genes with the actual changes and their impact. The generated VCF header contains the software version and the command line used.

The annotations added in cancer mode are very rich; Sarek runs the software in single-sample mode. VCF files containing germline calls are annotated in the regular mode of SnpEff.

VEP

The Variant Effect Predictor is based on Ensembl and can determine the effects of all sorts of variants, including SNPs, indels, structural variants and CNVs. Some Manta VCF files do not always make it through the VEP filtering, though, so annotations can be missing for these variant calls.

The HTML summary files show general statistics and quality-related measures. The header of each annotated VCF file records, in the "VEP" line, the VEP/Ensembl version used for annotation together with the version numbers of additional databases such as ClinVar or dbSNP. The format of the consequence annotations is also given in the VCF header, in the line describing the INFO field. At the moment it contains:

  • Consequence: impact of the variation, if there is any
  • Codons: the codon change, e.g. cGt/cAt
  • Amino_acids: change in amino acids, if there is any, e.g. R/H
  • Gene: Ensembl gene ID
  • SYMBOL: gene symbol
  • Feature: actual transcript name
  • EXON: affected exon
  • PolyPhen: prediction based on PolyPhen
  • SIFT: prediction by SIFT
  • Protein_position: relative position of the amino acid in the protein
  • BIOTYPE: biotype of the transcript or regulatory feature
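Because the order of these sub-fields is declared in the VCF header, it is safer to read it from there than to hard-code it. The sketch below assumes that the annotations are stored under VEP's default INFO key `CSQ` and that the header description contains a `Format: ...` list separated by `|`; both function names are illustrative.

```python
import re


def csq_fields(header_lines):
    """Return the CSQ sub-field names declared in the VCF header, in order."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ,"):
            # The description ends with: Format: Allele|Consequence|...">
            match = re.search(r'Format: ([^"]+)"', line)
            if match:
                return match.group(1).strip().split("|")
    raise ValueError("no CSQ format description found in the header")


def parse_csq(info, fields):
    """Turn the CSQ entry of one INFO column into a list of dicts.

    Each dict describes one annotation (typically one per transcript).
    Returns an empty list when the record carries no CSQ entry.
    """
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            return [
                dict(zip(fields, annotation.split("|")))
                for annotation in entry[len("CSQ="):].split(",")
            ]
    return []
```

With the field list from `csq_fields`, each annotation returned by `parse_csq` can be queried by name, for example `annotation["Consequence"]` or `annotation["SYMBOL"]`, provided those fields are present in the header.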

Preprocessing:

Preprocessing follows the GATK Best Practices to produce the aligned BAM files used for whole-genome germline analysis.

DuplicateMarked:

This directory holds the BAM files delivered to users. Besides the duplicate-marked files, the recalibration tables (*.recal.table) are also stored here; they can be used to create base-recalibrated files. A .tsv file is also generated automatically, which Sarek can use for further processing and/or variant calling.

The BAM file headers contain the details of the actual command-line arguments used for mapping and merging. Use samtools view -H <filename> to view the reference used, the read groups, etc.
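The header text printed by samtools view -H is tab-separated and easy to inspect programmatically. The following is a minimal sketch of such a parser; the function name `parse_sam_header` is illustrative, and the header text is assumed to have been captured beforehand.

```python
def parse_sam_header(text):
    """Group SAM header lines by record type (HD, SQ, RG, PG, CO).

    Every line except @CO becomes a dict of its TAG:VALUE pairs;
    @CO comment lines are kept as plain strings.
    """
    records = {}
    for line in text.splitlines():
        if not line.startswith("@"):
            continue
        record_type, _, rest = line[1:].partition("\t")
        if record_type == "CO":
            records.setdefault("CO", []).append(rest)
            continue
        tags = {}
        for field in rest.split("\t"):
            # Split on the first colon only, so values that contain
            # colons (such as command lines) stay intact.
            key, sep, value = field.partition(":")
            if sep:
                tags[key] = value
        records.setdefault(record_type, []).append(tags)
    return records
```

One way to obtain the text, assuming samtools is on the PATH, is `subprocess.run(["samtools", "view", "-H", bam_path], capture_output=True, text=True, check=True).stdout`. The read groups are then available under `records["RG"]` and the recorded command lines under the `CL` tag of each entry in `records["PG"]`.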

Recalibrated:

This directory is usually empty: it is the location for the final recalibrated files of the preprocessing pipeline, and recalibrated BAMs are usually 2-3 times larger than the duplicate-marked files. To regenerate the recalibrated BAMs, apply the recalibration table delivered in the DuplicateMarked directory, either by running Sarek or by doing the recalibration step yourself.
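If you do the recalibration step yourself, one option is GATK4's ApplyBQSR tool, which takes a BAM file and a recalibration table. The sketch below only builds and prints the command line. It assumes that GATK4 is installed as `gatk` and that the table was produced by a compatible GATK version; all file names are hypothetical placeholders, and the function name is illustrative.

```python
import shlex


def apply_bqsr_command(reference, input_bam, recal_table, output_bam):
    """Build the argument list for GATK4's ApplyBQSR tool."""
    return [
        "gatk", "ApplyBQSR",
        "-R", str(reference),                    # reference FASTA used for mapping
        "-I", str(input_bam),                    # duplicate-marked BAM
        "--bqsr-recal-file", str(recal_table),   # delivered *.recal.table
        "-O", str(output_bam),                   # recalibrated BAM to write
    ]


# Hypothetical file names, for illustration only.
command = apply_bqsr_command(
    "reference.fasta", "sample.md.bam", "sample.recal.table", "sample.recal.bam"
)
print(" ".join(shlex.quote(argument) for argument in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the paths point at real files. The reference must be the same one recorded in the BAM header.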


Reports:

The Reports directory collects the output of the different quality control (QC) software. Going through these files helps to decide whether the sequencing and the workflow were successful, or whether further steps are needed to get meaningful results. The main entry point is the MultiQC directory: its HTML index file aggregates and visualizes the output of all the software used for QC.

MultiQC

To assess the quality of the sequencing and the workflow, the best place to start is the Reports/MultiQC/multiqc_report.html file, which should present the statistics and graphics of all the software below. The graphs and tables are configurable, and generally much easier to read than the raw output of the individual software. The individual QC components are:

  • bamQC: Qualimap examines sequencing alignment data in SAM/BAM files according to the features of the mapped reads, and provides an overall view of the data together with quality control statistics about the aligned BAM files.
  • BCFToolsStats: bcftools statistics of VCF files: non-reference allele frequency, depth distribution, stats by quality, per-sample counts, singleton stats, etc.
  • FastQC: provides statistics about the raw FASTQ files only.
  • MarkDuplicates: a Picard tool to tag PCR/optical duplicates from aligned BAM data.
  • SamToolsStats: statistics collected by samtools from BAM files.

VariantCalling:

All the raw variant-calling results are collected in this directory. Not all the software below produces VCF files, and both somatic and germline variants are stored here.

  • Ascat: a method to derive copy number profiles of tumour cells, accounting for normal cell admixture and tumour aneuploidy. This directory contains the graphical output of the software, together with the CNV, ploidy and sample purity estimates.
  • FreeBayes: Bayesian haplotype-based genetic polymorphism discovery and genotyping. The single VCF file generated by FreeBayes is huge; it is recommended to flatten and filter it, e.g. using the provided SpeedSeq filter.
  • HaplotypeCaller: the germline caller developed in-house at the Broad Institute. The non-recalibrated variant files are there to check the germline variants and to compare the two samples (tumour and normal) for a possible mix-up.
  • HaplotypeCallerGVCF: germline calls in gVCF format even for the tumour sample: this format makes possible the joint analysis of a cohort.
  • Manta: a structural variant caller supported by Illumina. There are several output files, corresponding to germline (diploid) calls, candidate calls and somatic calls. Manta also provides a candidate list of small indels that can be fed to Strelka.
  • MuTect2: the current somatic caller of GATK for both SNPs and indels. It is recommended to keep only lines with the "PASS" filter.
  • Strelka2: a somatic SNP and indel caller supported by Illumina. Strelka gives filtered and unfiltered calls for SNPs and indels separately, together with germline calls.
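Keeping only the records with the "PASS" filter, as recommended for MuTect2 above, needs nothing more than a check on the seventh VCF column. The following is a minimal sketch for uncompressed VCF text; the function name `keep_pass` is illustrative.

```python
def keep_pass(lines):
    """Yield header lines and the records whose FILTER column is PASS."""
    for line in lines:
        if line.startswith("#"):
            # Keep the whole header so the output is still a valid VCF.
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        # FILTER is the seventh of the fixed VCF columns.
        if len(columns) > 6 and columns[6] == "PASS":
            yield line
```

The function accepts any iterable of lines, so it can be fed an open file handle directly. For compressed files, opening them with `gzip.open(path, "rt")` yields text lines in the same way.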