Pre_Variant_Filtering

Chaochih Liu edited this page Mar 4, 2021 · 4 revisions

Basic Usage

The Pre_Variant_Filtering handler helps users choose cutoffs for filtering the VCF file in the Variant_Filtering handler. It plots the distributions of annotations in the VCF file (e.g., QUAL, DP, SOR, MQRankSum) and generates per-sample percentile tables for GQ and DP. Run this handler before Variant_Filtering, especially if you ran the Variant_Recalibrator handler: sites that did not pass the recalibration model are removed here, leaving only PASS sites.

./sequence_handling Pre_Variant_Filtering /path/to/Config

Currently, annotation plots use a set of default cutoffs. In the future, we hope to allow users to specify the cutoffs used in the visualization.

Handler-Specific Variables

The following is a list of variables that must be defined within Config. In addition to these handler-specific variables, all common variables must be defined.

| Variable | Function |
| --- | --- |
| PVF_QUEUE | Queue used for batch submission. If using Slurm, users can specify multiple comma-separated partitions (e.g., "small,ram256g,ram1t"). If using PBS, specify one queue (e.g., "small"). |
| PVF_SBATCH | If using Slurm, the Slurm settings for batch submission. Delete if not using Slurm. Recommended settings are "--nodes=1 --ntasks-per-node=16 --mem=110gb --tmp=22gb -t 72:00:00 --mail-type=ALL --mail-user=${EMAIL} -p ${PVF_QUEUE}". |
| PVF_QSUB | If using PBS, the QSub settings for batch submission. Delete if not using PBS. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
| PVF_VCF | The full filepath to the VCF file used to graph annotations and generate percentile tables. If you ran the Variant_Recalibrator handler, this will be the *.recalibrated.vcf.gz file. |
| PVF_GEN_NUM | The number of genomic regions to subset for graphing annotations. This step is necessary for large VCF files (e.g., >500GB). Default is 1000000. |
| PVF_GEN_LEN | The length (bp) of the genomic regions to subset for graphing annotations. Default is 100. |
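
As an illustration, the handler-specific part of Config might look like the following on a Slurm system. This is a sketch, not a copy of the shipped Config: the VCF path and email address are placeholders, and EMAIL is assumed to be set together with the common variables. The remaining values are the recommended settings and defaults described above.

```shell
# Hypothetical Config excerpt for a Slurm cluster (path and email are placeholders)
EMAIL="user@example.com"   # assumed to be defined with the common variables

# Comma-separated partitions are allowed on Slurm
PVF_QUEUE="small,ram256g,ram1t"
PVF_SBATCH="--nodes=1 --ntasks-per-node=16 --mem=110gb --tmp=22gb -t 72:00:00 --mail-type=ALL --mail-user=${EMAIL} -p ${PVF_QUEUE}"
# PVF_QSUB is left out because this example does not use PBS

# Placeholder path; after Variant_Recalibrator this is the *.recalibrated.vcf.gz file
PVF_VCF="/path/to/project.recalibrated.vcf.gz"
PVF_GEN_NUM=1000000   # number of genomic regions to subset (default)
PVF_GEN_LEN=100       # length of each region in bp (default)
```

Because PVF_SBATCH expands ${PVF_QUEUE}, PVF_QUEUE needs to be assigned before PVF_SBATCH in the file.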

Output

Pre_Variant_Filtering generates plots of annotation distributions, percentile tables, and a VCF containing only "PASS" sites (if users ran Variant_Recalibrator). The VCF file and the percentile tables can be found, respectively, at:

${OUT_DIR}/Pre_Variant_Filtering
${OUT_DIR}/Pre_Variant_Filtering/Percentile_Tables
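
To make the idea of a percentile table concrete, here is a minimal sketch that computes nearest-rank percentiles for a column of DP values using sort and awk. It is not the handler's implementation (the handler uses R, whose quantile() function interpolates by default, so values can differ slightly); the percentile function name and the toy DP values are made up for illustration.

```shell
# Nearest-rank percentile of numeric values read from stdin, one per line.
# Usage: percentile P   (P between 0 and 100)
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            # nearest-rank: smallest rank r such that r >= (p / 100) * N
            r = int(p / 100 * NR)
            if (r < p / 100 * NR) r++
            if (r < 1) r = 1
            print v[r]
        }'
}

# Toy DP values for one hypothetical sample
dp_values="12 30 7 18 25 41 9 22 15 33"

# Print a small two-column table: percentile, DP value
for p in 5 25 50 75 95; do
    printf '%s\t%s\n' "$p" "$(printf '%s\n' $dp_values | percentile "$p")"
done
```

A table like this, built per sample for GQ and DP, shows which depth or genotype-quality cutoff would discard, say, the lowest 5% of genotypes.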

The plots generated include:

  1. Distributions of common GATK variant annotations (i.e., Quality by Depth, Strand Odds Ratio, Mapping Quality Rank Sum Test, Depth, Fisher Strand, Mapping Quality, Read Position Rank Sum Test, and Quality Score)
  2. Per-site and per-individual missingness
  3. Heterozygosity, Excess Heterozygosity, and Inbreeding Coefficients
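
For readers unfamiliar with the missingness statistic, the sketch below computes per-individual missingness (the fraction of sites at which a sample has no genotype call) from a toy VCF using awk. It only illustrates the definition; the toy_vcf and indiv_missingness names and the sample data are hypothetical, and the handler itself relies on vcftools for this step.

```shell
# Toy VCF: two hypothetical samples (S1, S2) and four sites; "./." is a missing genotype
toy_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
    printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:12\t0/1:9\n'
    printf 'chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t./.:0\t./.:0\n'
    printf 'chr1\t300\t.\tG\tA\t45\tPASS\t.\tGT:DP\t0/1:20\t1/1:15\n'
    printf 'chr1\t400\t.\tT\tC\t70\tPASS\t.\tGT:DP\t1/1:18\t./.:0\n'
}

# Read an uncompressed VCF on stdin and print, per sample:
# sample name, number of sites, number of missing genotypes, fraction missing
indiv_missingness() {
    awk -F'\t' '
        /^##/ { next }                      # skip meta-information lines
        /^#CHROM/ {                         # header line: sample names start at column 10
            for (i = 10; i <= NF; i++) name[i] = $i
            n = NF
            next
        }
        {
            sites++
            for (i = 10; i <= n; i++) {
                split($i, f, ":")           # GT is the first FORMAT field here
                if (f[1] == "./." || f[1] == ".|." || f[1] == ".") miss[i]++
            }
        }
        END {
            for (i = 10; i <= n; i++)
                printf "%s\t%d\t%d\t%.4f\n", name[i], sites, miss[i], (sites ? miss[i] / sites : 0)
        }'
}

toy_vcf | indiv_missingness
```

On a real VCF, `vcftools --gzvcf in.vcf.gz --missing-indv` and `--missing-site` write comparable per-individual (.imiss) and per-site (.lmiss) reports.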

Dependencies

Pre_Variant_Filtering depends on the following software:

| Dependency | Use |
| --- | --- |
| GATK 4.1.2 | Used for 1) selecting "PASS" sites only from a recalibrated VCF file and 2) creating a variant table for plotting annotations. |
| R 3.6.3 (or greater) | Used for 1) plotting annotation distributions, 2) plotting individual and site missingness, and 3) generating percentile tables. Requires R packages: ggplot2, gridExtra, and bigmemory. |
| bedtools 2.17.0 (or greater) | Used for creating an intervals file that randomly subsets the genome into ${gen_num} regions of ${gen_len} bp each. |
| vcftools 0.1.14 | Used to 1) generate files with individual and site missingness info for visualization and 2) extract per-sample DP info for percentile tables. |
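
The sketch below shows how these tools could be chained for the steps described above. It only prints the commands and runs nothing. The file names (pass_sites.vcf.gz, intervals.bed, annotations.table, genome.sizes), the print_commands helper, and the exact set of -F fields are assumptions for illustration, and the handler's actual invocations may differ. The options shown (SelectVariants --exclude-filtered, VariantsToTable -F, bedtools random -l/-n/-g, and vcftools --missing-indv, --missing-site, and --geno-depth) are standard options of those tools.

```shell
# Values normally come from Config; the fallbacks here are placeholders
vcf="${PVF_VCF:-/path/to/project.recalibrated.vcf.gz}"
out="${OUT_DIR:-/path/to/out}/Pre_Variant_Filtering"
gen_num="${PVF_GEN_NUM:-1000000}"
gen_len="${PVF_GEN_LEN:-100}"
pass="${out}/pass_sites.vcf.gz"     # hypothetical output name

# Print (do not execute) one possible command sequence
print_commands() {
cat <<EOF
gatk SelectVariants -V ${vcf} --exclude-filtered -O ${pass}
bedtools random -l ${gen_len} -n ${gen_num} -g genome.sizes > ${out}/intervals.bed
gatk VariantsToTable -V ${pass} -L ${out}/intervals.bed -F CHROM -F POS -F QUAL -F QD -F DP -F FS -F SOR -F MQ -F MQRankSum -F ReadPosRankSum -O ${out}/annotations.table
vcftools --gzvcf ${pass} --missing-indv --out ${out}/missingness
vcftools --gzvcf ${pass} --missing-site --out ${out}/missingness
vcftools --gzvcf ${pass} --geno-depth --out ${out}/depth
EOF
}

print_commands
```

Here genome.sizes stands for a tab-delimited file of chromosome names and lengths, which bedtools random requires through -g.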