Pre_Variant_Filtering
The Pre_Variant_Filtering handler performs a pre-filtering analysis that helps users choose cutoffs for filtering the VCF file in the Variant_Filtering handler. It graphs the distributions of annotations in the VCF file (e.g., QUAL, DP, SOR, MQRankSum) and generates per-sample percentile tables for GQ and DP. It is intended to be run before Variant_Filtering, especially if users ran the Variant_Recalibrator handler: sites that did not pass the recalibration model are removed in this handler, leaving only PASS sites.
./sequence_handling Pre_Variant_Filtering /path/to/Config
Currently, annotation plots use a set of default cutoffs. In the future, we hope to allow users to specify the cutoffs used in the visualization.
The following variables need to be defined within Config. In addition to these handler-specific variables, all common variables must be defined.
Variable | Function |
---|---|
PVF_QUEUE | Queue we are using for batch submission. If using Slurm, users can specify multiple comma-separated partitions (e.g., "small,ram256g,ram1t"). If using PBS, specify one queue (e.g., "small"). |
PVF_SBATCH | If using Slurm, these are the Slurm settings for batch submission. Delete if not using Slurm. Recommended settings are "--nodes=1 --ntasks-per-node=16 --mem=110gb --tmp=22gb -t 72:00:00 --mail-type=ALL --mail-user=${EMAIL} -p ${PVF_QUEUE}". |
PVF_QSUB | If using PBS, these are the QSub settings for batch submission. Delete if not using PBS. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
PVF_VCF | The full filepath to the VCF file used to graph annotations and generate percentile tables. If you ran the Variant_Recalibrator handler, this will be the *.recalibrated.vcf.gz file. |
PVF_GEN_NUM | The number of genomic regions to subset for graphing annotations. This step is necessary for large VCF files (e.g., >500GB). Default is 1000000. |
PVF_GEN_LEN | The length (bp) of genomic regions to subset for graphing annotations. Default is 100. |
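As an illustration of the table above, the handler-specific part of a Config file might look like the following sketch. The values are the recommended settings and defaults listed in the table; the VCF path and the e-mail address are placeholders, and `EMAIL` is assumed to be one of the common variables defined elsewhere in Config.

```shell
# Sketch of the Pre_Variant_Filtering section of Config.
# Placeholder values: the e-mail address and the VCF path are not real defaults.

# Assumed to be a common variable; repeated here only so ${EMAIL} expands.
EMAIL="user@example.edu"

# Slurm: one or more comma-separated partitions. PBS: a single queue.
PVF_QUEUE="small"

# Slurm users keep PVF_SBATCH and delete PVF_QSUB.
PVF_SBATCH="--nodes=1 --ntasks-per-node=16 --mem=110gb --tmp=22gb -t 72:00:00 --mail-type=ALL --mail-user=${EMAIL} -p ${PVF_QUEUE}"

# PBS users keep PVF_QSUB and delete PVF_SBATCH.
PVF_QSUB="mem=22gb,nodes=1:ppn=16,walltime=24:00:00"

# Full path to the VCF to analyze (placeholder path).
PVF_VCF="/path/to/project.recalibrated.vcf.gz"

# Number and length (bp) of genomic regions to subset for plotting.
PVF_GEN_NUM=1000000
PVF_GEN_LEN=100
```

Because `PVF_SBATCH` references `${PVF_QUEUE}` and `${EMAIL}`, both need to be assigned before it in the file.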
Pre_Variant_Filtering generates plots of annotation distributions, percentile tables, and a "PASS" sites only VCF (if users ran Variant_Recalibrator). The VCF file can be found at:
${OUT_DIR}/Pre_Variant_Filtering
The percentile tables can be found at:
${OUT_DIR}/Pre_Variant_Filtering/Percentile_Tables
The plots generated include:
- Distributions of common GATK variant annotations (i.e., Quality by Depth, Strand Odds Ratio, Mapping Quality Rank Sum Test, Depth, Fisher Strand, Mapping Quality, Read Position Rank Sum Test, and Quality Score)
- Per-site and per-individual missingness
- Heterozygosity, Excess Heterozygosity, and Inbreeding Coefficients
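To make the idea of a percentile table concrete, here is a minimal sketch that computes nearest-rank percentiles for a single column of numbers, such as the DP values of one sample. It is not the handler's implementation (the handler uses R for this step); the function name `percentiles` and the input layout are assumptions made for illustration.

```shell
# percentiles P1 [P2 ...]
# Reads one value per line on stdin and prints "percentile<TAB>value"
# using the nearest-rank method. Integer percentiles (0-100) are assumed.
# Lines that are not non-negative numbers (e.g. "." or -1 for missing data)
# are skipped.
percentiles() {
    sort -n | awk -v plist="$*" '
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ { v[++c] = $1 }
        END {
            if (c == 0) exit 1              # no usable values
            n = split(plist, p, " ")
            for (i = 1; i <= n; i++) {
                # nearest rank: ceil(p * c / 100), clamped to at least 1
                r = int((p[i] * c + 99) / 100)
                if (r < 1) r = 1
                printf "%s\t%s\n", p[i], v[r]
            }
        }'
}

# Example: five made-up depth values for one sample.
printf '%s\n' 12 7 30 18 25 | percentiles 50 90
```

Low percentiles of DP or GQ computed this way can suggest a lower cutoff for Variant_Filtering, and high DP percentiles an upper one.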
Pre_Variant_Filtering depends on the following software:
Dependency | Use |
---|---|
GATK 4.1.2 | Used for 1) selecting "PASS" sites only from a recalibrated VCF file and 2) creating a variant table for plotting annotations. |
R 3.6.3 (or greater) | Used for 1) plotting annotation distributions, 2) plotting individual and site missingness, and 3) generating percentile tables. Requires R packages: ggplot2, gridExtra, and bigmemory. |
bedtools 2.17.0 (or greater) | Used for creating an intervals file that randomly subsets the genome into ${gen_num} regions of ${gen_len} bp each. |
vcftools 0.1.14 | Used to 1) generate files with individual and site missingness info for visualizing and 2) extract DP per sample info for percentile tables. |
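The uses listed in the table map onto standard subcommands of each tool. The sketch below prints one plausible command per use without running anything. The exact commands, flags, and file names used inside the handler are not documented here, so everything in this block is an assumption: the function name `pvf_sketch_cmds`, the output names, and the choice of annotations passed to `-F` are illustrative only.

```shell
#!/bin/bash
set -euo pipefail

# Print (do not run) one plausible command for each dependency use.
# Arguments: VCF path, output prefix, genome file (chrom<TAB>size),
#            number of regions (default 1000000), region length (default 100).
pvf_sketch_cmds() {
    local vcf="$1" prefix="$2" genome="$3"
    local gen_num="${4:-1000000}" gen_len="${5:-100}"
    cat <<EOF
bedtools random -l ${gen_len} -n ${gen_num} -g ${genome} > ${prefix}.random_regions.bed
gatk SelectVariants -V ${vcf} --exclude-filtered -O ${prefix}.pass.vcf.gz
gatk VariantsToTable -V ${prefix}.pass.vcf.gz -F CHROM -F POS -F QUAL -F QD -F SOR -F MQRankSum -O ${prefix}.variants.table
vcftools --gzvcf ${prefix}.pass.vcf.gz --missing-indv --out ${prefix}
vcftools --gzvcf ${prefix}.pass.vcf.gz --missing-site --out ${prefix}
vcftools --gzvcf ${prefix}.pass.vcf.gz --geno-depth --out ${prefix}
EOF
}

# Placeholder paths; replace with real files before copying any command.
pvf_sketch_cmds /path/to/input.recalibrated.vcf.gz pvf_sketch /path/to/genome.txt
```

Printing the commands first makes it easy to check paths and region settings on a small test VCF before running the full handler.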
Next: Variant_Filtering