bystrogenomics/bystro-vcf

bystro-vcf

TL;DR

Annotate VCF files at millions of variants per minute. Saturates pigz/gunzip on a 4-core CPU.

go get github.com/akotlar/bystro-vcf && go install $_;

pigz -p 1 -d -c in.vcf.gz | bystro-vcf --keepId --keepInfo | pigz -c - > output

Description

Performs several important functions:

  1. Splits multiallelics and MNP alleles, keeping track of each allele's index with respect to the original alleles for downstream INFO property segregation
  2. Performs QC on variants: checks that alleles contain only A, C, T, and G, that padding bases match the reference, and more
  3. Allows filtering of variants by any number of FILTER properties (by default allows PASS/. variants)
  4. Normalizes indel representations by removing padding and left-shifting alleles to their parsimonious representations
  5. Calculates whether site is transition, transversion, or neither
  6. Processes all available samples
    • calculates homozygosity, heterozygosity, missingness
    • labels samples as homozygous, heterozygous, or missing
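As an illustration of item 5, here is a minimal Go sketch of a transition/transversion classifier. It is not taken from the bystro-vcf source. The function name and the numeric encoding (0 = neither, 1 = transition, 2 = transversion) are assumptions made for this example; the README lists the values 0, 1 and 2 but does not say which is which.

```go
package main

import "fmt"

// Hypothetical encoding for this sketch only; the README does not state
// which of 0, 1 and 2 means transition, transversion, or neither.
const (
	notTrTv      = 0 // neither, e.g. an indel
	transition   = 1 // purine<->purine or pyrimidine<->pyrimidine
	transversion = 2 // purine<->pyrimidine
)

// isPurine reports whether base is A or G.
func isPurine(base byte) bool { return base == 'A' || base == 'G' }

// isPyrimidine reports whether base is C or T.
func isPyrimidine(base byte) bool { return base == 'C' || base == 'T' }

// classifySNP classifies a single-base substitution. Anything that is not a
// one-base ref and a different one-base alt is reported as "neither".
func classifySNP(ref, alt string) int {
	if len(ref) != 1 || len(alt) != 1 || ref == alt {
		return notTrTv
	}
	r, a := ref[0], alt[0]
	switch {
	case isPurine(r) && isPurine(a), isPyrimidine(r) && isPyrimidine(a):
		return transition
	case isPurine(r) && isPyrimidine(a), isPyrimidine(r) && isPurine(a):
		return transversion
	default:
		return notTrTv // non-ACGT base
	}
}

func main() {
	fmt.Println(classifySNP("A", "G"))  // A>G is a transition
	fmt.Println(classifySNP("A", "C"))  // A>C is a transversion
	fmt.Println(classifySNP("AT", "A")) // deletion: neither
}
```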

Publication

bystro-vcf is used to pre-process VCF files for Bystro (github).

If you use bystro-vcf, please cite https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1387-3


Performance

Processes millions of variants (rows) per minute. Throughput depends on the number of samples.

Example:

Amazon i3.2xlarge (4 core), 1K Genomes Phase 3 (2,504 samples): chromosome 1 (6.2M variants) in ~2 minutes 45 seconds, or roughly 2.25 million variants per minute

  • Runs at approximately the streaming decompression limit of pigz -p 1 (97% CPU, 2% sys post-Meltdown/Spectre).
==> ( time pigz -d -c -p 1 ../../../mnt/annotator/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bystro-vcf  &> /dev/null ; )

real	2m45.134s
user	16m20.512s
sys	0m25.940s

Installation

go get github.com/akotlar/bystro-vcf && go install $_;

Use

Via pipe:

pigz -d -c in.vcf.gz | bystro-vcf --keepId --keepInfo --allowFilter "PASS,." | pigz -c - > out.gz

Via inPath argument:

bystro-vcf --in in.vcf --keepId --keepInfo --allowFilter "PASS,." > out

Output

Columns, in order (a trailing ? marks an optional column):

  1. chrom <String>
  2. pos <Int>
  3. type <String[SNP|DEL|INS|MULTIALLELIC]>
  4. ref <String>
  5. alt <String>
  6. trTv <Int[0|1|2]>
  7. heterozygotes <String>
  8. heterozygosity <Float64>
  9. homozygotes <String>
  10. homozygosity <Float64>
  11. missingGenos <String>
  12. missingness <Float64>
  13. sampleMaf <Float64>
  14. id <String?>
  15. alleleIndex <Int?>
  16. info <String?>
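For downstream consumers, the column listing above maps naturally onto a struct. The following Go sketch is one possible reader, not part of bystro-vcf. It assumes the columns are tab-separated, which this README does not state, and the Row type and parseRow function are hypothetical names. The sample line in main is hand-written for illustration and is not real program output.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Row holds the 13 columns that are always present, in the documented order.
// Optional trailing columns (id, alleleIndex, info) are kept as raw strings.
type Row struct {
	Chrom, Type, Ref, Alt                    string
	Pos, TrTv                                int
	Heterozygotes, Homozygotes, MissingGenos string
	Heterozygosity, Homozygosity             float64
	Missingness, SampleMaf                   float64
	Optional                                 []string
}

// parseRow splits one output line on tabs (an assumption) and converts the
// numeric columns. It returns an error on short rows or unparseable numbers.
func parseRow(line string) (Row, error) {
	f := strings.Split(strings.TrimRight(line, "\r\n"), "\t")
	if len(f) < 13 {
		return Row{}, fmt.Errorf("expected at least 13 columns, got %d", len(f))
	}
	row := Row{
		Chrom: f[0], Type: f[2], Ref: f[3], Alt: f[4],
		Heterozygotes: f[6], Homozygotes: f[8], MissingGenos: f[10],
		Optional: f[13:],
	}
	var err error
	if row.Pos, err = strconv.Atoi(f[1]); err != nil {
		return Row{}, fmt.Errorf("pos: %w", err)
	}
	if row.TrTv, err = strconv.Atoi(f[5]); err != nil {
		return Row{}, fmt.Errorf("trTv: %w", err)
	}
	// Float columns and the struct fields they fill, keyed by column index.
	floats := map[int]*float64{
		7: &row.Heterozygosity, 9: &row.Homozygosity,
		11: &row.Missingness, 12: &row.SampleMaf,
	}
	for i, dst := range floats {
		if *dst, err = strconv.ParseFloat(f[i], 64); err != nil {
			return Row{}, fmt.Errorf("column %d: %w", i+1, err)
		}
	}
	return row, nil
}

func main() {
	// Hand-written illustrative row, not real bystro-vcf output.
	line := "chr1\t100\tSNP\tA\tG\t1\tS1;S2\t0.5\tS3\t0.25\t!\t0\t0.5"
	row, err := parseRow(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(row.Chrom, row.Pos, row.Type, row.SampleMaf)
}
```

The sketch rejects rows whose numeric columns do not parse, so a caller that expects the --emptyField marker in a numeric column would need to handle that case first.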

Optional arguments

--keepId <Bool>

Retain the "ID" field in the output.


--keepInfo <Bool>

Retain the "INFO" field in the output.

  • Since we decompose multiallelics, an "alleleIdx" field is added to the output. It contains the 0-based index of that allele in the multiallelic site.
  • This is necessary for downstream programs to decompose the INFO field per allele.

Adds 2 output fields, which follow missingGenos, or id if --keepId is set:

  1. alleleIdx contains the index of the allele in a split multiallelic site; 0 by default.
  2. info contains the entire INFO string.
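To show why the allele index matters, here is a small Go sketch of how a downstream program might pull the value belonging to one allele out of the retained INFO string. It is an illustration, not bystro-vcf code. The function name perAlleleValue is hypothetical, and the caller is assumed to know which INFO keys carry one comma-separated value per alternate allele (declared as Number=A in the VCF header).

```go
package main

import (
	"fmt"
	"strings"
)

// perAlleleValue looks up key in a VCF INFO string ("K1=v1;K2=a,b;FLAG")
// and returns the comma-separated value at position alleleIdx (0-based).
// The second result is false if the key is absent, has no value, or has
// fewer values than alleleIdx+1.
func perAlleleValue(info, key string, alleleIdx int) (string, bool) {
	for _, entry := range strings.Split(info, ";") {
		k, v, hasValue := strings.Cut(entry, "=")
		if !hasValue || k != key {
			continue
		}
		vals := strings.Split(v, ",")
		if alleleIdx < 0 || alleleIdx >= len(vals) {
			return "", false
		}
		return vals[alleleIdx], true
	}
	return "", false
}

func main() {
	// Illustrative INFO string for a site with two alternate alleles.
	info := "AC=3,5;AF=0.1,0.2;DB"
	if v, ok := perAlleleValue(info, "AC", 1); ok {
		fmt.Println("AC for allele 1:", v)
	}
}
```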

--allowFilter <String>

Which FILTER values to keep. Comma separated. Defaults to "PASS,.".

If passed "" (empty string) or "*" (wildcard), all FILTER values are allowed.


--excludeFilter <String>

Which FILTER values to exclude. Comma separated. Defaults to ""
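The two filter options can be read as an allow list and a deny list. The Go sketch below models that reading; it is not the tool's implementation. The filterRule type is a hypothetical name, and two behaviours are assumptions of this example: an excluded value wins over an allowed one, and the FILTER column is matched as a single whole string.

```go
package main

import (
	"fmt"
	"strings"
)

// filterRule models --allowFilter and --excludeFilter for this sketch.
type filterRule struct {
	allowAll bool
	allowed  map[string]bool
	excluded map[string]bool
}

// newFilterRule parses the two comma-separated option values. An empty
// string or "*" for allowFilter allows every FILTER value, as documented.
func newFilterRule(allowFilter, excludeFilter string) filterRule {
	r := filterRule{allowed: map[string]bool{}, excluded: map[string]bool{}}
	if allowFilter == "" || allowFilter == "*" {
		r.allowAll = true
	} else {
		for _, v := range strings.Split(allowFilter, ",") {
			r.allowed[v] = true
		}
	}
	if excludeFilter != "" {
		for _, v := range strings.Split(excludeFilter, ",") {
			r.excluded[v] = true
		}
	}
	return r
}

// keep reports whether a variant with this FILTER value passes.
// Assumption: exclusion takes precedence over allowance.
func (r filterRule) keep(filter string) bool {
	if r.excluded[filter] {
		return false
	}
	return r.allowAll || r.allowed[filter]
}

func main() {
	defaults := newFilterRule("PASS,.", "")
	fmt.Println(defaults.keep("PASS"), defaults.keep("."), defaults.keep("LowQual"))
}
```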


--in /path/to/uncompressedFile.vcf

Path to an uncompressed VCF input file. Defaults to stdin.


--out <String>

Write the output to this path instead of stdout.


--err /path/to/log.txt

Where to store log messages. Defaults to stderr


--emptyField "!"

Which value to assign to missing data. Defaults to "!".


--fieldDelimiter ";"

Which delimiter to use when joining multiple values. Defaults to ;
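A consumer of the output has to undo both of these conventions. The Go sketch below shows one way to turn a multi-value column back into a slice, treating the empty-field marker as "no values". The function name splitField is hypothetical, and the assumption that columns such as heterozygotes hold sample names joined by the field delimiter is an inference from this README rather than a documented guarantee.

```go
package main

import (
	"fmt"
	"strings"
)

// splitField converts one output column into its individual values.
// A column equal to emptyField (or an empty string) yields no values.
func splitField(value, emptyField, delimiter string) []string {
	if value == "" || value == emptyField {
		return nil
	}
	return strings.Split(value, delimiter)
}

func main() {
	// Defaults documented above: emptyField "!" and fieldDelimiter ";".
	fmt.Println(len(splitField("S1;S2;S3", "!", ";"))) // three sample names
	fmt.Println(len(splitField("!", "!", ";")))        // marker: no values
}
```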