bystro-vcf

TL;DR

Annotate VCF files at millions of variants per minute. Saturates pigz/gunzip on a 4-core CPU.

go get github.com/akotlar/bystro-vcf && go install $_;

pigz -p 1 -d -c in.vcf.gz | bystro-vcf --keepId --keepInfo | pigz -c - > output

Description

Performs several important functions:

- Splits multiallelic and MNP alleles, keeping track of each allele's index with respect to the original alleles for downstream INFO property segregation
- Performs QC on variants: checks that alleles consist of ACTG bases, that padding bases match the reference, and more
- Allows filtering of variants by any number of FILTER properties (by default allows PASS/. variants)
- Normalizes indel representations by removing padding and left-shifting alleles to their parsimonious representations
- Calculates whether a site is a transition, a transversion, or neither
- Processes all available samples
  - calculates homozygosity, heterozygosity, missingness
  - labels samples as homozygous, heterozygous, or missing
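
As an illustration of the transition/transversion step, here is a minimal Go sketch. It is not taken from the bystro-vcf source: the function name `classify` and the mapping of classes to the integers 0, 1, and 2 are assumptions (the README only says that trTv takes one of those three values).

```go
package main

import "fmt"

// Class labels a site. The numeric values are an assumption made for this
// sketch; the README only states that trTv is one of 0, 1, or 2.
type Class int

const (
	Neither Class = iota
	Transition
	Transversion
)

func isPurine(b byte) bool     { return b == 'A' || b == 'G' }
func isPyrimidine(b byte) bool { return b == 'C' || b == 'T' }

// classify returns Transition for purine<->purine or pyrimidine<->pyrimidine
// single-base changes, Transversion for purine<->pyrimidine changes, and
// Neither for anything that is not a single-base substitution.
func classify(ref, alt string) Class {
	if len(ref) != 1 || len(alt) != 1 || ref == alt {
		return Neither
	}
	r, a := ref[0], alt[0]
	if !(isPurine(r) || isPyrimidine(r)) || !(isPurine(a) || isPyrimidine(a)) {
		return Neither
	}
	if isPurine(r) == isPurine(a) {
		return Transition
	}
	return Transversion
}

func main() {
	fmt.Println(classify("A", "G"), classify("A", "C"), classify("AT", "A"))
}
```

Indels and multi-base alleles fall through to `Neither` here, which matches the "or neither" wording above but may differ in detail from what the tool does.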

Publication

bystro-vcf is used to pre-process VCF files for Bystro (GitHub).

If you use bystro-vcf please cite https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1387-3

Performance

Millions of variants (rows) per minute. Performance depends on the number of samples.

Example:

Amazon i3.2xlarge (4 core), 1K Genomes Phase 3 (2,504 samples): chromosome 1 (6.2M variants) in ~2 minutes 45 seconds.

Runs at roughly the streaming decompression limit of pigz -p 1 (97% CPU, 2% sys post-Meltdown/Spectre).

==> ( time pigz -d -c -p 1 ../../../mnt/annotator/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bystro-vcf  &> /dev/null ; )

real	2m45.134s
user	16m20.512s
sys	0m25.940s

Installation

go get github.com/akotlar/bystro-vcf && go install $_;

Use

Via pipe:

pigz -d -c in.vcf.gz | bystro-vcf --keepId --keepInfo --allowFilter "PASS,." | pigz -c - > out.gz

Via inPath argument:

bystro-vcf --in in.vcf --keepId --keepInfo --allowFilter "PASS,." > out

Output

chrom <String>
pos <Int>
type <String[SNP|DEL|INS|MULTIALLELIC]>
ref <String>
alt <String>
trTv <Int[0|1|2]>
heterozygotes <String>
heterozygosity <Float64>
homozygotes <String>
homozygosity <Float64>
missingGenos <String>
missingness <Float64>
sampleMaf <Float64>
id <String?>
alleleIndex <Int?>
info <String?>
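
For downstream consumers, the field list above maps naturally onto a typed record. The following Go sketch parses one output row into a struct. It assumes that columns are tab-separated and that the numeric columns always hold parseable numbers; neither is stated in this README. The `Row` type, `parseRow`, and the sample line are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Row mirrors the 13 always-present columns, in the order listed above.
// Extra holds any trailing optional columns (id, alleleIndex, info) unparsed.
type Row struct {
	Chrom          string
	Pos            int
	Type           string
	Ref            string
	Alt            string
	TrTv           int
	Heterozygotes  string
	Heterozygosity float64
	Homozygotes    string
	Homozygosity   float64
	MissingGenos   string
	Missingness    float64
	SampleMaf      float64
	Extra          []string
}

// parseRow splits one line on tabs (an assumption) and converts the
// Int and Float64 columns.
func parseRow(line string) (Row, error) {
	f := strings.Split(strings.TrimRight(line, "\r\n"), "\t")
	if len(f) < 13 {
		return Row{}, fmt.Errorf("expected at least 13 columns, got %d", len(f))
	}
	r := Row{
		Chrom: f[0], Type: f[2], Ref: f[3], Alt: f[4],
		Heterozygotes: f[6], Homozygotes: f[8], MissingGenos: f[10],
		Extra: f[13:],
	}
	var err error
	if r.Pos, err = strconv.Atoi(f[1]); err != nil {
		return Row{}, fmt.Errorf("pos: %w", err)
	}
	if r.TrTv, err = strconv.Atoi(f[5]); err != nil {
		return Row{}, fmt.Errorf("trTv: %w", err)
	}
	// Columns 7, 9, 11, and 12 (0-based) are the Float64 fields.
	floats := map[int]*float64{
		7: &r.Heterozygosity, 9: &r.Homozygosity,
		11: &r.Missingness, 12: &r.SampleMaf,
	}
	for i, dst := range floats {
		if *dst, err = strconv.ParseFloat(f[i], 64); err != nil {
			return Row{}, fmt.Errorf("column %d: %w", i, err)
		}
	}
	return r, nil
}

func main() {
	// Hypothetical row for illustration, not real bystro-vcf output.
	line := "chr1\t100\tSNP\tA\tG\t1\tS1;S2\t0.5\tS3\t0.25\t!\t0\t0.5"
	row, err := parseRow(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(row.Chrom, row.Pos, row.Type, row.Heterozygosity)
}
```

If a numeric column can carry the --emptyField placeholder in practice, `parseRow` would need to special-case it rather than return an error.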

Optional arguments

--keepId <Bool>

Retain the "ID" field in the output.

--keepInfo <Bool>

Retain the "INFO" field in the output.

Because multiallelic sites are decomposed, an "alleleIdx" field is added to the output. It contains the 0-based index of that allele within the original multiallelic site, which downstream programs need in order to decompose the INFO field per allele.

Adds 2 output fields, placed after missingGenos, or after id if --keepId is set:

- alleleIdx: the index of the allele in a split multiallelic site. 0 by default.
- info: the entire INFO string.
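
A downstream program could use the allele index to pick the matching value out of a per-allele INFO entry. The Go sketch below assumes standard VCF INFO syntax (semicolon-separated `key=value` pairs with comma-separated values) and assumes the key holds one value per ALT allele, so that the 0-based index lines up. The helper name `perAlleleValue` is hypothetical and not part of bystro-vcf.

```go
package main

import (
	"fmt"
	"strings"
)

// perAlleleValue returns the alleleIdx-th comma-separated value of the INFO
// entry named key. It reports false if the key is absent, is a flag without
// a value, or has too few values for the requested index.
func perAlleleValue(info, key string, alleleIdx int) (string, bool) {
	for _, entry := range strings.Split(info, ";") {
		k, v, hasValue := strings.Cut(entry, "=")
		if !hasValue || k != key {
			continue
		}
		vals := strings.Split(v, ",")
		if alleleIdx < 0 || alleleIdx >= len(vals) {
			return "", false
		}
		return vals[alleleIdx], true
	}
	return "", false
}

func main() {
	// Hypothetical INFO string for a site with two ALT alleles.
	info := "AC=3,5;AF=0.1,0.2;DP=100"
	af, ok := perAlleleValue(info, "AF", 1)
	fmt.Println(af, ok)
}
```

Keys that carry a single site-level value (such as DP above) should be read as-is; the caller has to know from the VCF header which keys are per-allele.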

--allowFilter <String>

Which FILTER values to keep. Comma separated. Defaults to "PASS,.".

Passing "" (empty string) or "*" (wildcard) allows all FILTER values.

Similar to the bcftools option -f, --apply-filters LIST (https://samtools.github.io/bcftools/bcftools.html)

--excludeFilter <String>

Which FILTER values to exclude. Comma separated. Defaults to ""

The opposite of the bcftools option -f, --apply-filters LIST (https://samtools.github.io/bcftools/bcftools.html)
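
The interaction of the two filter options can be sketched in Go as follows. This is an illustration rather than the tool's implementation: it assumes the FILTER column is matched as a single token, and that an excluded value is dropped even if it is also allowed. Both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFilterList turns a comma-separated option value into a set.
// A nil set stands for "no restriction", used for "" and "*".
func parseFilterList(s string) map[string]bool {
	if s == "" || s == "*" {
		return nil
	}
	set := make(map[string]bool)
	for _, v := range strings.Split(s, ",") {
		set[strings.TrimSpace(v)] = true
	}
	return set
}

// keepVariant reports whether a row with the given FILTER value survives.
// Exclusion wins over inclusion (an assumption of this sketch).
func keepVariant(filter string, allow, exclude map[string]bool) bool {
	if exclude[filter] { // reading a nil map is safe and yields false
		return false
	}
	return allow == nil || allow[filter]
}

func main() {
	allow := parseFilterList("PASS,.") // the documented default
	exclude := parseFilterList("")     // the documented default
	fmt.Println(keepVariant("PASS", allow, exclude))
	fmt.Println(keepVariant("LowQual", allow, exclude))
}
```

With the defaults shown, a PASS row is kept and a row with any other named filter is dropped.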

--in /path/to/uncompressedFile.vcf

An input file path, to an uncompressed VCF file. Defaults to stdin

--out <String>

Send the output here instead of STDOUT

--err /path/to/log.txt

Where to store log messages. Defaults to stderr

--emptyField "!"

Which value to assign to missing data. Defaults to !

--fieldDelimiter ";"

Which delimiter to use when joining multiple values. Defaults to ;
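
When reading the output back, multi-valued fields such as the sample lists can be split using the same two settings. The Go sketch below assumes the defaults documented above ("!" and ";"); the helper `splitField` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// splitField returns the individual values of a joined field, or nil when
// the field is empty or holds the missing-data placeholder.
func splitField(field, emptyField, delimiter string) []string {
	if field == "" || field == emptyField {
		return nil
	}
	return strings.Split(field, delimiter)
}

func main() {
	// Hypothetical field values using the default --emptyField and --fieldDelimiter.
	fmt.Println(len(splitField("S1;S2;S3", "!", ";")))
	fmt.Println(len(splitField("!", "!", ";")))
}
```

If either option is changed on the command line, the same values need to be passed to the reader.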
