This documents how to use vcfanno to annotate a VCF with CADD scores.
"CADD is a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletions variants in the human genome." Users of CADD should refer to the web page for citation and use requirements.
From here, we can specify a conf file:
[[annotation]]
file="whole_genome_SNVs.tsv.gz"
names=["cadd_raw", "cadd_phred"]
ops=["mean", "mean"]
columns=[4, 5]
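To make the conf fields concrete, here is a minimal Python sketch of what a "mean" op over columns 4 and 5 amounts to. It is not vcfanno's implementation. It assumes that column numbers are 1-based and that each entry in names receives the aggregated value from the matching entry in columns. The function annotate_mean and the example rows (including their column layout) are hypothetical, made up for illustration.

```python
from statistics import mean


def annotate_mean(rows, columns, names):
    """Average the requested 1-based columns across overlapping annotation rows.

    rows    -- annotation rows (lists of strings) that overlap one query variant
    columns -- 1-based column numbers, as in the conf's `columns`
    names   -- output INFO names, as in the conf's `names`
    """
    info = {}
    for col, name in zip(columns, names):
        # Convert to 0-based indexing before reading the value.
        values = [float(row[col - 1]) for row in rows]
        info[name] = mean(values)
    return info


# Hypothetical rows: two annotation records overlapping the same query site.
rows = [
    ["1", "10000", "10001", "0.5", "8.0"],
    ["1", "10000", "10001", "1.5", "12.0"],
]

result = annotate_mean(rows, columns=[4, 5], names=["cadd_raw", "cadd_phred"])
print(result)
```

When only one annotation row overlaps a query variant, the mean is simply that row's value.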
And we can run vcfanno as:
vcfanno -p 12 cadd.conf query.vcf > query.anno.vcf
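To check the result, the new values can be read back out of the INFO column of the annotated VCF. The following is a minimal Python sketch, assuming the annotations appear as cadd_raw and cadd_phred keys in INFO (the names given in the conf). The helper info_value and the example record are hypothetical.

```python
def info_value(vcf_line, key):
    """Return the value of `key` from the INFO column of one VCF record, or None."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th column of a VCF record
    for entry in info.split(";"):
        name, sep, value = entry.partition("=")
        if name == key and sep:
            return value
    return None


# Hypothetical annotated record; a real one would come from query.anno.vcf.
line = "1\t10000\t.\tA\tG\t50\tPASS\tAC=2;cadd_raw=1.25;cadd_phred=15.3\n"

phred = info_value(line, "cadd_phred")
print(phred)
```

Values come back as strings, so convert with float() before doing arithmetic on them.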
As an extreme case, we can run this on the ExAC VCF:
vcfanno -p 18 cadd.conf ExAC.r0.3.sites.vep.tidy.vcf.gz | bgzip -c > /tmp/exac-cadd.vcf.gz
This takes about 88 minutes on a good server. Most of that time goes to parsing the large number of
lines in the CADD annotation file, even with the algorithm that allows vcfanno to avoid parsing
annotation intervals that fall in large gaps in the query. This time will improve in future versions.
By comparison, bedtools intersect -sorted takes 92 minutes for this same overlap.
Note that this approach only works for single-nucleotide variants, since the default for VCF is to match on REF and ALT.
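The effect of matching on REF and ALT can be illustrated with a small Python sketch. It is an illustration of exact-allele matching only, not vcfanno's actual data structure, and the names build_index and lookup and the scores are hypothetical. An indel at the same position finds no entry because a SNV-only score table contains no record with its alleles.

```python
def build_index(annotations):
    """Index annotation scores by (chrom, pos, ref, alt)."""
    return {
        (chrom, pos, ref, alt): score
        for chrom, pos, ref, alt, score in annotations
    }


def lookup(index, chrom, pos, ref, alt):
    """Return the score only when position AND both alleles match exactly."""
    return index.get((chrom, pos, ref, alt))


# Hypothetical SNV-only scores at a single position.
snv_scores = [
    ("1", 10000, "A", "G", 15.3),
    ("1", 10000, "A", "T", 12.1),
]
index = build_index(snv_scores)

snv_hit = lookup(index, "1", 10000, "A", "G")     # SNV present in the table
indel_hit = lookup(index, "1", 10000, "A", "AG")  # insertion: no matching alleles

print(snv_hit, indel_hit)
```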