vcf2db cli error #42

matthdsm · 2018-04-24T07:56:00Z

Hi Brent,

I got the following error using vcf2db:

[login] matdsmet:2018-04-21_NSQ_147 $ vcf2db.py D1514352-joint-gatk-haplotype-joint-annotated-decomposed.vcf.gz D1514352.ped D1514352-joint-gatk-haplotype-joint.db                                      [9:33:35]
/Shared/Software/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'D1514352-joint'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
/Shared/Software/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
WARNING: unknown severity for '-|frameshift_variant&start_lost&start_retained_variant|HIGH|HRNR|388697|Transcript|NM_001009931.2|protein_coding|2/3||NM_001009931.2:c.1del|NP_001009931.1:p.Met1?|80/9632|1/8553|1/2850|M/X|Atg/tg|rs34061715&COSM111478|1||-1||deletion|EntrezGene|HGNC:20846|YES||||NP_001009931.1||||rseq_mrna_match|RefSeq|T|T||||||||0.7337|0.8818|0.9544|0.9592|0.8875|||0.9028|0.7227|0.8276|0.9554|0.9063|0.9541|0.9411|0.9142|0.9069|0.9592|EUR||0&1|0&1||||||HC|||||||'. using LOW for [u'frameshift_variant', u'start_lost', u'start_retained_variant']
Please report this on github with the effect-string above

It seems vcf2db has a problem with some data inside the CSQ tag from VEP. I also noticed not all data from the CSQ tag makes it into the db. Especially annoying for us is that for some reason, the canonical column is nowhere to be found, even though it's in the VCF.

I've included a sample vcf, so you might be able to replicate the error.
The vcf was created using bcbio v1.0.9

Cheers and thanks for the help
M

The text was updated successfully, but these errors were encountered:

matthdsm · 2018-04-24T08:12:48Z

Correction, the missing column only occurs in db's created by bcbio!
Ping @chapmanb. Do you have any idea where this could occur? When manually creating a db from the decomposed vcf in the final dir, all data is present, but the pregenerated db's are missing the canonical column...

M

brentp · 2018-04-24T15:02:06Z

could be that bcbio needs to update either vcf2db or geneimpacts ?

matthdsm · 2018-04-24T15:31:37Z

Hi Brent,

Thanks for the reply. I don't think so, bcbio has the latest versions available in conda...

# Name                    Version                   Build  Channel
vcf2db                    2018.01.23               py27_0    bioconda
geneimpacts               0.3.4                    py27_0    bioconda

M

chapmanb · 2018-04-25T18:36:32Z

Matthias and Brent;
Sorry about the issue and thanks for cc'ing me into the conversation. I'm not sure why you'd see a different between bcbio vcf2db.py runs and those done by hand. bcbio isn't doing much fancy here beyond calling the command:

https://github.com/bcbio/bcbio-nextgen/blob/1083b6768511c20c5d006b0a49758fc6d3bb2167/bcbio/variation/population.py#L67

If you look in log/bcbio-nextgen-commands.log, is the same command being run that you're testing outside? Are there any other differences that could help us explore what is happening? Thanks for the help debugging.

matthdsm · 2018-04-26T06:23:06Z

Hi Brent, Brad.

I did some more digging, and I think the error is coming from the geneimpacts module.
Our HPC cloud has:

#packages
-bash-4.2$ conda list vcf2db
# packages in environment at /user/data/gent/gvo000/gvo00082/vsc41443/bcbio/anaconda:
#
# Name                    Version                   Build  Channel
vcf2db                    2018.01.23               py27_0    bioconda
-bash-4.2$ conda list geneimpacts
# packages in environment at /user/data/gent/gvo000/gvo00082/vsc41443/bcbio/anaconda:
#
# Name                    Version                   Build  Channel
geneimpacts               0.3.4                    py27_0    bioconda

This is the setup that does NOT throw the annotation errors, but lacks the canonicalcolumn.

Our local cloud has:

[login] matdsmet:final $ conda list vcf2db                                                                                                                                                               [8:11:44]
 # packages in environment at /Shared/Software:
#
# Name                    Version                   Build  Channel
vcf2db                    2018.01.23               py27_0    bioconda
[login] matdsmet:final $  conda list geneimpacts                                                                                                                                                         [8:12:27]
# packages in environment at /Shared/Software:
#
# Name                    Version                   Build  Channel
geneimpacts               0.3.1                    py27_0    bioconda

Which DOES throw the annotation errors, but includes all columns.

Hope this helps to pinpoint the bug.

Cheers
M

matthdsm · 2018-04-26T06:25:05Z

Looking at the geneimpacts repo, it probably has something to do with this PR
brentp/geneimpacts#13

matthdsm · 2018-04-26T06:43:11Z

Just tested and can confirm, the missing column only occurs with geneimpacts 0.3.4.
I ran a test with 0.3.3, which fixed all errors.

Cheers
M

chapmanb · 2018-04-27T15:34:30Z

cc'ing in Rory (@roryk) on this since it was his PR. Rory, do you know what might be going on here?

roryk · 2018-04-27T15:42:56Z

Sorry about that folks and thanks for tracking down the problem, I'll take a look.

roryk · 2018-04-30T15:16:19Z

Thanks, for the reproducible example, I'm fixing this now.

roryk · 2018-04-30T15:23:55Z

This seems to work fine with geneimpacts 0.3.4 and vcf2db 2017.12.11 and 2018.01.23:

vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db
/home/rdk4/local/share/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:219: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
3057 variant_impacts:43730	effects time: 13.7	chunk time:26.3	116.26 variants/second
indexing ... finished in 0.6 seconds...
total time: in 27.4 seconds...

>>> import geneimpacts
>>> geneimpacts.__version__
'0.3.4'
>>>

conda list vcf2db
vcf2db                    2017.12.11               py27_0    bioconda

Same result with:

vcf2db                    2018.01.23               py27_0    bioconda

roryk · 2018-04-30T15:46:54Z

Oh I see, it doesn't throw the errors with 0.3.4 but is missing the canonical column. I can read.

roryk · 2018-04-30T18:58:39Z

AFAICT, It looks like vcf2db isn't using the CANONICAL column at all. In that VCF file there are 1756 variants labelled as CANONICAL:

gzip -cd sample1_chr1.vcf.gz | grep -v '^#' | cut -f26 -d'|'  | sort | uniq -c
   1301
   1756 YES

I opened two pull requests, one to have geneimpacts set is_canonical instead of canonical (brentp/geneimpacts#14) and one to have vcf2db.py set the is_canonical column: (#43) for VEP-annotated variants.

Loading the database like this:

python /n/app/bcbio/dev/rory-dev/vcf2db/vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db

gemini query -q "select is_canonical from variants" sample1.db | sort | uniq -c
   1173 0
   1884 1

I'm not super sure where the discrepancy is coming from.

brentp · 2018-04-30T19:07:17Z

not sure I understand how you go more canonicals in variants that you do in the VCF.

but, I don't see where you are telling the geneimpacts classes to prioritize_canonical.

brentp · 2018-04-30T19:10:04Z

for your first command, counting in the vcf, I would use:

bcftools query -f "%INFO/CSQ" $vcf | awk 'BEGIN{FS="|"}(NF > 25) { print $26 }'

or something like that to make sure you're not getting missing columns.

roryk · 2018-04-30T19:18:05Z

Thanks, Brent, looks similar:

bcftools query -f %INFO/CSQ"\n" sample1_chr1.vcf.gz | awk 'BEGIN{FS="|"}(NF > 25) { print $26 }' | sort | uniq -c
   1301
   1756 YES

I was trying to fix the issue where the canonical status isn't populated to the variants table, even though it is set. So variants aren't necessarily prioritized by canonical status, but the status doesn't show up in the table anyway.

brentp · 2018-04-30T19:27:13Z

what does your vcf header for CSQ look like?

roryk · 2018-04-30T19:28:32Z

Heya, this is the VCF linked here from further up in the issue:

https://github.com/quinlan-lab/vcf2db/files/1941507/sample1_chr1.vcf.gz

roryk · 2018-04-30T19:53:45Z

Duh, I see-- some of the higher impact variants are also canonical, which affects what is loaded in the main table. I pushed another change to vcf2db.py to include a prioritize_canonical flag to actually use the canonical prioritization when loading up the database, so now there are two changes. If canonical is set, add it as a field to the main variants table. If it is set, and prioritize_canonical is set, prefer canonical variants above all else. Feel free to thumbs down all of these changes, I just rabbitholed on the original issue.

brentp · 2018-04-30T20:04:59Z

also, I think your is_canonical should be:

      @property
      def is_canonical(self):                                                                                                                                                   
         return self.effects.get("CANONICAL", "") != ""

so that it is forced to return a boolean. otherwise, I get an error because it's using "YES"

matthdsm · 2018-06-05T07:04:19Z

Hi Brent,

Sorry to bother you again, but this issue still isn't fixed. I get the same error using the latest version of vcf2db/geneimpacts.

Are there any logs of whatever I could provide to help debugging?
M

brentp · 2018-06-05T17:55:40Z

@matthdsm can you help me debug. I did this:

brentp@lorax:~/src/vcf2db$ python -c "import geneimpacts; print geneimpacts.__version__"
0.3.5
brentp@lorax:~/src/vcf2db$ python vcf2db.py sample1_chr1.vcf.gz sample1.ped x.db 
/data/gemini_install/data/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'fam1'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
/data/gemini_install/data/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
3057 variant_impacts:43730	effects time: 13.3	chunk time:24.9	122.76 variants/second
indexing ... finished in 0.1 seconds...
total time: in 25.4 seconds...
brentp@lorax:~/src/vcf2db$ echo ".schema variants" | sqlite3 x.db  | grep canon
	is_canonical BOOLEAN, 
	CHECK (is_canonical IN (0, 1)), 
brentp@lorax:~/src/vcf2db$ echo "select is_canonical from variants;" | sqlite3 x.db  | sort | uniq -c
   1173 0
   1884 1
brentp@lorax:~/src/vcf2db$ echo "select is_canonical from variant_impacts;" | sqlite3 x.db  | sort | uniq -c
 35630 0
   8100 1
brentp@lorax:~/src/vcf2db$ python count-canon.py sample1_chr1.vcf.gz
[35630, 8100]

So count-canon.py shows that what's in the database matches what's in the VCF. I'm probably missing something obvious, but I don't know what is the problem you're encountering.

here is count-canon.py:

from cyvcf2 import VCF
import sys

ch = " Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|REFSEQ_MATCH|SOURCE|GIVEN_REF|USED_REF|BAM_EDIT|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|SpliceRegion".split("|") 

canons = [0, 0]
for v in VCF(sys.argv[1]):
    csq = v.INFO.get("CSQ")
    if csq is None:
        canons[0] += 1
    else:
        for c in csq.split(","):
            d = dict(zip(ch, c.split("|")))
            canons[int(d.get("CANONICAL") == "YES")] += 1

print canons

matthdsm · 2018-06-06T08:09:43Z

Hi Brent,

I got the following error:

Traceback (most recent call last):
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 920, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 232, in __init__
    self.load()
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 317, in load
    i = self._load(self.cache, create=True, start=1)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 310, in _load
    self.insert(variants, expanded, keys, i, create=create)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 372, in insert
    vilengths, variant_impacts)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 400, in _insert
    self.__insert(v_objs, self.metadata.tables['variants'].insert())
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 434, in __insert
    raise e
sqlalchemy.exc.StatementError: (exceptions.TypeError) Not a boolean value:

This error occurs when using the bioconda version of geneimpacts. I noticed this version doesn't include @roryk's latest commits.
Would it work for you to update the latest release to include that commit?

When using a manual install of geneimpacts, this error doesn't occur AND the canonical col is fixed.

I do notice that the output of the SQL query and count-canon.py don't match..

[10:04:15] matdsmet:HSQ_136 $ echo "select is_canonical from variants;" | sqlite3 D1711586-joint-gatk-haplotype.db  | sort | uniq -c 
10974 0
21276 1
[10:04:30] matdsmet:HSQ_136 $ echo "select is_canonical from variant_impacts;" | sqlite3 D1711586-joint-gatk-haplotype.db  | sort | uniq -c 
401535 0
89323 1
[10:05:26] matdsmet:HSQ_136 $ python count-canon.py D1711586-joint-gatk-haplotype-joint-annotated-decomposed.vcf.gz 
[490858, 0]

Thanks for the help
M

matthdsm · 2018-06-06T08:13:07Z

For reference:

[10:10:13] matdsmet:HSQ_136 $ sqlite3 --version
3.23.1 2018-04-10 17:39:29 4bb2294022060e61de7da5c227a69ccd846ba330e31626ebcd59a94efd148b3b
[10:10:19] matdsmet:HSQ_136 $ python --version
Python 2.7.14 :: Anaconda, Inc.
[10:11:26] matdsmet:HSQ_136 $ python -c "import cyvcf2; print cyvcf2.__version__"            
0.8.4

chapmanb · 2018-06-06T09:39:15Z

Matthias;
Thanks for the catch. It looks like the latest release (0.3.5) that's in bioconda is missing the last two commits (it's tagged on afa1af3 https://github.com/brentp/geneimpacts/commits/master), which I guess is providing the fix. Brent, if that makes sense with you, could you be willing to tag a 0.3.6 including those changes and I'll update the bioconda recipe. Thanks again for the help debugging this.

brentp · 2018-06-06T13:10:40Z

I bumped and tagged v0.3.6. thanks guys.

chapmanb · 2018-06-06T17:48:09Z

Nice, thank you Brent. I bumped the bioconda package with this update. Matthias -- hope this gets everything working cleanly for you. Thanks again for the help debugging.

brentp closed this as completed Jun 6, 2018

chapmanb added a commit to chapmanb/bioconda-recipes that referenced this issue Jun 6, 2018

Update: geneimpacts to v0.3.6 fixes quinlan-lab/vcf2db#42

d97c5ba

chapmanb added a commit to bioconda/bioconda-recipes that referenced this issue Jun 6, 2018

Update: geneimpacts to v0.3.6 fixes quinlan-lab/vcf2db#42 (#9164)

5e5cadb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vcf2db cli error #42

vcf2db cli error #42

matthdsm commented Apr 24, 2018 •

edited

Loading

matthdsm commented Apr 24, 2018 •

edited

Loading

brentp commented Apr 24, 2018

matthdsm commented Apr 24, 2018

chapmanb commented Apr 25, 2018

matthdsm commented Apr 26, 2018

matthdsm commented Apr 26, 2018

matthdsm commented Apr 26, 2018

chapmanb commented Apr 27, 2018

roryk commented Apr 27, 2018

roryk commented Apr 30, 2018

roryk commented Apr 30, 2018 •

edited

Loading

roryk commented Apr 30, 2018

roryk commented Apr 30, 2018

brentp commented Apr 30, 2018

brentp commented Apr 30, 2018 •

edited

Loading

roryk commented Apr 30, 2018

brentp commented Apr 30, 2018

roryk commented Apr 30, 2018

roryk commented Apr 30, 2018

brentp commented Apr 30, 2018

matthdsm commented Jun 5, 2018

brentp commented Jun 5, 2018

matthdsm commented Jun 6, 2018 •

edited

Loading

matthdsm commented Jun 6, 2018

chapmanb commented Jun 6, 2018

brentp commented Jun 6, 2018

chapmanb commented Jun 6, 2018

vcf2db cli error #42

vcf2db cli error #42

Comments

matthdsm commented Apr 24, 2018 • edited Loading

matthdsm commented Apr 24, 2018 • edited Loading

brentp commented Apr 24, 2018

matthdsm commented Apr 24, 2018

chapmanb commented Apr 25, 2018

matthdsm commented Apr 26, 2018

matthdsm commented Apr 26, 2018

matthdsm commented Apr 26, 2018

chapmanb commented Apr 27, 2018

roryk commented Apr 27, 2018

roryk commented Apr 30, 2018

roryk commented Apr 30, 2018 • edited Loading

roryk commented Apr 30, 2018

roryk commented Apr 30, 2018

brentp commented Apr 30, 2018

brentp commented Apr 30, 2018 • edited Loading

roryk commented Apr 30, 2018

brentp commented Apr 30, 2018

roryk commented Apr 30, 2018

roryk commented Apr 30, 2018

brentp commented Apr 30, 2018

matthdsm commented Jun 5, 2018

brentp commented Jun 5, 2018

matthdsm commented Jun 6, 2018 • edited Loading

matthdsm commented Jun 6, 2018

chapmanb commented Jun 6, 2018

brentp commented Jun 6, 2018

chapmanb commented Jun 6, 2018

matthdsm commented Apr 24, 2018 •

edited

Loading

matthdsm commented Apr 24, 2018 •

edited

Loading

roryk commented Apr 30, 2018 •

edited

Loading

brentp commented Apr 30, 2018 •

edited

Loading

matthdsm commented Jun 6, 2018 •

edited

Loading