Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

vcf2db cli error #42

Closed
matthdsm opened this issue Apr 24, 2018 · 27 comments
Closed

vcf2db cli error #42

matthdsm opened this issue Apr 24, 2018 · 27 comments

Comments

@matthdsm
Copy link
Contributor

matthdsm commented Apr 24, 2018

Hi Brent,

I got the following error using vcf2db:

[login] matdsmet:2018-04-21_NSQ_147 $ vcf2db.py D1514352-joint-gatk-haplotype-joint-annotated-decomposed.vcf.gz D1514352.ped D1514352-joint-gatk-haplotype-joint.db                                      [9:33:35]
/Shared/Software/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'D1514352-joint'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
/Shared/Software/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
WARNING: unknown severity for '-|frameshift_variant&start_lost&start_retained_variant|HIGH|HRNR|388697|Transcript|NM_001009931.2|protein_coding|2/3||NM_001009931.2:c.1del|NP_001009931.1:p.Met1?|80/9632|1/8553|1/2850|M/X|Atg/tg|rs34061715&COSM111478|1||-1||deletion|EntrezGene|HGNC:20846|YES||||NP_001009931.1||||rseq_mrna_match|RefSeq|T|T||||||||0.7337|0.8818|0.9544|0.9592|0.8875|||0.9028|0.7227|0.8276|0.9554|0.9063|0.9541|0.9411|0.9142|0.9069|0.9592|EUR||0&1|0&1||||||HC|||||||'. using LOW for [u'frameshift_variant', u'start_lost', u'start_retained_variant']
Please report this on github with the effect-string above

It seems vcf2db has a problem with some data inside the CSQ tag from VEP. I also noticed not all data from the CSQ tag makes it into the db. Especially annoying for us is that for some reason, the canonical column is nowhere to be found, even though it's in the VCF.

I've included a sample vcf, so you might be able to replicate the error.
The vcf was created using bcbio v1.0.9

Cheers and thanks for the help
M

@matthdsm
Copy link
Contributor Author

matthdsm commented Apr 24, 2018

Correction, the missing column only occurs in db's created by bcbio!
Ping @chapmanb. Do you have any idea where this could occur? When manually creating a db from the decomposed vcf in the final dir, all data is present, but the pregenerated db's are missing the canonical column...

M

@brentp
Copy link
Member

brentp commented Apr 24, 2018

could be that bcbio needs to update either vcf2db or geneimpacts ?

@matthdsm
Copy link
Contributor Author

Hi Brent,

Thanks for the reply. I don't think so, bcbio has the latest versions available in conda...

# Name                    Version                   Build  Channel
vcf2db                    2018.01.23               py27_0    bioconda
geneimpacts               0.3.4                    py27_0    bioconda

M

@chapmanb
Copy link
Contributor

Matthias and Brent;
Sorry about the issue and thanks for cc'ing me into the conversation. I'm not sure why you'd see a different between bcbio vcf2db.py runs and those done by hand. bcbio isn't doing much fancy here beyond calling the command:

https://github.com/bcbio/bcbio-nextgen/blob/1083b6768511c20c5d006b0a49758fc6d3bb2167/bcbio/variation/population.py#L67

If you look in log/bcbio-nextgen-commands.log, is the same command being run that you're testing outside? Are there any other differences that could help us explore what is happening? Thanks for the help debugging.

@matthdsm
Copy link
Contributor Author

Hi Brent, Brad.

I did some more digging, and I think the error is coming from the geneimpacts module.
Our HPC cloud has:

#packages
-bash-4.2$ conda list vcf2db
# packages in environment at /user/data/gent/gvo000/gvo00082/vsc41443/bcbio/anaconda:
#
# Name                    Version                   Build  Channel
vcf2db                    2018.01.23               py27_0    bioconda
-bash-4.2$ conda list geneimpacts
# packages in environment at /user/data/gent/gvo000/gvo00082/vsc41443/bcbio/anaconda:
#
# Name                    Version                   Build  Channel
geneimpacts               0.3.4                    py27_0    bioconda

This is the setup that does NOT throw the annotation errors, but lacks the canonicalcolumn.

Our local cloud has:

[login] matdsmet:final $ conda list vcf2db                                                                                                                                                               [8:11:44]
 # packages in environment at /Shared/Software:
#
# Name                    Version                   Build  Channel
vcf2db                    2018.01.23               py27_0    bioconda
[login] matdsmet:final $  conda list geneimpacts                                                                                                                                                         [8:12:27]
# packages in environment at /Shared/Software:
#
# Name                    Version                   Build  Channel
geneimpacts               0.3.1                    py27_0    bioconda

Which DOES throw the annotation errors, but includes all columns.

Hope this helps to pinpoint the bug.

Cheers
M

@matthdsm
Copy link
Contributor Author

Looking at the geneimpacts repo, it probably has something to do with this PR
brentp/geneimpacts#13

@matthdsm
Copy link
Contributor Author

Just tested and can confirm, the missing column only occurs with geneimpacts 0.3.4.
I ran a test with 0.3.3, which fixed all errors.

Cheers
M

@chapmanb
Copy link
Contributor

cc'ing in Rory (@roryk) on this since it was his PR. Rory, do you know what might be going on here?

@roryk
Copy link

roryk commented Apr 27, 2018

Sorry about that folks and thanks for tracking down the problem, I'll take a look.

@roryk
Copy link

roryk commented Apr 30, 2018

Thanks, for the reproducible example, I'm fixing this now.

@roryk
Copy link

roryk commented Apr 30, 2018

This seems to work fine with geneimpacts 0.3.4 and vcf2db 2017.12.11 and 2018.01.23:

vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db
/home/rdk4/local/share/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:219: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
3057 variant_impacts:43730	effects time: 13.7	chunk time:26.3	116.26 variants/second
indexing ... finished in 0.6 seconds...
total time: in 27.4 seconds...
>>> import geneimpacts
>>> geneimpacts.__version__
'0.3.4'
>>>
conda list vcf2db
vcf2db                    2017.12.11               py27_0    bioconda

Same result with:

vcf2db                    2018.01.23               py27_0    bioconda

@roryk
Copy link

roryk commented Apr 30, 2018

Oh I see, it doesn't throw the errors with 0.3.4 but is missing the canonical column. I can read.

@roryk
Copy link

roryk commented Apr 30, 2018

AFAICT, It looks like vcf2db isn't using the CANONICAL column at all. In that VCF file there are 1756 variants labelled as CANONICAL:

gzip -cd sample1_chr1.vcf.gz | grep -v '^#' | cut -f26 -d'|'  | sort | uniq -c
   1301
   1756 YES

I opened two pull requests, one to have geneimpacts set is_canonical instead of canonical (brentp/geneimpacts#14) and one to have vcf2db.py set the is_canonical column: (#43) for VEP-annotated variants.

Loading the database like this:

python /n/app/bcbio/dev/rory-dev/vcf2db/vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db
gemini query -q "select is_canonical from variants" sample1.db | sort | uniq -c
   1173 0
   1884 1

I'm not super sure where the discrepancy is coming from.

@brentp
Copy link
Member

brentp commented Apr 30, 2018

not sure I understand how you go more canonicals in variants that you do in the VCF.

but, I don't see where you are telling the geneimpacts classes to prioritize_canonical.

@brentp
Copy link
Member

brentp commented Apr 30, 2018

for your first command, counting in the vcf, I would use:

bcftools query -f "%INFO/CSQ" $vcf | awk 'BEGIN{FS="|"}(NF > 25) { print $26 }'

or something like that to make sure you're not getting missing columns.

@roryk
Copy link

roryk commented Apr 30, 2018

Thanks, Brent, looks similar:

bcftools query -f %INFO/CSQ"\n" sample1_chr1.vcf.gz | awk 'BEGIN{FS="|"}(NF > 25) { print $26 }' | sort | uniq -c
   1301
   1756 YES

I was trying to fix the issue where the canonical status isn't populated to the variants table, even though it is set. So variants aren't necessarily prioritized by canonical status, but the status doesn't show up in the table anyway.

@brentp
Copy link
Member

brentp commented Apr 30, 2018

what does your vcf header for CSQ look like?

@roryk
Copy link

roryk commented Apr 30, 2018

Heya, this is the VCF linked here from further up in the issue:

https://github.com/quinlan-lab/vcf2db/files/1941507/sample1_chr1.vcf.gz

@roryk
Copy link

roryk commented Apr 30, 2018

Duh, I see-- some of the higher impact variants are also canonical, which affects what is loaded in the main table. I pushed another change to vcf2db.py to include a prioritize_canonical flag to actually use the canonical prioritization when loading up the database, so now there are two changes. If canonical is set, add it as a field to the main variants table. If it is set, and prioritize_canonical is set, prefer canonical variants above all else. Feel free to thumbs down all of these changes, I just rabbitholed on the original issue.

@brentp
Copy link
Member

brentp commented Apr 30, 2018

also, I think your is_canonical should be:

      @property
      def is_canonical(self):                                                                                                                                                   
         return self.effects.get("CANONICAL", "") != "" 

so that it is forced to return a boolean. otherwise, I get an error because it's using "YES"

@matthdsm
Copy link
Contributor Author

matthdsm commented Jun 5, 2018

Hi Brent,

Sorry to bother you again, but this issue still isn't fixed. I get the same error using the latest version of vcf2db/geneimpacts.

Are there any logs of whatever I could provide to help debugging?
M

@brentp
Copy link
Member

brentp commented Jun 5, 2018

@matthdsm can you help me debug. I did this:

brentp@lorax:~/src/vcf2db$ python -c "import geneimpacts; print geneimpacts.__version__"
0.3.5
brentp@lorax:~/src/vcf2db$ python vcf2db.py sample1_chr1.vcf.gz sample1.ped x.db 
/data/gemini_install/data/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'fam1'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
/data/gemini_install/data/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
3057 variant_impacts:43730	effects time: 13.3	chunk time:24.9	122.76 variants/second
indexing ... finished in 0.1 seconds...
total time: in 25.4 seconds...
brentp@lorax:~/src/vcf2db$ echo ".schema variants" | sqlite3 x.db  | grep canon
	is_canonical BOOLEAN, 
	CHECK (is_canonical IN (0, 1)), 
brentp@lorax:~/src/vcf2db$ echo "select is_canonical from variants;" | sqlite3 x.db  | sort | uniq -c
   1173 0
   1884 1
brentp@lorax:~/src/vcf2db$ echo "select is_canonical from variant_impacts;" | sqlite3 x.db  | sort | uniq -c
 35630 0
   8100 1
brentp@lorax:~/src/vcf2db$ python count-canon.py sample1_chr1.vcf.gz
[35630, 8100]

So count-canon.py shows that what's in the database matches what's in the VCF. I'm probably missing something obvious, but I don't know what is the problem you're encountering.

here is count-canon.py:

from cyvcf2 import VCF
import sys

ch = " Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|REFSEQ_MATCH|SOURCE|GIVEN_REF|USED_REF|BAM_EDIT|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|SpliceRegion".split("|") 

canons = [0, 0]
for v in VCF(sys.argv[1]):
    csq = v.INFO.get("CSQ")
    if csq is None:
        canons[0] += 1
    else:
        for c in csq.split(","):
            d = dict(zip(ch, c.split("|")))
            canons[int(d.get("CANONICAL") == "YES")] += 1

print canons

@matthdsm
Copy link
Contributor Author

matthdsm commented Jun 6, 2018

Hi Brent,

I got the following error:

Traceback (most recent call last):
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 920, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 232, in __init__
    self.load()
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 317, in load
    i = self._load(self.cache, create=True, start=1)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 310, in _load
    self.insert(variants, expanded, keys, i, create=create)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 372, in insert
    vilengths, variant_impacts)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 400, in _insert
    self.__insert(v_objs, self.metadata.tables['variants'].insert())
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 434, in __insert
    raise e
sqlalchemy.exc.StatementError: (exceptions.TypeError) Not a boolean value:

This error occurs when using the bioconda version of geneimpacts. I noticed this version doesn't include @roryk's latest commits.
Would it work for you to update the latest release to include that commit?

When using a manual install of geneimpacts, this error doesn't occur AND the canonical col is fixed.

I do notice that the output of the SQL query and count-canon.py don't match..

[10:04:15] matdsmet:HSQ_136 $ echo "select is_canonical from variants;" | sqlite3 D1711586-joint-gatk-haplotype.db  | sort | uniq -c 
10974 0
21276 1
[10:04:30] matdsmet:HSQ_136 $ echo "select is_canonical from variant_impacts;" | sqlite3 D1711586-joint-gatk-haplotype.db  | sort | uniq -c 
401535 0
89323 1
[10:05:26] matdsmet:HSQ_136 $ python count-canon.py D1711586-joint-gatk-haplotype-joint-annotated-decomposed.vcf.gz 
[490858, 0]

Thanks for the help
M

@matthdsm
Copy link
Contributor Author

matthdsm commented Jun 6, 2018

For reference:

[10:10:13] matdsmet:HSQ_136 $ sqlite3 --version
3.23.1 2018-04-10 17:39:29 4bb2294022060e61de7da5c227a69ccd846ba330e31626ebcd59a94efd148b3b
[10:10:19] matdsmet:HSQ_136 $ python --version
Python 2.7.14 :: Anaconda, Inc.
[10:11:26] matdsmet:HSQ_136 $ python -c "import cyvcf2; print cyvcf2.__version__"            
0.8.4

@chapmanb
Copy link
Contributor

chapmanb commented Jun 6, 2018

Matthias;
Thanks for the catch. It looks like the latest release (0.3.5) that's in bioconda is missing the last two commits (it's tagged on afa1af3 https://github.com/brentp/geneimpacts/commits/master), which I guess is providing the fix. Brent, if that makes sense with you, could you be willing to tag a 0.3.6 including those changes and I'll update the bioconda recipe. Thanks again for the help debugging this.

@brentp
Copy link
Member

brentp commented Jun 6, 2018

I bumped and tagged v0.3.6. thanks guys.

@brentp brentp closed this as completed Jun 6, 2018
chapmanb added a commit to chapmanb/bioconda-recipes that referenced this issue Jun 6, 2018
@chapmanb
Copy link
Contributor

chapmanb commented Jun 6, 2018

Nice, thank you Brent. I bumped the bioconda package with this update. Matthias -- hope this gets everything working cleanly for you. Thanks again for the help debugging.

chapmanb added a commit to bioconda/bioconda-recipes that referenced this issue Jun 6, 2018
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants