vcf2db cli error #42
Comments
Correction, the missing column only occurs in db's created by bcbio! M |
could be that bcbio needs to update either vcf2db or geneimpacts ? |
Hi Brent, Thanks for the reply. I don't think so, bcbio has the latest versions available in conda...
M |
Matthias and Brent; If you look in |
Hi Brent, Brad. I did some more digging, and I think the error is coming from the geneimpacts module.
This is the setup that does NOT throw the annotation errors, but lacks the Our local cloud has:
Which DOES throw the annotation errors, but includes all columns. Hope this helps to pinpoint the bug. Cheers |
Looking at the geneimpacts repo, it probably has something to do with this PR |
Just tested and can confirm, the missing column only occurs with geneimpacts 0.3.4. Cheers |
cc'ing in Rory (@roryk) on this since it was his PR. Rory, do you know what might be going on here? |
Sorry about that folks and thanks for tracking down the problem, I'll take a look. |
Thanks, for the reproducible example, I'm fixing this now. |
This seems to work fine with geneimpacts 0.3.4 and vcf2db 2017.12.11 and 2018.01.23:
vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db
/home/rdk4/local/share/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:219: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
(util.ellipses_string(value),))
3057 variant_impacts:43730 effects time: 13.7 chunk time:26.3 116.26 variants/second
indexing ... finished in 0.6 seconds...
total time: in 27.4 seconds...
>>> import geneimpacts
>>> geneimpacts.__version__
'0.3.4'
>>> conda list vcf2db
vcf2db 2017.12.11 py27_0 bioconda
Same result with:
vcf2db 2018.01.23 py27_0 bioconda |
Oh I see, it doesn't throw the errors with 0.3.4 but is missing the canonical column. I can read. |
AFAICT, it looks like vcf2db isn't using the CANONICAL column at all. In that VCF file there are 1756 variants labelled as CANONICAL:
gzip -cd sample1_chr1.vcf.gz | grep -v '^#' | cut -f26 -d'|' | sort | uniq -c
1301
1756 YES
I opened two pull requests, one to have geneimpacts set
Loading the database like this:
python /n/app/bcbio/dev/rory-dev/vcf2db/vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db
gemini query -q "select is_canonical from variants" sample1.db | sort | uniq -c
1173 0
1884 1
I'm not super sure where the discrepancy is coming from. |
not sure I understand how you get more canonicals in variants than you do in the VCF. but, I don't see where you are telling the |
for your first command, counting in the vcf, I would use:
or something like that to make sure you're not getting missing columns. |
Thanks, Brent, looks similar:
I was trying to fix the issue where the canonical status isn't populated to the variants table, even though it is set. So variants aren't necessarily prioritized by canonical status, but the status doesn't show up in the table anyway. |
what does your vcf header for CSQ look like? |
Heya, this is the VCF linked here from further up in the issue:
|
Duh, I see-- some of the higher impact variants are also canonical, which affects what is loaded in the main table. I pushed another change to vcf2db.py to include a |
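To illustrate the point above: counting the CANONICAL flag on the first transcript listed for each variant can give a different total than counting it on the transcript chosen by impact. The following is a minimal sketch under stated assumptions, not vcf2db's actual selection logic. The `SEVERITY` ranking, the `top_transcript` helper, and the toy variants are all hypothetical.

```python
# Hypothetical severity ranking, for illustration only.
SEVERITY = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}

def top_transcript(transcripts):
    """Pick the transcript with the most severe impact (first one wins ties)."""
    return max(transcripts, key=lambda t: SEVERITY[t["impact"]])

# Toy data: each variant carries a list of per-transcript annotations.
variants = [
    [{"impact": "LOW", "canonical": False}, {"impact": "HIGH", "canonical": True}],
    [{"impact": "MODIFIER", "canonical": True}, {"impact": "MODERATE", "canonical": False}],
    [{"impact": "MODIFIER", "canonical": False}, {"impact": "MODERATE", "canonical": True}],
]

# Canonical count when only the first listed transcript is inspected.
first_listed = sum(v[0]["canonical"] for v in variants)
# Canonical count when the highest-impact transcript is inspected.
top_impact = sum(top_transcript(v)["canonical"] for v in variants)

print(first_listed, top_impact)
```

The two totals disagree on this toy input because the transcript picked by impact is not always the first one in the annotation string.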
also, I think your
so that it is forced to return a boolean. otherwise, I get an error because it's using "YES" |
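The exact change suggested above is not shown in this thread. As a rough sketch of the idea only, assuming the raw VEP value is either the string "YES" or empty/missing, a comparison yields a real boolean instead of passing the string through. The helper name `is_canonical` is hypothetical and is not taken from geneimpacts.

```python
def is_canonical(raw):
    """Map a raw CANONICAL value ('YES', '' or None) to a bool."""
    return raw == "YES"

# Toy annotation records; the keys mirror VEP's CSQ field name.
records = [{"CANONICAL": "YES"}, {"CANONICAL": ""}, {}]
flags = [is_canonical(r.get("CANONICAL")) for r in records]
print(flags)
```

Returning the result of the comparison keeps the stored value a strict True/False, which is what a boolean database column expects.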
Hi Brent, Sorry to bother you again, but this issue still isn't fixed. I get the same error using the latest version of vcf2db/geneimpacts. Are there any logs or whatever I could provide to help debugging? |
@matthdsm can you help me debug. I did this:
So here is count-canon.py:
from cyvcf2 import VCF
import sys
ch = " Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|REFSEQ_MATCH|SOURCE|GIVEN_REF|USED_REF|BAM_EDIT|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|SpliceRegion".split("|")
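# canons[0] counts non-canonical entries, canons[1] counts canonical ones
canons = [0, 0]
for v in VCF(sys.argv[1]):
    csq = v.INFO.get("CSQ")
    if csq is None:
        # variants without a CSQ annotation count as non-canonical
        canons[0] += 1
    else:
        for c in csq.split(","):
            d = dict(zip(ch, c.split("|")))
            canons[int(d.get("CANONICAL") == "YES")] += 1
print(canons)
|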
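The script above hardcodes the CSQ field order. A minimal sketch, assuming the CSQ header description ends with the usual `Format: A|B|...` list, derives the field names from the header text instead, so a different VEP configuration does not shift the CANONICAL position. The function names `csq_fields` and `count_canonical` are hypothetical, and the description and CSQ strings below are made-up toy values rather than data from the sample VCF.

```python
import re

def csq_fields(description):
    """Return the CSQ sub-field names listed after 'Format:' in a header description."""
    match = re.search(r"Format:\s*([^\"'>]+)", description)
    if match is None:
        raise ValueError("no 'Format:' section in CSQ description")
    return [name.strip() for name in match.group(1).split("|")]

def count_canonical(fields, csq_values):
    """Count [non-canonical, canonical] transcript entries across CSQ strings."""
    counts = [0, 0]
    for csq in csq_values:
        if csq is None:
            # A variant with no CSQ annotation counts as non-canonical.
            counts[0] += 1
            continue
        for entry in csq.split(","):
            record = dict(zip(fields, entry.split("|")))
            counts[int(record.get("CANONICAL") == "YES")] += 1
    return counts

# Toy header description and CSQ values (assumptions, for illustration only).
description = "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|CANONICAL"
fields = csq_fields(description)
toy_csq = [
    "A|missense_variant|MODERATE|YES,A|intron_variant|MODIFIER|",
    None,
    "T|synonymous_variant|LOW|YES",
]
print(count_canonical(fields, toy_csq))
```

In a real run, the description string would come from the CSQ INFO line in the VCF header and the CSQ values from each record.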
Hi Brent, I got the following error:
This error occurs when using the bioconda version of When using a manual install of I do notice that the output of the SQL query and
Thanks for the help |
For reference:
|
Matthias; |
I bumped and tagged v0.3.6. thanks guys. |
Nice, thank you Brent. I bumped the bioconda package with this update. Matthias -- hope this gets everything working cleanly for you. Thanks again for the help debugging. |
Hi Brent,
I got the following error using vcf2db:
It seems vcf2db has a problem with some data inside the CSQ tag from VEP. I also noticed not all data from the CSQ tag makes it into the db. Especially annoying for us is that for some reason, the canonical column is nowhere to be found, even though it's in the VCF. I've included a sample vcf, so you might be able to replicate the error.
The vcf was created using bcbio v1.0.9
Cheers and thanks for the help
M