Fix vcf parsing #25

Open
wants to merge 1 commit into
base: main

xfengnefx

Hi,

Phased variants are read from the VCF file by searching each VCF record for the substring "1|0" or "0|1". This check should be applied only to the last column of a VCF record (in single-sample VCF files), not to the whole record. The same applies to replacing the phasing.
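To illustrate the idea, here is a minimal Python sketch of restricting the check to the sample column. It is not the code in this PR: the function names (`sample_genotype`, `is_phased_het`, `with_genotype`) and the shortened records are hypothetical, and it assumes a tab-separated, single-sample VCF record whose FORMAT column is the ninth field and contains a `GT` key.

```python
def sample_genotype(record: str) -> str:
    """Return the GT value from the last (sample) column of a single-sample VCF record."""
    fields = record.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")      # FORMAT column, e.g. GT:GQ:DP:AD:AF:PS
    sample_values = fields[-1].split(":")   # sample column (assumes a single sample)
    return sample_values[format_keys.index("GT")]


def is_phased_het(record: str) -> bool:
    """True only if the sample genotype itself is 0|1 or 1|0."""
    return sample_genotype(record) in ("0|1", "1|0")


def with_genotype(record: str, new_gt: str) -> str:
    """Return the record with GT replaced in the sample column only; INFO is untouched."""
    fields = record.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    sample_values = fields[-1].split(":")
    sample_values[format_keys.index("GT")] = new_gt
    fields[-1] = ":".join(sample_values)
    return "\t".join(fields)


# Hypothetical, shortened records: the INFO field contains "0|1" inside an
# annotation ("...4980|1436..."), independent of the actual genotype.
info = "P;ANN=A|synonymous_variant|LOW|SHPRH|4308/4980|1436/1659||"
unphased = "\t".join(["chr6", "145913508", ".", "G", "A", "25.38", "PASS",
                      info, "GT:GQ:DP:AD:AF:PS", "1/1:25:89:0,86:0.9663:."])
phased = "\t".join(["chr6", "145913508", ".", "G", "A", "25.38", "PASS",
                    info, "GT:GQ:DP:AD:AF:PS", "0|1:25:89:40,46:0.5:145900000"])

print("0|1" in unphased)        # True: whole-record substring search is fooled
print(is_phased_het(unphased))  # False: the genotype column says 1/1
print(is_phased_het(phased))    # True
```

Looking up `GT` through the FORMAT column rather than assuming it comes first is a design choice of this sketch; for multi-sample files the last-column assumption no longer holds.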

Example: the following line is from an epi2me-labs/wf-human-variation + hapcut2 v1.3.1 run on this bam. The variant is unphased, but the line contains "0|1" in its ANN annotation, which crashes the run by calling int() on a string:

chr6 145913508 . G A 25.38 PASS P;ANN=A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_017010691.2|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4457/5527|4296/5235|1432/1744||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_006715439.4|protein_coding|24/31|c.4296C>T|p.Cys1432Cys|4457/11423|4296/5124|1432/1707||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_006715443.4|protein_coding|24/26|c.4296C>T|p.Cys1432Cys|4457/4780|4296/4524|1432/1507||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_017010693.2|protein_coding|24/31|c.4296C>T|p.Cys1432Cys|4457/5304|4296/5073|1432/1690||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_017010696.2|protein_coding|25/31|c.2853C>T|p.Cys951Cys|4074/5145|2853/3792|951/1263||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_024446394.1|protein_coding|25/31|c.2853C>T|p.Cys951Cys|4374/5445|2853/3792|951/1263||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_017010692.1|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4695/5765|4296/5235|1432/1744||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_024446393.1|protein_coding|25/31|c.3354C>T|p.Cys1118Cys|4590/5660|3354/4293|1118/1430||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|XM_011535719.3|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4457/7072|4296/5034|1432/1677||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|NM_001042683.3|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4956/7596|4296/5052|1432/1683||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|NM_001370327.1|protein_coding|24/30|c.4296C>T|p.Cys1432Cys|4530/7170|4296/5052|1432/1683||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|NM_001370328.1|protein_coding|26/32|c.2853C>T|p.Cys951Cys|4114/6754|2853/3609|951/1202||,A|synonymous_variant|LOW|SHPRH|SHPRH|transcript|NM_173082.4|protein_coding|24/30|c.4308C>T|p.Cys1436Cys|4968/7261|4308/4980|1436/1659||,A|downstream_gene_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_002956273.1|pseudogene||n.*4666C>T|||||4666|,A|non_coding_transcript_exon_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_942391.3|pseudogene|24/29|n.4457C>T||||||,A|non_coding_transcript_exon_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_942393.3|pseudogene|24/29|n.4457C>T||||||,A|non_coding_transcript_exon_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_942392.3|pseudogene|24/29|n.4457C>T||||||,A|non_coding_transcript_exon_variant|MODIFIER|SHPRH|SHPRH|transcript|XR_942390.3|pseudogene|24/29|n.4457C>T|||||| GT:GQ:DP:AD:AF:PS 1/1:25:89:0,86:0.9663:.

Thanks!

Phased variants are read from the VCF file by finding the
"1|0" or "0|1" substring in each VCF record.
This should be done only on the last column.
This fix still assumes the input is a single-sample VCF.
@Fu-Yilei
Collaborator

Thank you for catching this bug. If you have tested the modified code, I can merge the PR.

2 participants