Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Margin could not phase some HiFi variants with UL reads #22

Open
mobinasri opened this issue Jul 19, 2023 · 0 comments
Open

Margin could not phase some HiFi variants with UL reads #22

mobinasri opened this issue Jul 19, 2023 · 0 comments

Comments

@mobinasri
Copy link

For polishing HPRC diploid assemblies we need to phase the variants called by applying DeepVariant on the alignments of all HiFi reads to each haplotype. We are using WhatsHap and Margin to phase HiFi variants with Ultra Long reads (>100kb). Since Margin is multi-threaded it is boosting the speed of our pipeline versus WhatsHap which is not multi-threaded. So we are inclined to use Margin. However we noticed that there exist some variants phased by WhatsHap but not by Margin. I made a small set of three variants (per haplotype) to make this issue easily reproducible. I'm explaining the issues in more detail below and also attaching the IGV snapshots of the three variants left unphased in the paternal haplotype (hap1). The equivalent variants on the other haplotype (maternal or hap2) are having the same issue so I skip writing about them but they can be explored and investigated using the files and IGV sessions provided at the end of this post.

Examples:

  1. Location: HG002#1#JAHKSE010000019.1:39117238-39117397
    This variant is happening in a TG dimer repeat. There is a deletion of length two that has be phased here (genotype = TTG/T). As I mentioned above this variant is called from the HiFi alignments, which are not shown in the screenshot. The UL reads are shown in two groups. Just for more informative visualization these two groups are phased naively in IGV by the nearest SNP which is phased by both WhatsHap and Margin. As expected the UL reads are not very clean here however there exist some signal for phasing and WhatsHap could use it to phase the variant but Margin could not. Top window is showing the variant record in the WhatsHap output and the bottom one is for Margin.

Example_1_Hap1

  1. Location: HG002#1#JAHKSE010000052.1:29343390-29343498
    This variant similar to the first one is happening in a TG dimer repeat and the deletion is of length 2 (genotype = AGT/A). Again UL alignments are phased naively by the nearest phased SNP. Similar to the previous case it is not clean but there is some signal.

Example_2_Hap1

  1. Location: HG002#1#JAHKSE010000064.1:12689434-12689590
    In this case we have two variants beside each other. The first one is a 5-bp deletion with the genotype of ACAAAG/A and the other one is a 1-bp insertion with the genotype of G/GA. Again the UL reads are phased by the nearest SNP. WhatsHap could phase both variants correctly however Margin leaves the insertion unphased.

Example_3_Hap1

The related files such as the raw and phased vcf files, assembly fasta files, UL alignments and HiFi alignments are all uploaded here:
https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/04ab5ae4-0170-11ee-904c-0a13c5208311--HPRC_Polishing/HPRC_Y1/HG002/margin_issues/

"igv_session.xml" can be downloaded locally and loaded in IGV to explore the variants explained above.

Below are the WDL files used for running Margin and WhatsHap:
https://github.com/miramastoras/hpp_production_workflows/blob/79d7f972700ec7f87f0a920c1eb2cf34cc03c692/QC/wdl/tasks/marginPhase.wdl
https://github.com/miramastoras/hpp_production_workflows/blob/79d7f972700ec7f87f0a920c1eb2cf34cc03c692/QC/wdl/tasks/whatsHapPhase.wdl

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant