Mykrobe overcounts homopol deletions when making probes #148

Open
iqbal-lab opened this issue Feb 24, 2022 · 3 comments

@iqbal-lab
Collaborator

The probe generator makes one probe for each 1bp and 2bp deletion in katG and pncA. But if you make all the 1bp deletions in a homopolymer, they are essentially the same.
e.g. deleting one A from AAA could happen in 3 places and give the same sequence.
This leads to annoying results where we report >1 mutation being detected, when it is basically the same thing detected more than once.

Workaround suggestions

  1. compare all the indel/frame-shift probes against each other and dedup
  2. pay attention to whether you are in a homopolymer when creating the 1bp and 2bp deletions, and don't overcreate probes
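
A minimal sketch of the first suggestion, applied at probe-creation time: enumerate every 1bp and 2bp deletion, then keep only the leftmost deletion for each distinct resulting sequence. The function names `deletion_probes` and `dedup_deletions` are hypothetical and are not part of Mykrobe; the toy reference `GAAAT` stands in for a homopolymer with one flanking base on each side.

```python
def deletion_probes(ref, max_len=2):
    """Yield (start, length, alt) for every deletion of 1..max_len bases.

    `start` is 0-based; `alt` is the reference with that deletion applied.
    """
    for length in range(1, max_len + 1):
        for start in range(len(ref) - length + 1):
            yield start, length, ref[:start] + ref[start + length:]


def dedup_deletions(ref, max_len=2):
    """Map each distinct alt sequence to the leftmost deletion that produces it."""
    unique = {}
    for start, length, alt in deletion_probes(ref, max_len):
        # setdefault keeps the first (leftmost) deletion seen for this alt
        unique.setdefault(alt, (start, length))
    return unique


if __name__ == "__main__":
    ref = "GAAAT"
    candidates = list(deletion_probes(ref))
    unique = dedup_deletions(ref)
    print(len(candidates), "candidate deletions,", len(unique), "distinct alt sequences")
    for alt, (start, length) in unique.items():
        print(f"delete {length}bp at offset {start}: {alt}")
```

For `GAAAT`, the three 1bp deletions inside `AAA` all give `GAAT`, so only one of them would be kept. Keeping the leftmost representative is a design choice in this sketch; any consistent rule would work.
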
@mbhall88
Member

For context, we have this example:

            "Isoniazid": {
                "predict": "R",
                "called_by": {
                    "katG_GC1037G-GC2155074C": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -711.0227088947968,
                            -417.9483717274669
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 7,
                                    "min_non_zero_depth": 6,
                                    "kmer_count": 175,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 15,
                                    "min_non_zero_depth": 13,
                                    "kmer_count": 253,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                24
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 293
                        },
                        "_cls": "Call.VariantCall"
                    },
                    "katG_CC1038C-GG2155073G": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -727.5071540961122,
                            -377.76816030583126
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 7,
                                    "min_non_zero_depth": 6,
                                    "kmer_count": 160,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 15,
                                    "min_non_zero_depth": 13,
                                    "kmer_count": 253,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                24
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 350
                        },
                        "_cls": "Call.VariantCall"
                    },
                    "katG_CC1039C-GG2155072G": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -729.808268296947,
                            -372.51398695693916
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 7,
                                    "min_non_zero_depth": 6,
                                    "kmer_count": 158,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 15,
                                    "min_non_zero_depth": 13,
                                    "kmer_count": 253,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                24
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 357
                        },
                        "_cls": "Call.VariantCall"
                    }
                }
            },

The three mutations in the JSON above could conceivably be the same deletion...or a 2bp deletion I guess.

GC1037G
CC1038C
CC1039C

I guess it probably is a single 1bp deletion - at 1038.

Although I'm a bit confused about how the mutations work. Which of the two bases is the one at the position?

If I pull out these three bases, with one base flanking, from the katG reference, I get G GCC T. So if the position describes the first base, then 1039 should read CT1039C right? But if the second base describes the position, it's the same problem right?

@martinghunt
Member

The sequence at 1038-1040 in katG is CCC.
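
As a quick check, here is a sketch that applies the three variant names to a four-base window and compares the results. It assumes the names follow a `<ref><pos><alt>` pattern in which `<pos>` is the 1-based gene coordinate of the first reference base; that convention is an assumption, not confirmed Mykrobe behaviour. Under it, `GC1037G` puts G at 1037, which together with CCC at 1038-1040 gives the window `GCCC`. The helper `apply_variant` is hypothetical.

```python
import re

# Assumed name format: <ref allele><1-based position of first ref base><alt allele>
NAME = re.compile(r"^([ACGT]+)(\d+)([ACGT]+)$")


def apply_variant(window, window_start, name):
    """Apply a named variant to `window`, whose first base sits at `window_start`."""
    match = NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised variant name: {name}")
    ref_allele, pos, alt_allele = match.group(1), int(match.group(2)), match.group(3)
    i = pos - window_start
    if window[i:i + len(ref_allele)] != ref_allele:
        raise ValueError(f"{name}: ref allele does not match window at {pos}")
    return window[:i] + alt_allele + window[i + len(ref_allele):]


# katG 1037-1040 under the assumption above (G at 1037, CCC at 1038-1040)
window = "GCCC"
alts = {
    name: apply_variant(window, 1037, name)
    for name in ("GC1037G", "CC1038C", "CC1039C")
}

if __name__ == "__main__":
    for name, alt in alts.items():
        print(name, "->", alt)
```

Under that convention all three names match the window and produce the same sequence, which fits the reading that they describe a single 1bp deletion in the C homopolymer.
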

@mbhall88
Member

Hmmm, wonder why I got a different sequence.
