Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

remove restriction of 10 alleles max in multiallelic sites #11

Merged
merged 5 commits into from
Jan 9, 2024

Conversation

akotlar
Copy link
Collaborator

@akotlar akotlar commented Jan 4, 2024

  • Simplifies makeHetHomozygotes and removes the rune-by-rune reading in favor of string by string, which enables us to have unlimited numbers of alleles. This is useful because dbSNP VCF format has many submissions with > 9 alleles

Previously we used a fast parsing strategy, which took a string like "0|1" and read it as runes (ascii code points) '0', '|', and '1'. This allowed us to avoid splitting on "|" and to also allow us to not split on ":" should there be metadata about the genotypes (e.g. GT:AD).

Now, we do the necessary splitting, and evaluate each allele as a string, which may be composed of multiple runes, allowing genotypes like "100|0" to be evaluated (for a hypothetical site with >= 100 alternate alleles).

We still keep a fast path for biallelic sites, entirely avoiding the need to do any processing besides exact match on the most common genotypes (0|0, 0/0, 1|0, 1/0, 0/1, 0|1).

Addresses #7

@akotlar akotlar requested a review from cristinaetrv January 4, 2024 02:19
@akotlar akotlar requested a review from wingolab January 4, 2024 02:41
wingolab
wingolab previously approved these changes Jan 8, 2024
Copy link
Collaborator

@wingolab wingolab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there's a test case for the multiallelic with > 10 sites. Logic seems fine.

FYI, I needed to do the following to test it locally.

go mod init github.com/bystrogenomics/bystro-vcf
go mod tidy
go test -v ./...

continue
}

// We don't support haploid genotypes very well; I will count such sites
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the new version going to represent haploid genotypes as haploid or still homozygous?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will still be homozygous; in the for loop, each genotype value adds 1 count to the genotype count; if that genotype value matches the alleleNum (the multi allelic allele in the decomposed multi allelic, where alleleNum will be the only ALT allele on that line in the output of bystro-vcf), then alt also gets a count of 1.

In the haploid case, if the only genotype value matches the alleleNum, that site will be homozygous/hemizygous, since altCount == 1 and gtCount == 1

@akotlar
Copy link
Collaborator Author

akotlar commented Jan 9, 2024

I see there's a test case for the multiallelic with > 10 sites. Logic seems fine.

FYI, I needed to do the following to test it locally.

go mod init github.com/bystrogenomics/bystro-vcf
go mod tidy
go test -v ./...

Ah, I forgot to add the go.mod and go.sum, thanks

Copy link
Collaborator

@cristinaetrv cristinaetrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

@akotlar akotlar merged commit 6c736a2 into master Jan 9, 2024
@akotlar akotlar deleted the feature/more-than-10-alleles branch January 9, 2024 23:59
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants