-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
remove restriction of 10 alleles max in multiallelic sites #11
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there's a test case for the multiallelic with > 10 sites. Logic seems fine.
FYI, I needed to do the following to test it locally.
go mod init github.com/bystrogenomics/bystro-vcf
go mod tidy
go test -v ./...
continue | ||
} | ||
|
||
// We don't support haploid genotypes very well; I will count such sites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the new version going to represent haploid genotypes as haploid or still homozygous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will still be homozygous; in the for loop, each genotype value adds 1 count to the genotype count; if that genotype value matches the alleleNum (the multi allelic allele in the decomposed multi allelic, where alleleNum will be the only ALT allele on that line in the output of bystro-vcf), then alt also gets a count of 1.
Line 1017 in dfb9065
gtCount++
In the haploid case, if the only genotype value matches the alleleNum, that site will be homozygous/hemizygous, since altCount == 1
and gtCount == 1
Ah, I forgot to add the go.mod and go.sum, thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm!
Previously we used a fast parsing strategy, which took a string like "0|1" and read it as runes (ascii code points) '0', '|', and '1'. This allowed us to avoid splitting on "|" and to also allow us to not split on ":" should there be metadata about the genotypes (e.g. GT:AD).
Now, we do the necessary splitting, and evaluate each allele as a string, which may be composed of multiple runes, allowing genotypes like "100|0" to be evaluated (for a hypothetical site with >= 100 alternate alleles).
We still keep a fast path for biallelic sites, entirely avoiding the need to do any processing besides exact match on the most common genotypes (0|0, 0/0, 1|0, 1/0, 0/1, 0|1).
Addresses #7