How do I use from_data for ancestral annotation? #13
Comments
I have a followup question to this: how would I specify |
Hej Per, The easiest options would be to convert to a normal VCF file and then to use I recommend using at least 2 outgroup individuals which are not very closely related to each other. Power is rather weak when using only one. As for the I hope this helps |
Hi, thanks for the feedback. I should have mentioned that I have already successfully run the annotation on VCF input, but I would like to have a working implementation for VCF Zarr. One reason is that there are downstream methods, such as ARG-based inference (e.g., tsinfer), that take VCF Zarr as input, and Zarr does confer some advantages over VCF. Applying the annotation to Zarr input obviates the need to generate new VCF files; the output can be stored in the analysis-ready Zarr archive. I guess one route I could try would be to model an annotator class on MaximumLikelihoodAncestralAnnotation but take Zarr as input. It should mainly be a matter of iterating over sites and converting them to |
When you say "2 outgroup individuals", can they be from the same population, or are you referring to evolutionarily distant species/subspecies?
I see! You could probably subclass
They should be two evolutionarily distant species. Otherwise the estimated branch length / rate separating the two outgroups will be very short / low, providing little extra information.
I think having a general implementation to iterate a VCF Zarr archive and generate a |
Yes, I agree! Such an implementation would be useful in general and increase compatibility of VCF Zarr. I can also allow for a |
Hi,
I want to apply ancestral annotation to a dataset stored in VCF Zarr format. Following the documentation, one way to achieve this is to prepare the data and use the `from_data` function. I'm slightly confused about some of the options though, and I hope you can help me out.
IIUC, for each site I need to pass `n_major`, `major_base`, `minor_base`, `outgroup_bases`, and `n_ingroups` to `from_data`. My dataset consists of some 500 samples, and I set `n_ingroups=10`. Do I understand correctly that I need to do the subsampling for each site myself? What I'm doing is the following: for each site, I have genotype calls in numpy arrays of shape `(samples, ploidy)` that I flatten. I then subsample the flattened list to length `n_ingroups`. From this subsample I then determine `n_major`, `major_base`, and `minor_base`. Therefore, no probabilistic sampling will occur later on, as this subsample will be treated as a fixed observation for this site. Is this the correct interpretation?
Another thought I had was to utilize information from multiple outgroup samples from the same species. Currently I'm selecting one individual as the outgroup sample, but I guess one could also probabilistically sample an outgroup allele from a number of outgroup samples, which would reduce any bias introduced by having selected an outgroup individual with a heterozygous (or even homozygous ALT) call where all other individuals are homozygous REF. This was actually how I first interpreted how the outgroups should be defined (I admittedly read the docs poorly...), but I realized something was wrong when the optimization step failed to converge (I have 10 outgroup individuals).
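To make the per-site procedure described above concrete, here is a minimal sketch of what I mean. The function names `summarize_site` and `sample_outgroup_base` are hypothetical helpers of my own, not part of any library. The sketch assumes genotype calls are allele indices with negative values marking missing calls, and that sites with more than two alleles in the subsample are skipped. How a monomorphic site should be encoded for `from_data` is not settled here, so `None` is only a placeholder for `minor_base`. The call to `from_data` itself is deliberately left out.

```python
import numpy as np


def summarize_site(genotypes, alleles, n_ingroups, rng):
    """Subsample ingroup allele calls at one site and summarize them.

    genotypes: int array of shape (samples, ploidy) holding allele indices;
               negative values are treated as missing (an assumption).
    alleles:   sequence of allele strings for this site, e.g. ("A", "G").
    Returns (n_major, major_base, minor_base), or None if the site has fewer
    than n_ingroups non-missing calls or the subsample is not mono/biallelic.
    """
    calls = np.asarray(genotypes).ravel()  # flatten (samples, ploidy)
    calls = calls[calls >= 0]              # drop missing calls
    if calls.size < n_ingroups:
        return None
    # Fixed subsample of size n_ingroups, drawn without replacement.
    sample = rng.choice(calls, size=n_ingroups, replace=False)
    values, counts = np.unique(sample, return_counts=True)
    if values.size > 2:
        return None
    # Most frequent allele first; ties go to the lower allele index.
    order = np.argsort(-counts, kind="stable")
    major_base = alleles[int(values[order[0]])]
    n_major = int(counts[order[0]])
    # Placeholder (assumption): None when the subsample is monomorphic.
    minor_base = alleles[int(values[order[1]])] if values.size == 2 else None
    return n_major, major_base, minor_base


def sample_outgroup_base(genotypes, alleles, rng):
    """Draw one allele at random from the calls of one outgroup species.

    Returns None if every call is missing.
    """
    calls = np.asarray(genotypes).ravel()
    calls = calls[calls >= 0]
    if calls.size == 0:
        return None
    return alleles[int(rng.choice(calls))]


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    ingroup = np.array([[0, 0], [0, 1], [1, 1], [0, 0], [0, -1]])
    outgroup = np.array([[0, 0], [0, 1]])
    print(summarize_site(ingroup, ("A", "G"), 6, rng))
    print(sample_outgroup_base(outgroup, ("A", "G"), rng))
```

Passing a seeded generator keeps the subsample reproducible across runs, which matters if the subsample is treated as a fixed observation per site.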
Any help would be much appreciated.
Cheers,
Per