
How do I use from_data for ancestral annotation? #13

Open
percyfal opened this issue Dec 3, 2024 · 7 comments

Labels
enhancement New feature or request

Comments

@percyfal

percyfal commented Dec 3, 2024

Hi,

I want to apply ancestral annotation to a dataset stored in VCF Zarr format. Following the documentation, one way to achieve this is to prepare the data myself and use the from_data function. I'm slightly confused about some of the options, though, and I hope you can help me out.

IIUC, for each site I need to pass n_major, major_base, minor_base, outgroup_bases, and n_ingroups to from_data. My dataset consists of some 500 samples, and I set n_ingroups=10. Do I understand correctly that I need to do the subsampling for each site myself? What I'm doing is the following: for each site, I have genotype calls in numpy arrays of shape (samples, ploidy) that I flatten. I then subsample the flattened list down to n_ingroups alleles. From this subsample I determine n_major, major_base, and minor_base. Consequently, no probabilistic sampling will occur later on, as this subsample is treated as a fixed observation for the site. Is this the correct interpretation?
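Concretely, per site I do something like this (just a minimal numpy sketch with toy data; variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(42)
n_ingroups = 10

# toy genotype calls for a single site, shape (samples, ploidy)
calls = np.array([
    ["A", "A"], ["A", "C"], ["C", "C"],
    ["A", "A"], ["A", "A"], ["A", "C"],
])

# flatten to a list of alleles and subsample n_ingroups haploid calls without replacement
alleles = calls.ravel()
subsample = rng.choice(alleles, size=n_ingroups, replace=False)

# summarise the subsample into the per-site quantities passed to from_data
bases, counts = np.unique(subsample, return_counts=True)
order = np.argsort(counts)[::-1]
major_base = bases[order[0]]
n_major = int(counts[order[0]])
minor_base = bases[order[1]] if len(bases) > 1 else None
```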

Another thought I had was to utilize information from multiple outgroup samples from the same species. Currently I'm selecting one individual as the outgroup sample, but I guess one could also sample an outgroup allele probabilistically from a number of outgroup samples, reducing any bias introduced by having selected an outgroup individual with a heterozygous call (or even a homozygous ALT call) at a site where all other individuals are homozygous REF. This was actually how I first interpreted how the outgroups should be defined (I admittedly read the docs poorly...), but I realized something was wrong when the optimization step failed to converge (I have 10 outgroup individuals).
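Something along these lines is what I had in mind (toy sketch only; outgroup_calls is a made-up per-site array of outgroup genotypes):

```python
import numpy as np

rng = np.random.default_rng()

# toy genotype calls of the outgroup individuals at one site, shape (outgroup_samples, ploidy)
outgroup_calls = np.array([["A", "A"], ["A", "G"], ["A", "A"]])

# draw a single outgroup allele at random instead of committing to one individual
outgroup_base = rng.choice(outgroup_calls.ravel())
print(outgroup_base)
```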

Any help would be much appreciated.

Cheers,

Per

@percyfal
Author

percyfal commented Dec 3, 2024

I have a follow-up question to this: how would I specify n_target_sites in this setting? The Zarr archive only contains polymorphic sites.

@Sendrowski
Owner

Hej Per,

The easiest option would be to convert to a normal VCF file and then use Annotator as described here. The ingroup and outgroup subsampling is done automatically in this case. Using MaximumLikelihoodAncestralAnnotation.from_data would be a bit cumbersome. Outgroup genotypes are subsampled randomly per site and outgroup individual. The hope is that there are not too many shared polymorphisms, so that most outgroup individuals will be homozygous.
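From memory, that workflow looks roughly like this (untested sketch; substitute your own file paths and outgroup sample names, and see the docs for the full set of parameters):

```python
from fastdfe import Annotator, MaximumLikelihoodAncestralAnnotation

ann = Annotator(
    vcf="ingroups_and_outgroups.vcf.gz",  # VCF containing both ingroup and outgroup samples
    output="annotated.vcf.gz",            # annotated VCF with the inferred ancestral alleles
    annotations=[
        MaximumLikelihoodAncestralAnnotation(
            outgroups=["outgroup1", "outgroup2"],  # sample names of the outgroups in the VCF
            n_ingroups=10,                         # size of the per-site ingroup subsample
        )
    ],
)

ann.annotate()
```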

I recommend using at least 2 outgroup individuals which are not very closely related to each other. Power is rather weak when using only one.

As for the n_target_sites argument, there is a description in the API reference of MaximumLikelihoodAncestralAnnotation. It is mostly important to choose a sufficiently large value (e.g., 100 times the number of polymorphic sites). This is to make sure we don't infer a lot of repeat mutations when determining the most likely ancestral alleles (i.e., we avoid assuming an unrealistically high mutation rate, which would be implied if all sites were treated as polymorphic).

I hope this helps
Janek

@percyfal
Author

percyfal commented Dec 3, 2024

Hi,

thanks for the feedback. I should have mentioned that I have already run the annotation successfully on VCF input, but I would like to have a working implementation for VCF Zarr. One reason is that there are downstream methods, such as ARG-based inference (e.g., tsinfer), that take VCF Zarr as input, and Zarr does confer some advantages over VCF. Applying the annotation to Zarr input obviates the need to generate new VCF files; the output can be stored in the analysis-ready Zarr archive.

I guess one route I could try would be to model an annotator class on MaximumLikelihoodAncestralAnnotation but take Zarr as input. It should mainly be a matter of iterating over sites, converting them to cyvcf2.cyvcf2.Variant, and passing them to _parse_variant. I'll see if I can come up with a solution.
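As a very rough sketch of what I mean (FakeVariant is a stand-in I invented, since cyvcf2.Variant can't be constructed directly; the Zarr field names follow the VCF Zarr spec as far as I know, and the attribute list is just my guess at what _parse_variant touches):

```python
from dataclasses import dataclass
import numpy as np
import zarr

@dataclass
class FakeVariant:
    # hypothetical stand-in for cyvcf2.cyvcf2.Variant, exposing only the
    # attributes I expect _parse_variant to need
    CHROM: str
    POS: int
    REF: str
    ALT: list
    gt_bases: np.ndarray

root = zarr.open("calls.vcz", mode="r")  # path is just an example

contig_ids = root["contig_id"][:]
variant_contig = root["variant_contig"][:]
positions = root["variant_position"][:]
alleles = root["variant_allele"][:]
genotypes = root["call_genotype"][:]

for i in range(len(positions)):
    site_alleles = alleles[i]
    gts = genotypes[i]  # shape (samples, ploidy), -1 encodes missing calls
    gt_bases = np.array([
        "/".join(site_alleles[a] if a >= 0 else "." for a in gt) for gt in gts
    ])
    variant = FakeVariant(
        CHROM=str(contig_ids[variant_contig[i]]),
        POS=int(positions[i]),
        REF=str(site_alleles[0]),
        ALT=[str(a) for a in site_alleles[1:] if a],
        gt_bases=gt_bases,
    )
    # ... here the record would be passed to something like _parse_variant
```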

@percyfal
Author

percyfal commented Dec 3, 2024

When you say "2 outgroup individuals" - can they be from the same population, or are you referring to evolutionarily distant species/subspecies?

@Sendrowski
Owner

> Hi,
>
> thanks for the feedback. I should have mentioned that I have already run the annotation successfully on VCF input, but I would like to have a working implementation for VCF Zarr. One reason is that there are downstream methods, such as ARG-based inference (e.g., tsinfer), that take VCF Zarr as input, and Zarr does confer some advantages over VCF. Applying the annotation to Zarr input obviates the need to generate new VCF files; the output can be stored in the analysis-ready Zarr archive.
>
> I guess one route I could try would be to model an annotator class on MaximumLikelihoodAncestralAnnotation but take Zarr as input. It should mainly be a matter of iterating over sites, converting them to cyvcf2.cyvcf2.Variant, and passing them to _parse_variant. I'll see if I can come up with a solution.

I see! You could probably subclass MaximumLikelihoodAncestralAnnotation to add this functionality. Alternatively, from_data and from_dataframe could indeed be easier than mocking cyvcf2.cyvcf2.Variant. Unfortunately, probabilistic subsampling is not supported by these methods (cf. _subsample_site). One more consideration is that these two methods don't provide genomic positions, so _contig_bounds will have to be maintained to know which genomic intervals to sample monomorphic sites from. It could be worthwhile including support for VCF Zarr, should it become more widespread. A general implementation creating cyvcf2.VCF directly from VCF Zarr would also be interesting.
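Schematically, the from_data route would be something like this (untested sketch off the top of my head, with toy data; the argument names follow your description above, so please double-check the expected types in the API reference):

```python
from fastdfe import MaximumLikelihoodAncestralAnnotation

# toy per-site data; in practice these arrays would be assembled from the Zarr archive,
# e.g. with the per-site subsampling sketched earlier in this thread
n_major = [8, 10, 7]                                   # major allele count in the ingroup subsample
major_base = ["A", "C", "T"]                           # major allele per site
minor_base = ["G", None, "C"]                          # minor allele per site (None if monomorphic)
outgroup_bases = [["A", "A"], ["C", "T"], ["T", "T"]]  # one base per outgroup individual per site

anc = MaximumLikelihoodAncestralAnnotation.from_data(
    n_major=n_major,
    major_base=major_base,
    minor_base=minor_base,
    outgroup_bases=outgroup_bases,
    n_ingroups=10,
)
# n_target_sites would also need to be set (see the note above on choosing a
# sufficiently large value); I'm leaving it out of this sketch
```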

> When you say "2 outgroup individuals" - can they be from the same population, or are you referring to evolutionarily distant species/subspecies?

They should be two evolutionarily distant species. Otherwise the estimated branch length / rate separating the two outgroups will be very short / low, providing little extra information.

@percyfal
Author

percyfal commented Dec 4, 2024

I think having a general implementation that iterates a VCF Zarr archive and generates a cyvcf2.VCF object would be the best way to go then, as that would hopefully reduce the number of changes needed in fastDFE to accommodate a new input type. MaximumLikelihoodAncestralAnnotation._reader could then be VCF | ZarrHandler | None, where ZarrHandler iterates over sites and returns cyvcf2.VCF. Zarr would then only be supported by this annotation class, and I guess you would want the entire package to support Zarr if this route is taken. I'll see if I can whip together an implementation for this particular use case as a starting point.
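As a first stab, I'm picturing a skeleton like this (ZarrHandler and its methods are entirely my own invention at this point):

```python
from typing import Iterator
import zarr

class ZarrHandler:
    """Hypothetical reader that walks a VCF Zarr store and yields per-site,
    cyvcf2.Variant-like records, intended as an alternative backend for
    MaximumLikelihoodAncestralAnnotation._reader."""

    def __init__(self, path: str):
        self.root = zarr.open(path, mode="r")

    def __iter__(self) -> Iterator:
        n_sites = self.root["variant_position"].shape[0]
        for i in range(n_sites):
            yield self._make_variant(i)

    def _make_variant(self, i: int):
        # build a cyvcf2.Variant-like record for site i, e.g. along the lines
        # of the FakeVariant sketch above
        raise NotImplementedError
```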

@Sendrowski
Owner

Yes, I agree! Such an implementation would be useful in general and increase compatibility with VCF Zarr. I can also allow for a cyvcf2.VCF handle wherever a VCF file path can be specified. This should then also work for mocked VCF readers, but it would also require extending fastdfe.io_handlers.VCFHandler somewhat.
