Pair coalescence rates over time when there are polytomies #3125

hyanwong · 2025-03-30T09:04:40Z

hyanwong
Mar 30, 2025
Maintainer

@nspope developed the nice routines that calculate pair coalescence rates over time. However, when run on an inferred tree sequence (especially one dated with match_segregating_sites=False), every coalescence within a polytomy in the local trees is attributed to the single polytomy node, i.e. to the time of the oldest coalescence in that polytomy, and hence the rates do not fit expectations from genetic diversity, etc.

On a tree-by-tree basis, I wonder if it would be possible to "smooth" the coalescence rates in the presence of polytomies, such that the rate at the polytomy is smooshed towards more recent times, using the coalescent as a model. More specifically, take the shortest edge under a polytomy, and distribute the rate exponentially from the child time to the parent time, rather than putting all the weight at the parent.

E.g. in the following tree, instead of putting all the coalescences for node 13 at time 2, the average coalescence rate would be distributed between times 1 and 2 as if there were a 4-tip coalescent process between those times.

If we are only worried about rates, and not which lineages are actually coalescing at any one point, this seems like a reasonable thing to do. I wonder if the adjustment to the rate calculator is an easy one to make or not (I'm not sure what the algorithm would be to do this over the entire tree sequence: I can only picture it tree-by-tree at the moment, so it could be rather slow).
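As a rough illustration of the smoothing idea above, here is a minimal sketch in plain Python. It is not part of tskit: the function name `smooth_polytomy` and its arguments are hypothetical. It assumes a standard (Kingman) coalescent among the k child lineages, places the k-1 binary events at their expected relative waiting times (rescaled so the last event lands exactly on the parent time), and treats each child as a single lineage, ignoring how many samples it subtends.

```python
from fractions import Fraction
from math import comb


def smooth_polytomy(k, t_child, t_parent):
    """Spread the pair coalescences of a k-child polytomy over k-1 binary events.

    Returns a list of (event_time, pair_probability) tuples, where
    pair_probability is the chance that a given pair of child lineages
    first shares an ancestor at that event (assumption: Kingman coalescent).
    """
    if k < 2:
        raise ValueError("a coalescence needs at least two child lineages")
    span = Fraction(t_parent) - Fraction(t_child)
    # Expected waiting time while j lineages remain is proportional to 1 / C(j, 2).
    waits = [Fraction(1, comb(j, 2)) for j in range(k, 1, -1)]
    total_wait = sum(waits)
    events = []
    elapsed = Fraction(0)
    not_yet = Fraction(1)  # probability that the pair has not coalesced so far
    for j, wait in zip(range(k, 1, -1), waits):
        elapsed += wait
        time = Fraction(t_child) + span * elapsed / total_wait
        p_here = not_yet / comb(j, 2)  # the pair is the one chosen among C(j, 2)
        events.append((time, p_here))
        not_yet -= p_here
    return events


# A 4-child polytomy whose children are at time 1 and whose parent is at time 2.
for time, prob in smooth_polytomy(4, 1, 2):
    print(time, prob)
# 10/9 1/6
# 4/3 5/18
# 2 5/9
```

So instead of all the pair weight sitting at time 2, one sixth of it would move to time 10/9 and five eighteenths to time 4/3. Whether rescaling the expected waiting times like this is the right model is exactly the kind of decision that would need working out.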

hyanwong · 2025-03-30T10:58:14Z

hyanwong
Mar 30, 2025
Maintainer Author

If we could also incorporate variance into the smooshed coalescent time estimates, and also uncertainty in the node times, we would have a way of integrating over dating uncertainty and, to some extent, topological uncertainty too (because of the polytomies).

0 replies

nspope · 2025-03-30T19:14:33Z

nspope
Mar 30, 2025
Collaborator

I'm definitely sympathetic to trying to improve estimates where there are polytomies. However, I'm not too keen on the idea of complexifying the current coalescence rate API. The idea here is foremost to provide the raw ingredients for coalescence rates via pair_coalescence_counts -- that is: the expected number of coalescing pairs per node, with expectation taken over the sequence. These can be used as weights in an empirical approximation to the PDF of pair coalescence times (e.g. if you take each node time as a weighted observation). The pair_coalescence_rates routine is a wrapper over this that computes a simple, transparent estimator of the rate in time windows.
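To make the "weighted observation" idea concrete, here is a minimal sketch in plain Python. The helper name `weighted_ecdf` is hypothetical and not a tskit function; the inputs `node_times` and `node_weights` are assumed to be per-node times and the matching node-level weights from pair_coalescence_counts, passed in as plain sequences.

```python
def weighted_ecdf(node_times, node_weights):
    """Empirical CDF of pair coalescence times from weighted node times.

    Each node time is treated as one observation carrying its weight.
    Returns a list of (time, cumulative_probability) sorted by time.
    """
    total = sum(node_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Nodes with zero weight (e.g. sample nodes) carry no information.
    observations = sorted(
        (t, w) for t, w in zip(node_times, node_weights) if w > 0
    )
    cdf = []
    running = 0.0
    for t, w in observations:
        running += w
        if cdf and cdf[-1][0] == t:
            cdf[-1] = (t, running / total)  # merge nodes that share a time
        else:
            cdf.append((t, running / total))
    return cdf


# Toy input: two sample nodes at time 0, then three internal nodes.
print(weighted_ecdf([0, 0, 1, 2, 2], [0, 0, 1, 2, 3]))
```

The step heights of this CDF are the empirical probability mass at each distinct node time, which is the raw material any smoother or rate estimator downstream would start from.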

You could imagine that much better estimators might come from taking the raw weights and smoothing them somehow (like fitting a weighted KDE and evaluating the survival function), but these will of course involve modelling decisions. And this could be done by downstream functions using the output of pair_coalescence_counts (because you can get node-level outputs from pair_coalescence_counts that don't depend on assigned times whatsoever).
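As one hedged example of such a downstream estimator, the sketch below fits a weighted Gaussian KDE to log node times and divides the density by the survival function to get a rate. Everything here is an assumption made for illustration: the function name `kde_coalescence_rate`, the choice of a Gaussian kernel on the log scale, and the fixed `bandwidth` are modelling decisions, not anything tskit provides.

```python
import math


def kde_coalescence_rate(query_times, node_times, node_weights, bandwidth=0.3):
    """Smoothed pair coalescence rate: KDE density divided by KDE survival.

    The KDE is fitted on log(time), so query and node times must be positive;
    nodes with zero weight or zero time are skipped.
    """
    obs = [
        (math.log(t), w)
        for t, w in zip(node_times, node_weights)
        if w > 0 and t > 0
    ]
    if not obs:
        raise ValueError("need at least one positively weighted node with time > 0")
    norm = bandwidth * math.sqrt(2 * math.pi)
    rates = []
    for t in query_times:
        x = math.log(t)
        density = 0.0   # unnormalised density of log(time) at x
        survival = 0.0  # unnormalised probability of coalescing later than t
        for xi, w in obs:
            z = (x - xi) / bandwidth
            density += w * math.exp(-0.5 * z * z) / norm
            survival += w * 0.5 * math.erfc(z / math.sqrt(2))
        # Dividing by t converts the density of log(time) to a density of time;
        # the total weight cancels in the ratio.
        rates.append(density / (t * survival) if survival > 0 else math.nan)
    return rates


print(kde_coalescence_rate([0.5, 1.0, 2.0], [1.0, 2.0], [1.0, 3.0]))
```

Because only the ratio of density to survival is used, the weights do not need to be normalised first. The bandwidth controls how far weight at a node leaks to neighbouring times, so it would need tuning or a data-driven choice in practice.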

So my opinion is that the focus of the tskit pair_coalescence_XXXX API should be on providing simple/transparent statistics that may be used by separate packages downstream, that might have more specialized use cases (like: more accurate rate estimation where there are polytomies, by fitting a KDE or model or what-have-you using the weights produced by pair_coalescence_counts). I'm happy to help work out these downstream methods. (And, I feel like a good first step is making some documentation that shows how pair_coalescence_counts may be used to approximate the pair coalescence time PDF).

3 replies

nspope Mar 30, 2025
Collaborator

As an addendum: the way pair_coalescence_XXXX works internally is by computing the per-node (or per-time-window) counts in an incremental way, and applying a summary function func(node_times, node_counts) at the end of each genomic window to collapse the output (as this could be unreasonably large if the desired summary needs to act on the per-node times/weights). We'd talked about allowing an arbitrary summary function to be passed from the Python API (like with the general_stat interface), which would facilitate things like the smoothing you describe above. But this will take quite a bit of work.
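For illustration only, here is what a user-supplied summary function with the func(node_times, node_counts) shape might look like. This is a hypothetical sketch in plain Python, not the internal tskit implementation: `make_rate_summary` and its `breaks` argument are invented names, and the estimator is a generic piecewise-constant hazard that assumes finite time windows and that no node is older than the last break or younger than the first.

```python
import math


def make_rate_summary(breaks):
    """Build a func(node_times, node_counts) that returns one rate per time window."""

    def func(node_times, node_counts):
        # Pairs that have not yet coalesced at the start of the first window.
        surviving = float(sum(node_counts))
        rates = []
        for left, right in zip(breaks[:-1], breaks[1:]):
            coalescing = sum(
                c for t, c in zip(node_times, node_counts) if left <= t < right
            )
            if surviving <= 0:
                rates.append(math.nan)  # nothing left to coalesce
            elif coalescing >= surviving:
                rates.append(math.inf)  # every remaining pair coalesces here
            else:
                # Constant hazard r over the window: S(right) / S(left) = exp(-r * width)
                width = right - left
                rates.append(-math.log(1 - coalescing / surviving) / width)
            surviving -= coalescing
        return rates

    return func


summary = make_rate_summary([0, 2, 4, 6])
print(summary([1, 3], [1.0, 1.0]))
```

A smoothing step, such as redistributing the counts of polytomy nodes to younger times, could in principle be applied to node_times and node_counts inside such a function before the windows are evaluated.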

hyanwong Mar 31, 2025
Maintainer Author

Thanks Nate. I wasn't necessarily suggesting changing the existing API / algorithm. As you suggest, it might be possible to figure out how the values returned from the current method could be used to provide the sort of downwards smoothing I describe. As you say, this is also pretty model-based, and would take some work to figure out, IMO. But I definitely think it's an idea worth exploring theoretically.

nspope Mar 31, 2025
Collaborator

Certainly -- I think starting with tree-by-tree is the way to go; then it may be obvious how to refactor into an incremental algorithm (as well as giving an indication if the idea works or not).
