Multi-way stats don't account for (self, self) pairs #2038

hyanwong · 2021-12-08T17:47:41Z

The docs for ts.divergence say it gives "the average across distinct, randomly chosen pairs of chromosomes...." (emphasis mine), and also say "Note that computing the divergence of a population to itself gives the mean pairwise nucleotide diversity within that population". But I don't think this is what happens when a population is compared against itself, at least when mode="branch":

import msprime

ts = msprime.sim_ancestry(10)
ts = msprime.sim_mutations(ts, rate=0.1)
# Divergence of a sample set against a copy of itself vs. diversity within it
assert ts.divergence([ts.samples(), ts.samples()], mode="branch") == ts.diversity(ts.samples(), mode="branch")  # Fails

Also, if there's only one sample in the set and it's compared against itself, it should give NaN (as diversity does, modulo the bug at #2037). But it doesn't:

print([
    ts.divergence([[0], [0]], mode="branch"),
    ts.diversity([0], mode="branch"),
])  # gives [0.0, nan]

First mentioned at https://github.com/tskit-dev/tskit/discussions/2035


petrelharp · 2021-12-08T18:15:52Z

Ok, first note that if you compute divergence of a sample set to itself then you do indeed get nan:

ts.divergence([[0], [0]], indexes=[(0, 0), (0, 1)])
# array([nan,  0.])

... so, we are not checking for overlap between distinct sample sets and removing the self comparisons from those. I agree, this is not exactly what is implied by what's in that sentence.

This was intentional: otherwise, what you have is not a function of the allele counts in each sample set. For instance, if you wanted to compute "divergence" between [0, 1, 2] and [1, 2, 3] then what you'd need to do, effectively, is compute divergence([[0], [1, 2], [3]]) and diversity([[1, 2]]) and then combine them appropriately. Since, AFAIK, the only time people compute divergence between overlapping sample sets is when those sample sets are identical (i.e., they are computing 'diversity'), this is way too much unneeded complexity.
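To make the "combine them appropriately" step concrete, here is a minimal sketch in plain Python. It does not call tskit: the matrix `D` is a hypothetical set of toy pairwise distances standing in for whatever per-pair quantity the statistic averages, and the helpers `divergence`, `diversity`, and `distinct_divergence` are illustrative names, not library functions. It only shows that the self-pair-free average between [0, 1, 2] and [1, 2, 3] can be assembled from the disjoint pieces [0], [1, 2], [3], weighted by the number of pairs each contributes.

```python
from itertools import combinations, product

# Hypothetical symmetric pairwise distances between four samples (toy values).
D = [
    [0.0, 2.0, 4.0, 6.0],
    [2.0, 0.0, 3.0, 5.0],
    [4.0, 3.0, 0.0, 1.0],
    [6.0, 5.0, 1.0, 0.0],
]

def divergence(A, B):
    """Mean distance over all (a, b) pairs, self pairs included."""
    pairs = list(product(A, B))
    return sum(D[a][b] for a, b in pairs) / len(pairs)

def diversity(A):
    """Mean distance over unordered pairs of distinct samples."""
    pairs = list(combinations(A, 2))
    return sum(D[a][b] for a, b in pairs) / len(pairs)

def distinct_divergence(A, B):
    """Brute force: mean distance over (a, b) pairs with a != b."""
    pairs = [(a, b) for a, b in product(A, B) if a != b]
    return sum(D[a][b] for a, b in pairs) / len(pairs)

A, B = [0, 1, 2], [1, 2, 3]
X, Y, Z = [0], [1, 2], [3]  # A-only, shared, B-only

# Weight each piece by the number of ordered (a, b) pairs it contributes.
combined = (
    len(X) * len(Y) * divergence(X, Y)
    + len(X) * len(Z) * divergence(X, Z)
    + len(Y) * len(Z) * divergence(Y, Z)
    + len(Y) * (len(Y) - 1) * diversity(Y)
) / (len(A) * len(B) - len(Y))

brute_force = distinct_divergence(A, B)
```

The bookkeeping depends on how the two sets overlap, which is the extra complexity being avoided.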

So - perhaps just a clarification in the docs is needed? Would it work to just delete the word "distinct" and rely on the earlier note saying that computing divergence of a pop to itself gives diversity?

hyanwong · 2021-12-08T19:21:19Z

Ooo, this is rather tricky isn't it. I'll see if I can word something, but I suspect a note somewhere to clarify might help. Something along the lines of

"Note that comparing a sample set to itself by setting indexes=[(i, i)] is not quite the same as comparing a sample set to itself by specifying sample_sets=[set_A, set_A]. In the first case, when the same sample set index is specified twice, only pairs of distinct samples are taken (as when calculating diversity). In the second case, the samples need not be distinct, so the mean divergence also includes pairs where a sample is compared against itself (the divergence for that pair being 0)."

Or something like that
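A toy illustration of the two conventions described in that note, in plain Python rather than tskit. The distance matrix `D` and the helper names `mean_distinct_pairs` and `mean_all_pairs` are assumptions made up for this sketch; they only mimic "average over distinct pairs" versus "average over all pairs, self pairs included".

```python
import math
from itertools import combinations, product

# Hypothetical symmetric pairwise distances between three samples (toy values).
D = [
    [0.0, 2.0, 4.0],
    [2.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]

def mean_distinct_pairs(A):
    """Diversity-like: average over unordered pairs of distinct samples."""
    pairs = list(combinations(A, 2))
    if not pairs:
        return math.nan  # a single sample has no distinct pairs
    return sum(D[a][b] for a, b in pairs) / len(pairs)

def mean_all_pairs(A, B):
    """Divergence-like: average over every (a, b), self pairs included."""
    pairs = list(product(A, B))
    return sum(D[a][b] for a, b in pairs) / len(pairs)

A = [0, 1, 2]
n = len(A)
within = mean_distinct_pairs(A)        # distinct pairs only
between_copies = mean_all_pairs(A, A)  # includes the n zero-valued self pairs

single_within = mean_distinct_pairs([0])
single_between = mean_all_pairs([0], [0])
```

In this toy model the self pairs shrink the average by a factor of (n - 1) / n, and with a single sample the all-pairs average is 0.0 while the distinct-pairs average is undefined, matching the [0.0, nan] output reported at the top of the issue.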

hyanwong added the bug label Dec 8, 2021

petrelharp added the documentation label and removed the bug label Dec 8, 2021

hyanwong mentioned this issue Dec 8, 2021

Remove "random" from doc wording #2040

Merged

petrelharp closed this as completed in #2040 Dec 12, 2021
