Skip to content

Multi-way stats don't account for (self, self) pairs #2038

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
hyanwong opened this issue Dec 8, 2021 · 2 comments · Fixed by #2040
Closed

Multi-way stats don't account for (self, self) pairs #2038

hyanwong opened this issue Dec 8, 2021 · 2 comments · Fixed by #2040
Labels
documentation Documentation

Comments

@hyanwong
Copy link
Member

hyanwong commented Dec 8, 2021

The docs for ts.divergence say it gives "the average across distinct, randomly chosen pairs of chromosomes...." (emphasis mine), and also say ""Note that computing the divergence of a population to itself gives the mean pairwise nucleotide diversity within that population". But I think this isn't being done when a population is compared against itself, at least when mode="branch"

ts = msprime.sim_ancestry(10)
ts = msprime.sim_mutations(ts, rate=0.1)
assert ts.divergence([ts.samples(), ts.samples()], mode="branch") == ts.diversity(ts.samples(), mode="branch")  # Fails

And also if there's only one sample in the set, and it's compared against itself, it should give NaN (as diversity does, modulo the bug at #2037 ). But it doesn't:

print([
    ts.divergence([[0], [0]], mode="branch"),
    ts.diversity([0], mode="branch"),
])  # gives [0.0, nan]

First mentioned at https://github.com/tskit-dev/tskit/discussions/2035

@hyanwong hyanwong added the bug Something isn't working label Dec 8, 2021
@petrelharp
Copy link
Contributor

Ok, first note that if you compute divergence of a sample set to itself then you do indeed get nan:

ts.divergence([[0], [0]], indexes=[(0, 0), (0, 1)])
# array([nan,  0.])

... so, we are not checking for overlap between distinct sample sets and removing the self comparisons from those. I agree, this is not exactly what is implied by what's in that sentence.

This was intentional: otherwise, what you have is not a function of the allele counts in each sample set. For instance, if you wanted to compute "divergence" between [0, 1, 2] and [1, 2, 3] then what you'd need to do, effectively, is compute divergence([[0], [1, 2], [3]]) and diversity([[1, 2]]) and then combine them appropriately. Since AFAIK, the only time people compute divergence between overlapping sample sets is when those sample sets are identical (i.e., they are computing 'diversity'), this is way too much unneeded complexity.

So - perhaps just a clarification in the docs is needed? Would it work to just delete the word "distinct" and rely on the earlier note saying that computing divergence of a pop to itself gives diversity?

@petrelharp petrelharp added documentation Documentation and removed bug Something isn't working labels Dec 8, 2021
@hyanwong
Copy link
Member Author

hyanwong commented Dec 8, 2021

Ooo, this is rather tricky isn't it. I'll see if I can word something, but I suspect a note somewhere to clarify might help. Something along the lines of

"Note that comparing a sample set to itself by setting indexes =[(i, i)] is not quite the same as comparing a sample set to itself by specifying sample_sets=[set_A, set_A]. In the first case, when the same sample set indexes are specified, pairs of distinct samples are taken (like when calculating diversity). In the second case, the samples need not be distinct, and so the mean divergence will also include pairs where a sample is compared against itself (the divergence for that pair being 0)."

Or something like that

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
documentation Documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants