Multi-way stats don't account for (self, self) pairs #2038
Comments
Ok, first note that if you compute divergence of a sample set to itself then you do indeed get nan:
... so, we are not checking for overlap between distinct sample sets and removing the self comparisons from those. I agree, this is not exactly what that sentence implies. This was intentional: otherwise, the statistic would no longer be a function of the allele counts in each sample set. For instance, to compute "divergence" between [0, 1, 2] and [1, 2, 3] you would effectively need to compute divergence([[0], [1, 2], [3]]) and diversity([[1, 2]]) and then combine them appropriately. Since, AFAIK, the only time people compute divergence between overlapping sample sets is when those sample sets are identical (i.e., they are computing 'diversity'), this is way too much unneeded complexity. So perhaps just a clarification in the docs is needed? Would it work to just delete the word "distinct" and rely on the earlier note saying that computing the divergence of a population to itself gives diversity?
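To make the "combine them appropriately" step concrete, here is a minimal pure-Python sketch. It does not call tskit: the distance table `DIST` and the helpers `d` and `mean_pairwise` are hypothetical stand-ins for whatever pairwise quantity the statistic averages. It compares the plain average over all pairs from [0, 1, 2] and [1, 2, 3] with the average over distinct pairs only, and then rebuilds the latter from non-overlapping pieces.

```python
import math
from itertools import product

# Hypothetical pairwise distances between four samples (toy numbers, not tskit output).
DIST = {(0, 1): 2.0, (0, 2): 4.0, (0, 3): 6.0,
        (1, 2): 4.0, (1, 3): 6.0, (2, 3): 6.0}


def d(a, b):
    """Distance between two samples; a sample is at distance 0 from itself."""
    return 0.0 if a == b else DIST[(min(a, b), max(a, b))]


def mean_pairwise(set_a, set_b, skip_self=False):
    """Mean of d over ordered pairs (a, b), optionally dropping (self, self) pairs."""
    pairs = [(a, b) for a, b in product(set_a, set_b)
             if not (skip_self and a == b)]
    if not pairs:
        return math.nan
    return sum(d(a, b) for a, b in pairs) / len(pairs)


A, B = [0, 1, 2], [1, 2, 3]
only_a, shared, only_b = [0], [1, 2], [3]

with_self = mean_pairwise(A, B)                     # 9 pairs, two of them (self, self)
without_self = mean_pairwise(A, B, skip_self=True)  # 7 pairs

# Rebuild the "distinct pairs" value from non-overlapping pieces:
# three divergence-like terms plus one diversity-like term for the shared samples,
# each weighted by the number of ordered pairs it averages over.
pieces = [
    (len(only_a) * len(only_b), mean_pairwise(only_a, only_b)),
    (len(only_a) * len(shared), mean_pairwise(only_a, shared)),
    (len(shared) * len(only_b), mean_pairwise(shared, only_b)),
    (len(shared) * (len(shared) - 1), mean_pairwise(shared, shared, skip_self=True)),
]
combined = sum(w * v for w, v in pieces) / sum(w for w, _ in pieces)

print(with_self, without_self, combined)
```

In this toy setup `combined` matches `without_self` and differs from `with_self`, which is the bookkeeping the comment above describes as unneeded complexity for the overlapping case.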
Ooo, this is rather tricky, isn't it. I'll see if I can word something, but I suspect a note somewhere to clarify might help. Something along the lines of "Note that comparing a sample set to itself by setting indexes=[(i, i)] is not quite the same as comparing a sample set to itself by specifying sample_sets=[set_A, set_A]. In the first case, when the same sample set indexes are specified, pairs of distinct samples are taken (like when calculating Or something like that
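The difference between the two ways of comparing a set to itself can be sketched with allele counts at a single biallelic site. This is a toy model, not tskit's code: the function `toy_divergence` and its `same_set` flag are hypothetical, and the formula is an assumption chosen so that the same-set case averages over distinct pairs while the two-set case averages over all pairs.

```python
import math


def toy_divergence(x_i, n_i, x_j, n_j, same_set):
    """Fraction of sampled pairs that differ at one biallelic site.

    x_* is the derived-allele count and n_* the size of each sample set.
    With same_set=True a sample cannot be paired with itself, so the second
    draw comes from n_j - 1 samples (assumed model, see lead-in).
    """
    denom = n_i * (n_j - 1 if same_set else n_j)
    if denom == 0:
        return math.nan  # no valid pairs, e.g. a single sample compared to itself
    return (x_i * (n_j - x_j) + (n_i - x_i) * x_j) / denom


# Set of three samples, two of which carry the derived allele.
same_index = toy_divergence(2, 3, 2, 3, same_set=True)    # like indexes=[(i, i)]
duplicated = toy_divergence(2, 3, 2, 3, same_set=False)   # like sample_sets=[set_A, set_A]

# A single-sample set compared with itself.
single_same = toy_divergence(1, 1, 1, 1, same_set=True)
single_dup = toy_divergence(1, 1, 1, 1, same_set=False)

print(same_index, duplicated, single_same, single_dup)
```

Under these assumptions the same-index value is larger than the duplicated-set value, because the duplicated-set average includes (self, self) pairs that contribute zero, and only the same-index case yields NaN for a single sample.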
The docs for `ts.divergence` say it gives "the average across *distinct*, randomly chosen pairs of chromosomes...." (emphasis mine), and also say "Note that computing the divergence of a population to itself gives the mean pairwise nucleotide diversity within that population". But I think this isn't being done when a population is compared against itself, at least when mode="branch".

And also, if there's only one sample in the set and it's compared against itself, it should give NaN (as `diversity` does, modulo the bug at #2037). But it doesn't.

First mentioned at https://github.com/tskit-dev/tskit/discussions/2035