Skip to content

Weird behaviour of ts.diversity with a single sample #2037

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
hyanwong opened this issue Dec 8, 2021 · 5 comments
Closed

Weird behaviour of ts.diversity with a single sample #2037

hyanwong opened this issue Dec 8, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@hyanwong
Copy link
Member

hyanwong commented Dec 8, 2021

It should be impossible to calculate the diversity of a single sample, because we need to pick at least one pair of distinct samples to calculate it. So this is expected:

ts = msprime.sim_ancestry(10, ploidy=1)
ts = msprime.sim_mutations(ts, rate=0.1)
print(ts.diversity([0], mode="branch"))  # prints "array(nan)"

But looks what happens if there's only one sample in the ts:

ts = msprime.sim_ancestry(1, ploidy=1)
ts = msprime.sim_mutations(ts, rate=0.1)
print(ts.diversity([0], mode="branch"))  # prints "array(0.)"  <- er, what?
@petrelharp
Copy link
Contributor

Doesn't this fall into the category of "stats don't deal with missing data yet"? Since a ts with one sample is all missing data, by definition?

@jeromekelleher
Copy link
Member

This has two samples though @petrelharp, msprime is diploid by default now

@hyanwong
Copy link
Member Author

hyanwong commented Dec 8, 2021

This has two samples though @petrelharp, msprime is diploid by default now

Ah - not in this example (I set ploidy to 1). But yes, maybe that's why. Although presumably it has a mutation above it, so is not strictly "missing"?

I wonder if it works if there's topology and a unary root above the sample, though.

@jeromekelleher
Copy link
Member

Doh, @petrelharp is right of course. This is in the realm of missing data, and we don't deal with missing data at all in the stats framework.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 8, 2021

I wonder if it works if there's topology and a unary root above the sample, though.

Yes, it's fine if there's topology above the single sample node. It's not fine, however, if the node is isolated, but with a mutation above it (i.e. is not missing):

ts = msprime.sim_ancestry(2, ploidy=1, random_seed=5)
ts = msprime.sim_mutations(ts, rate=1, random_seed=4).simplify([0], keep_input_roots=False, filter_sites=False)
print(ts.draw_text())
print(ts.num_sites, ts.num_mutations, [v.genotypes for v in ts.variants()])
print(ts.diversity([0], mode="branch"))  # <- should be nan

gives

0.00┊ 0 ┊
    0   1

1 2 [array([2], dtype=int8)]
0.0

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants