-
Notifications
You must be signed in to change notification settings - Fork 76
Deal with missing data in stats #287
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Comments
Note that we might need to do something with isolated samples with a mutation above them (which are not meant to be missing). See #2037 (comment) |
Should we currently error out if we detect missing data in a stats calculation, until this issue is fixed? |
We probably should, putting into tskit 0.4.1 |
A demo example from https://github.com/hyanwong/ancestor-PCA/issues/1: you might hope to get the same pairwise information out of all these 3 tree sequences (in the last, each pair occurs in each tree) |
Revisiting this, also the branch-length distance between two samples that are isolated from each other in the topology should (IMO) be NaN or infinity. But currently, e.g. two isolated samples have a distance of 0 between them: empty_ts = tskit.Tree.generate_comb(3).tree_sequence.delete_intervals([[0, 1]])
assert empty_ts.divergence([[0],[1]], mode="branch") == 0 This presumably applies to a number of other branch-length stats too. |
I'm hitting this again when working with the new 1000 genome inferred tree sequences: import tszip
ts = tszip.decompress("1kgp-chr20p-filterTrimmedWithCpg-mm0-post-processed.trees.tsz")
print(ts.diversity(), ts.trim().diversity()) Gives the following (the second number is the correct one)
The first number is over 50% smaller than it should be because the chr20p tree sequence only covers 40% of the total sequence length. This is pretty confusing, IMO. How difficult would it be to raise an error if any of the trees have no edges?
|
It's not immediately obvious to me how it should be done, but you're right we should be raising an error for this. |
The possibility of missing data is being added in #272; the stats need to be modified to take this into account. At least two issues:
A nice thing about the current definitions is that many stats won't need modifying the numerator to account for missing data.
The text was updated successfully, but these errors were encountered: