Skip to content

Deal with missing data in stats #287

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
petrelharp opened this issue Aug 6, 2019 · 7 comments
Open

Deal with missing data in stats #287

petrelharp opened this issue Aug 6, 2019 · 7 comments
Labels
enhancement New feature or request
Milestone

Comments

@petrelharp
Copy link
Contributor

The possibility of missing data is being added in #272; the stats need to be modified to take this into account. At least two issues:

  1. Site stats: isolated samples will be assumed to have the ancestral allele, wrongly.
  2. All stats: the denominators should count up the number of nonmissings.

A nice thing about the current definitions is that many stats won't need modifying the numerator to account for missing data.

@jeromekelleher jeromekelleher added the enhancement New feature or request label Sep 29, 2020
@hyanwong
Copy link
Member

hyanwong commented Dec 8, 2021

Note that we might need to do something with isolated samples with a mutation above them (which are not meant to be missing). See #2037 (comment)

@hyanwong
Copy link
Member

hyanwong commented Dec 8, 2021

Should we currently error out if we detect missing data in a stats calculation, until this issue is fixed?

@jeromekelleher jeromekelleher added this to the Python 0.4.1 milestone Dec 9, 2021
@jeromekelleher
Copy link
Member

We probably should, putting into tskit 0.4.1

@hyanwong
Copy link
Member

hyanwong commented Jul 13, 2023

A demo example from https://github.com/hyanwong/ancestor-PCA/issues/1: you might hope to get the same pairwise information out of all these 3 tree sequences (in the last, each pair occurs in each tree)

image

@hyanwong
Copy link
Member

hyanwong commented Jul 14, 2023

Revisiting this, also the branch-length distance between two samples that are isolated from each other in the topology should (IMO) be NaN or infinity. But currently, e.g. two isolated samples have a distance of 0 between them:

empty_ts = tskit.Tree.generate_comb(3).tree_sequence.delete_intervals([[0, 1]])
assert empty_ts.divergence([[0],[1]], mode="branch") == 0

This presumably applies to a number of other branch-length stats too.

@hyanwong
Copy link
Member

I'm hitting this again when working with the new 1000 genome inferred tree sequences:

import tszip
ts = tszip.decompress("1kgp-chr20p-filterTrimmedWithCpg-mm0-post-processed.trees.tsz")
print(ts.diversity(), ts.trim().diversity())

Gives the following (the second number is the correct one)

0.00035674648799140446 0.0008991582285975136

The first number is over 50% smaller than it should be because the chr20p tree sequence only covers 40% of the total sequence length. This is pretty confusing, IMO. How difficult would it be to raise an error if any of the trees have no edges?

@jeromekelleher
Copy link
Member

How difficult would it be to raise an error if any of the trees have no edges?

It's not immediately obvious to me how it should be done, but you're right we should be raising an error for this.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants