Deal with missing data in stats #287

petrelharp · 2019-08-06T17:24:02Z

The possibility of missing data is being added in #272; the stats need to be modified to take this into account. At least two issues:

Site stats: isolated samples will be assumed to have the ancestral allele, wrongly.
All stats: the denominators should count only the non-missing samples.

A nice thing about the current definitions is that, for many stats, the numerator won't need modifying to account for missing data.
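To make the denominator point concrete, here is a minimal sketch in plain Python (not tskit's implementation). It assumes genotypes are integer-coded with -1 marking missing data, the same value as `tskit.MISSING_DATA`; the function name `derived_allele_frequency` is hypothetical.

```python
MISSING = -1  # assumed missing-data code (same value as tskit.MISSING_DATA)


def derived_allele_frequency(genotypes):
    """Frequency of non-ancestral alleles at one site, ignoring missing samples.

    Returns NaN when every sample is missing, since the frequency is undefined.
    """
    observed = [g for g in genotypes if g != MISSING]
    if not observed:
        return float("nan")
    derived = sum(1 for g in observed if g != 0)
    return derived / len(observed)


# Five samples, two of them missing.
site = [0, 1, 1, MISSING, MISSING]

# Treating missing samples as ancestral: 2 derived out of 5.
naive = sum(1 for g in site if g > 0) / len(site)

# Counting only non-missing samples: 2 derived out of 3.
corrected = derived_allele_frequency(site)
```

The numerator (two derived alleles) is the same in both cases; only the denominator changes.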

hyanwong · 2021-12-08T23:17:36Z

Note that we might need to do something with isolated samples with a mutation above them (which are not meant to be missing). See #2037 (comment)

hyanwong · 2021-12-08T23:18:37Z

Should we currently error out if we detect missing data in a stats calculation, until this issue is fixed?

jeromekelleher · 2021-12-09T12:31:41Z

We probably should; putting this into tskit 0.4.1.

hyanwong · 2023-07-13T14:13:17Z

A demo example from https://github.com/hyanwong/ancestor-PCA/issues/1: you might hope to get the same pairwise information out of all three of these tree sequences (in the last, each pair occurs in each tree).

hyanwong · 2023-07-14T12:23:08Z

Revisiting this: the branch-length distance between two samples that are isolated from each other in the topology should also (IMO) be NaN or infinity. Currently, however, two isolated samples have a distance of 0 between them, e.g.:

import tskit
# Deleting the whole [0, 1) interval leaves a tree sequence with no edges
empty_ts = tskit.Tree.generate_comb(3).tree_sequence.delete_intervals([[0, 1]])
assert empty_ts.divergence([[0],[1]], mode="branch") == 0

This presumably applies to a number of other branch-length stats too.
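As a rough illustration of the proposed behaviour (a sketch in plain Python, not how tskit computes branch stats), the branch length separating two samples in a single tree can be found by walking up to their most recent common ancestor, returning NaN when none exists. The `parent` and `time` dictionaries and the name `branch_distance` are hypothetical.

```python
import math


def branch_distance(parent, time, a, b):
    """Total branch length separating nodes a and b in one tree.

    parent maps node -> parent node (roots are simply absent from the dict);
    time maps node -> node time. Returns NaN if a and b share no ancestor,
    e.g. when both are isolated.
    """
    # Collect a and all of its ancestors.
    ancestors_of_a = set()
    u = a
    while u is not None:
        ancestors_of_a.add(u)
        u = parent.get(u)
    # Walk up from b until we meet one of them (the MRCA), or run out of tree.
    u = b
    while u is not None and u not in ancestors_of_a:
        u = parent.get(u)
    if u is None:
        return math.nan
    return 2 * time[u] - time[a] - time[b]


# A 3-leaf comb: node 3 joins leaves 1 and 2, node 4 joins leaf 0 and node 3.
comb_parent = {1: 3, 2: 3, 0: 4, 3: 4}
comb_time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 2.0}
connected = branch_distance(comb_parent, comb_time, 0, 1)

# The same samples with every edge removed: all three are isolated.
isolated = branch_distance({}, comb_time, 0, 1)
```

Here `connected` is the path length through the root, while `isolated` is NaN rather than 0, which is the distinction being asked for above.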

hyanwong · 2024-07-10T21:26:41Z

I'm hitting this again when working with the new 1000 Genomes inferred tree sequences:

import tszip
ts = tszip.decompress("1kgp-chr20p-filterTrimmedWithCpg-mm0-post-processed.trees.tsz")
print(ts.diversity(), ts.trim().diversity())

This gives the following (the second number is the correct one):

0.00035674648799140446 0.0008991582285975136

The first number is over 50% smaller than it should be because the chr20p tree sequence only covers 40% of the total sequence length. This is pretty confusing, IMO. How difficult would it be to raise an error if any of the trees have no edges?
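The discrepancy is what span normalisation predicts when empty regions contribute zero to the numerator but still count towards the denominator. A minimal sketch in plain Python, with made-up numbers chosen to mimic 40% coverage (the tuple layout and the name `span_weighted_means` are hypothetical, and this is not tskit's code):

```python
def span_weighted_means(trees, sequence_length):
    """Span-weighted mean of a per-tree value, normalised two ways.

    trees is a list of (left, right, value, has_edges) tuples.
    Returns (mean over the full sequence length,
             mean over the span of trees that have edges).
    """
    total = sum((right - left) * value for left, right, value, _ in trees)
    covered = sum(right - left for left, right, _, has_edges in trees if has_edges)
    over_full_length = total / sequence_length
    over_covered = total / covered if covered > 0 else float("nan")
    return over_full_length, over_covered


# An empty flank covering 60% of the sequence, then one tree covering 40%.
toy_trees = [
    (0, 60, 0.0, False),   # no edges: contributes nothing to the numerator
    (60, 100, 0.001, True),
]
full, covered_only = span_weighted_means(toy_trees, sequence_length=100)
```

In this toy case `full` is 40% of `covered_only`. The two values reported above have a ratio of about 0.40, consistent with the empty flanks being counted in the denominator before `trim()` and excluded after it.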

jeromekelleher · 2024-07-11T08:19:32Z

> How difficult would it be to raise an error if any of the trees have no edges?

It's not immediately obvious to me how it should be done, but you're right we should be raising an error for this.

jeromekelleher added the enhancement New feature or request label Sep 29, 2020

hyanwong mentioned this issue Jul 8, 2021

Restructure statistics.html #1498

Open

jeromekelleher added this to the Python 0.4.1 milestone Dec 9, 2021

hyanwong mentioned this issue Nov 14, 2024

pair_coalescence_rate doesn't account for empty regions (e.g. flanks) #3053

Closed
