Skip to content

Memory consumption of tree sequence statistics #647

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
molpopgen opened this issue May 26, 2020 · 8 comments
Closed

Memory consumption of tree sequence statistics #647

molpopgen opened this issue May 26, 2020 · 8 comments
Labels
Performance This issue addresses performance, either runtime or memory statistics

Comments

@molpopgen
Copy link
Member

When the output dimension of a statistic is large, so is the memory consumption.

The following example calculates the pairwise distance matrix for all samples from a single tree and requires a bit over 7GB of RAM for a small number of samples (1000).

import msprime
import numpy as np
import tskit


def pairwise_distance_branch(ts: tskit.TreeSequence, samples: np.array):
    sample_sets = []
    indexes = []
    for i in range(len(samples)):
        sample_sets.append([i])
        for j in range(i + 1, len(samples)):
            indexes.append((i, j))

    div = ts.divergence(sample_sets, indexes=indexes, mode="branch")
    return div


print(msprime.__version__)
print(msprime.tskit.__version__)
ts = msprime.simulate(1000, random_seed=12345)
div = pairwise_distance_branch(ts, [i for i in ts.samples()])

The versions are:
0.7.4
0.2.3

From talking to @petrelharp about this, it appears that some/most of the RAM use may be attributable to some memoization during the calculation that (he feels) may not be necessary?

@petrelharp
Copy link
Contributor

Here's the memory; the algorithm is described here. I think that summary can be recomputed each time it is needed, which would alleviate this problem, although would probably cause a drop in performance.

@jeromekelleher
Copy link
Member

Well, we could add an option to either store the intermediate results or recompute them. I don't think this would add too much complexity. That is, assuming there is a significant difference in performance. If not, then we should get rid of the stored results. Should be easy enough to do a quick test?

@jeromekelleher
Copy link
Member

Is this still an open issue? I think we should probably close it unless someone is intending to follow it up.

@molpopgen
Copy link
Member Author

molpopgen commented Aug 27, 2020 via email

@jeromekelleher
Copy link
Member

OK, let's keep it open.

@benjeffery benjeffery added the Performance This issue addresses performance, either runtime or memory label Sep 29, 2020
@jeromekelleher
Copy link
Member

I'm going to close this because we're addressing the general problem with pairwise statistics using a different framework now (starting from the divergence matrix in #2736)

The short version I think is that the stats API assumes that we have a relatively small number of statistics, and if we have a large number of related statistics to compute then other approaches should be used.

@petrelharp
Copy link
Contributor

I actually think it's still worth running those tests - what you say is true for pairwise stats, but there's also clasess of stats with output equal to the number of samples that would be nice to do this way; e.g. "relatedness matrix times a vector".

@petrelharp petrelharp reopened this Jul 9, 2023
@petrelharp
Copy link
Contributor

Closed in #2980.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Performance This issue addresses performance, either runtime or memory statistics
Projects
None yet
Development

No branches or pull requests

4 participants