Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Overflow error on 3D datasets #38

Closed
pierre-guillou opened this issue May 17, 2021 · 6 comments
Closed

Overflow error on 3D datasets #38

pierre-guillou opened this issue May 17, 2021 · 6 comments

Comments

@pierre-guillou
Copy link

Hello and thank you for your work!

I am trying to use Ripser to compute persistence diagrams of rather large 3D datasets, coming from the Open-SciVis-Datasets website.

However, Ripser throws a std::overflow_error when I try to compute persistence pairs up to the dimension 2. If I understand correctly the C++ code, this exception is related to the computation of binomial coefficients which depend on the size of the input distance matrix and the maximum dimension requested.

This issue might be related to #25 and #32 since they feature similar problems. I already opened an issue in the scikit-tda Python wrapper (I wanted to use Ripser via this Python API) but since this is more related to the C++ code, this issue seems better to belong here.

Below, you will find a Python script I used to trigger this overflow error. It takes a raw file from Open-Scivis-Datasets and iterates over the edges of the cubical complex to generate a sparse matrix in sparse triplet format that is fed to the Ripser executable. Although the diagram is computed as expected with the smallest datasets (nucleon, marschner_lobb, silicium, up to 120k vertices), the error occurs with fuel, neghip (more than 250k vertices) and every larger dataset.

It there a way to circumvent this computation of these binomial coefficients to handle large datasets (up to 1M vertices)?

Thanks in advance for your help,
Best regards,
Pierre Guillou

import argparse
import subprocess
import time

import numpy as np


def load_raw(input_raw):
    extension = input_raw.split(".")[-1]
    if extension != "raw":
        print("Need a .raw file")
        raise TypeError

    # detect extent and data type from file name
    extent, dtype = input_raw.split(".")[0].split("_")[-2:]
    extent = [int(dim) for dim in extent.split("x")]

    dtype_np = {
        "uint8": np.uint8,
        "int16": np.int16,
        "uint16": np.uint16,
        "float32": np.float32,
        "float64": np.float64,
    }
    dtype = dtype_np[dtype]

    with open(input_raw) as src:
        data = np.fromfile(src, dtype=dtype)
        return data.reshape(extent)


def compute_persistence(sparse_triplets, output_diagram, ripser_executable):
    proc = subprocess.Popen(
        [ripser_executable, "--format", "sparse", "--dim", "2"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    out, _ = proc.communicate(sparse_triplets)

    with open(output_diagram, "w") as dst:
        dst.write(out)


def build_sparse_triplets(data):
    triplets = list()
    # loop over every edge of the cubical complex to get the sparse matrix triplets
    with np.nditer(data, flags=["multi_index"]) as it:
        for v in it:
            i = np.ravel_multi_index(it.multi_index, data.shape)
            for d in range(data.ndim):
                for s in [-1, 1]:
                    coords = list(it.multi_index)
                    s = coords[d] + s
                    if s < 0 or s >= data.shape[d]:
                        continue
                    coords[d] = s
                    j = np.ravel_multi_index(coords, data.shape)
                    if i < j:  # upper triangle distance matrix
                        triplets.append([i, j, max(data.flat[j], v)])

    return "\n".join([f"{i} {j} {v}" for i, j, v in triplets])


def main(input_raw, output_diagram, ripser_executable):
    data = load_raw(input_raw)
    start = time.time()
    sparse_triplets = build_sparse_triplets(data)
    print(f"Build sparse triplets in {time.time() - start:.2f}s")
    compute_persistence(sparse_triplets, output_diagram, ripser_executable)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Wrapper around Ripser C++")
    parser.add_argument("input_raw", help="Input raw file from OpenSciVis datasets")
    parser.add_argument(
        "-o",
        "--output_diagram",
        help="Output diagram in Ripser format",
        default="out.ripser",
    )
    parser.add_argument(
        "-e",
        "--ripser_executable",
        help="Path to the Ripser executable",
        default="ripser/ripser",
    )
    args = parser.parse_args()

    main(args.input_raw, args.output_diagram, args.ripser_executable)
@MassEast
Copy link

Can you also use Ripser on high dimensional data, like 10D?

@ulupo
Copy link

ulupo commented Nov 10, 2021

Homology dimension k computations for a dataset of size N require that it be possible in principle to enumerate all (k+1)-simplices of the full simplex with N vertices. In your case (k=2) this means enumerating all 3-simplices (which are 4-tuples of vertices), and this requires that the binomial coefficient (N choose 4) not exceed 2^63 - 1 (the maximum index for signed 64-bit integers). With N=250000, N choose 4 is approximately 1.6e20, while 2^63 is about 9.2e18.

[Edit: I may be off by 1 in some of the above, but the main point remains.]

After modifying the backend to work with 128-bit integers, you should be able to run Ripser on your dataset. Both @ubauer and @MonkeyBreaker have tried this at some point AFAIK. It can be done!

@ulupo
Copy link

ulupo commented Nov 10, 2021

@MassEast the dimensionality of the data is not an obstacle, Ripser simply looks at the matrix of pairwise distances between all pairs of points in the dataset. This takes a little longer to compute in higher dimensions, but only marginally so, especially compared to the time it takes to compute persistence.

@ubauer ubauer closed this as completed Apr 27, 2022
@julien-tierny
Copy link

julien-tierny commented Apr 25, 2024

After modifying the backend to work with 128-bit integers, you should be able to run Ripser on your dataset. Both @ubauer and @MonkeyBreaker have tried this at some point AFAIK. It can be done!

Thanks @ulupo for the information.
@ubauer @MonkeyBreaker would it be possible that you post the diff to get ripser running with 128-bit integers (it doesn't seem like a one-line edit). Thanks!

@ubauer
Copy link
Member

ubauer commented Apr 28, 2024

There is a branch at https://github.com/Ripser/ripser/tree/128bit implementing this. It should be mostly up to date, and it's actually pretty much a one line edit. Since 128 bit ints seem to be non standardized, this might not work out of the box for some compilers.

@julien-tierny
Copy link

Thanks!

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants