
RFC: add bincount #812

Open
ogrisel opened this issue Jun 5, 2024 · 5 comments
Labels
  • API extension: Adds new functions or objects to the API.
  • Needs Discussion: Needs further discussion.
  • RFC: Request for comments. Feature requests and proposed changes.

Comments

@ogrisel

ogrisel commented Jun 5, 2024

It seems quite widely adopted across array libraries (NumPy, PyTorch, JAX, CuPy, Dask), although I have not checked how compatible the implementations are.

JAX in particular requires an extra length argument to statically define the length of the output array to make JIT compilation possible, and Dask uses a (nan,) shape to represent a data-dependent output shape.
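
For illustration, a minimal sketch of that difference (assuming the current keyword names, minlength in NumPy and length in JAX):

import numpy as np
import jax
import jax.numpy as jnp

x = np.array([0, 1, 1, 3, 5])

# NumPy: the output length is data-dependent (max(x) + 1) unless padded with minlength.
np.bincount(x)                # shape (6,)
np.bincount(x, minlength=10)  # shape (10,)

# JAX: under jit the output shape must be static, so a length argument is needed.
@jax.jit
def counts(values):
    return jnp.bincount(values, length=10)  # shape (10,), known at trace time

counts(jnp.asarray(x))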

@kgryte added the RFC, Needs Discussion, and API extension labels on Jun 5, 2024
@rgommers
Member

Thanks for the suggestion @ogrisel!

bincount is indeed a heavily used function. I did a quick tally: scikit-learn currently has 117 usages of np.bincount, and SciPy 38. It's also used as a primitive for histogram & co, and it's not easy to implement efficiently with other existing functions in the standard. So from that perspective, I think it meets the bar for inclusion fairly easily.
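
(For context, a minimal sketch of the kind of workaround that is possible with only existing standard functions; it materializes an (n, num_bins) intermediate, so it is far less efficient than a native bincount.)

import numpy as np

def bincount_fallback(x, num_bins):
    # Emulate bincount via a one-hot comparison: O(n * num_bins) memory/time,
    # versus O(n + num_bins) for a native bincount.
    # Assumes x contains integers in [0, num_bins).
    return (x[:, None] == np.arange(num_bins)[None, :]).sum(axis=0)

x = np.array([0, 1, 1, 3, 5])
assert np.array_equal(bincount_fallback(x, 6), np.bincount(x))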

The signatures do all seem compatible. There is at least one behavioral difference I spotted: the accepted input is non-negative integer values, but negative values are not treated the same everywhere; they may raise (NumPy) or be clipped (JAX).
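
(A small sketch of that divergence; the exact error message and clipping semantics are version-dependent, so treat this as illustrative only.)

import numpy as np
import jax.numpy as jnp

x = [-1, 0, 2]

# NumPy rejects negative values outright.
try:
    np.bincount(x)
except ValueError as exc:
    print("NumPy raised:", exc)

# JAX clips negative values instead, so the -1 is counted in bin 0 here.
print(jnp.bincount(jnp.asarray(x)))  # e.g. [2 0 1]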

The NumPy issue tracker contains a lot of open issues about bincount, so the constraints on values, weights etc. seem to need careful specification.

The PyTorch docs also contain a warning about non-deterministic gradients, probably related to the data-dependent behavior.

> JAX in particular requires an extra length argument to statically define the length of the output array to make JIT compilation possible, and Dask uses a (nan,) shape to represent a data-dependent output shape.

Yes, that's non-ideal. CuPy also adds a warning about bincount possibly causing synchronization.

The number of APIs that use data-dependent shapes is still quite low. As of now, I think this is the full list:

  • boolean indexing
  • the four unique_* functions
  • repeats
  • nonzero

@NeilGirdhar

NeilGirdhar commented Jun 15, 2024

Just curious, but are you sure you want to keep using the name bincount? It seems redundant: "binning" means counting. What about bin_integers, bin, bucket, or integer_histogram?

@ogrisel
Author

ogrisel commented Jun 16, 2024

To me, 'binning' means assigning to a bin: e.g. an array of continuous values (floating point values) is transformed into an array of discrete values (bin indices), typically represented as integers. This is a synonym for discretization.

Then you can aggregate those values into a histogram of per-bin counts if you wish, but that's not the only thing you can do. In machine learning, binned values are used in various ways; only some involve histograms, and often only on filtered subsets of the original binned array.
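
(A small NumPy sketch of that distinction, assuming the usual np.digitize / np.bincount semantics: the first step is the binning/discretization, the counting is a separate step on top of it.)

import numpy as np

values = np.array([0.1, 0.7, 0.75, 2.3, 3.9])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Binning / discretization: map each continuous value to a bin index.
bin_indices = np.digitize(values, edges) - 1                 # -> [0, 0, 0, 2, 3]

# Counting is one possible follow-up aggregation, not part of the binning itself.
counts = np.bincount(bin_indices, minlength=len(edges) - 1)  # -> [3, 0, 1, 1]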

@NeilGirdhar

NeilGirdhar commented Jun 16, 2024

> To me, 'binning' means assigning to a bin: e.g. an array of continuous values (floating point values) is transformed into an array of discrete values (bin indices), typically represented as integers. This is a synonym for discretization.

Right, I agree that that's what happens when you "bin" floating point values. bincount is just the integer analogue of that: in the integer case, the bins are the discrete values themselves (they could even be ranges, although bincount doesn't support that).

> Then you can aggregate those values into a histogram of per-bin counts if you wish,

I understand what you're saying. But it's also a fact that the result of bincount is by definition already a histogram. This is clear because you can convert any call to bincount into a similar call to histogram (possibly massaging the output to get the types and shapes right).
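
(For example, a quick NumPy sketch of that equivalence, assuming non-negative integer input; output dtypes may differ slightly between libraries.)

import numpy as np

x = np.array([0, 1, 1, 3])

via_bincount = np.bincount(x)
via_histogram, _ = np.histogram(x, bins=np.arange(x.max() + 2))

assert np.array_equal(via_bincount, via_histogram)  # both give [1, 2, 0, 1]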

I still think that "bincount" is a bad name. It is just "binning", and its name should reflect the technical term that people use. No one calls it the "bincount" or "bincounting".

If I had to guess, I think its name reflects a very human tendency to combine two names that would have each been fine:

  • numpy.bin would have been fine, or
  • numpy.count might have been okay.

I think someone just glued these two names. This is like when people say "step foot" when "set foot" or "step" would each be fine, but the combination just isn't.

@NeilGirdhar

NeilGirdhar commented Jun 16, 2024

An alternative to choosing a different name would be to replace bincount with a true integer_histogram:

def integer_histogram(x,
                      /,
                      bins: int | Array,
                      range: None | tuple[int, int] = None,
                      density: bool = False,
                      weights: Array | None = None,
                      ) -> Array:
    """Compute the histogram of an integer dataset.

    Args:
        x: The input data.  The histogram is computed over the flattened array.
        bins: If bins is an int, it defines the number of equal-width bins in the given range.
            If bins is an integer array, it defines a monotonically increasing array of bin edges,
            including the rightmost edge, which allows for non-uniform bin widths.
        range: The lower and upper range of the bins.  range[1] - range[0] + 1 must be divisible by
            the integer bins.
        density: If False, the result will contain the number of samples in each bin.  If True, the
            result is the value of the probability density function at the bin, normalized such that
            the integral over the range is 1.  Note that the sum of the histogram values will not be
            equal to 1 unless bins of unit width are chosen; it is not a probability mass function.
        weights: An array of weights, of the same shape as x.  Each value in x only contributes its
            associated weight towards the bin count (instead of 1).  If density is True, the weights
            are normalized so that the integral of the density over the range remains 1.
    """

It's just a generalization that supports normalization and general-width bins.

Scanning how bincount is actually used, it seems that normalization is very common. So even if we don't add range and a general bins argument, I think it might be good to add density to cover this common case.
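
(As a rough sketch of what density would mean for the plain bincount case with unit-width integer bins; this is the normalization users currently write by hand.)

import numpy as np

x = np.array([0, 1, 1, 3])

counts = np.bincount(x)
density = counts / counts.sum()  # with unit-width bins the integral is just the sum
# density -> [0.25, 0.5, 0.0, 0.25]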
