Use multi-select instead of a full sort for DynamicRange creation #13914

HoustonPutman · 2024-10-15T00:21:53Z

Resolves #13760

Description

This is using a similar approach to how Solr used to compute multiple percentiles at a single time. Basically utilize the quick select method, but instead of following a single path, follow a path for each of the ks that is requested. Multi-quickselect.

That's what I originally made, until I realized that the DynamicRangeUtil is weighted, so I refactored it to choose by weights instead, and also capture the running-value-total and running-weight-total, because that information is used in the DynamicRangeInfo.

My goal was to add this as a generic capability of the Selector (or IntroSelector) class, but because of the limitations above, it is currently a separate class to handle this. If there's any suggestions on how to make this generic enough to be put in the generic class, that would be great. But it might not be worth the effort if it wouldn't be used anywhere else.

As for the original multi-quickSelect algorithm I mentioned, I looked for other multi-select use cases across Lucene, but I only found one instance (ScalarQuantizer does two select calls in succession). If there's more instances we can find, I would be happy to add multiSelect as an option on the Selector class, and implement it in all provided classes.

To-Do

The code needs to be cleaned up and better documented, this is just a POC
Benchmarks comparing this to the full-sorting implementation.

Caveat

The implement is slightly different, as it will pick the groups according to "The first value for which the running weight is <= weight-range-boundary". The old logic would start counting again after a weight range was complete, which removes information from the overflow of previous weight-ranges. I'm not sure either approach is right or wrong, but I wanted to explicitly state how the results would be different and why I had to alter a unit test to pass.

mikemccand · 2024-10-15T14:07:45Z

I have not looked closely but this sounds very cool!!

stefanvodita

Thank you @HoustonPutman, this is really interesting!

The old logic would start counting again after a weight range was complete, which removes information from the overflow of previous weight-ranges

Isn't there a risk with this PR that we would have a heavily weighted item at the end of a range that would make it so the next range is empty or almost empty?

stefanvodita · 2024-10-15T23:12:16Z

lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java

      List<DynamicRangeUtil.DynamicRangeInfo> mockResult,
      List<DynamicRangeUtil.DynamicRangeInfo> expectedResult) {
-    return mockResult.size() == expectedResult.size() && mockResult.containsAll(expectedResult);


Oops, thanks for changing this!

stefanvodita · 2024-10-19T19:34:47Z

lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java

+    double rangeWeightTarget = (double) totalWeight / topN;
+    double[] kWeights = new double[topN];
+    for (int i = 0; i < topN; i++) {
+      kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;


There could be some subtlety here I don't understand, but I'm wondering if we can we make this simpler.

for (int i = 1; i < topN; i++) { kWeights[i] = i * rangeWeightTarget; }

The array should be initialised with zeros by default, so we can also write

for (int i = 1; i < topN; i++) { kWeights[i] = kWeights[i - 1] + rangeWeightTarget; }

Wow yeah, both are better (though I like the first). This is the beauty of PR reviews haha. When you are 500 lines into a change, who knows what dumb things you will write...

I thought maybe you wanted to avoid the multiplications 😄
Which would be fair, my guess is the second one is faster because we're only doing sums and referencing values in the array that are cached.

stefanvodita · 2024-10-19T19:47:35Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+      long beforeTotalValue,
+      long rangeWeight,
+      long beforeWeight,
+      double[] kWeights) {


kWeights doesn't communicate to me what these are. I wonder if there's a more descriptive name we could use or otherwise if we could explain in a comment. We use this k prefix a lot.

That's a very fair point. I struggled naming this. Basically the k prefix is for choosing where to select. So kWeights is the weight-cutoffs that you want to select. if you have a total weight of 100 and want to group into 5, then kWeights would be [20,40,60,80,100]. Very open to better naming anywhere!

Does it make sense to replace k with quantile maybe?

stefanvodita · 2024-10-19T19:58:58Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+      this.random = new SplittableRandom();
+    }
+    SplittableRandom random = this.random;
+    for (int i = to - 1; i > from; i--) {


Why do we need to go in descending order?

stefanvodita · 2024-10-19T20:01:01Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+    }
+    SplittableRandom random = this.random;
+    for (int i = to - 1; i > from; i--) {
+      swap(i, random.nextInt(from, i + 1));


We'll end up swapping an element with itself quite often. Is it worth checking for that case in the swap method and exiting right away?

I haven't even looked at this method. It was straight copied from IntroSelector.

After doing some research, this seems to be the right way of doing it according to the algorithm they specified: https://en.wikipedia.org/wiki/Fisher–Yates_shuffle#The_modern_algorithm

stefanvodita · 2024-10-19T20:13:37Z

lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java

@@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) {
   *     is used to compute the equi-weight per bin.
   */
  public static List<DynamicRangeInfo> computeDynamicNumericRanges(
-      long[] values, long[] weights, int len, long totalWeight, int topN) {
+      long[] values, long[] weights, int len, long totalValue, long totalWeight, int topN) {


Noting that this can go into 10.1 despite being an API change since this class is marked experimental. Could you add an entry to CHANGES.txt?

stefanvodita · 2024-10-19T20:19:21Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(


Should we add some Javadoc explaining what you get if you run this method, maybe with a small example?

Absolutely. Was going to go through and add docs, just wanted to make sure it was a good direction to go in first. Probably worth doing the benchmarking first 🥹

stefanvodita · 2024-10-19T20:27:42Z

lucene/facet/src/test/org/apache/lucene/facet/range/TestDynamicRangeUtil.java

@@ -80,24 +84,25 @@ public void testComputeDynamicNumericRangesWithOneLargeWeight() {
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 14L, 14L, 14D));
    expectedRangeInfoList.add(
        new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 163.16666666666666D));
-    assertDynamicNumericRangeResults(values, weights, 4, 55109, expectedRangeInfoList);
+    assertDynamicNumericRangeResults(values, weights, 4, 993, 55109, expectedRangeInfoList);
  }

  private static void assertDynamicNumericRangeResults(


Strange things can happen if many or all the weights are zero. I've dealt with that for the Amazon use-case. I wonder if we're handling those situations well in this PR. Should we add a test?

stefanvodita · 2024-10-19T20:28:57Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+  }
+
+  // Visible for testing.
+  void select(


This is really interesting, but it goes a little over my head. Really curious to see the benchmark results!

Yeah, the 3-way partitioning was also quite confusing to me until I looked it up. And even then, the code is still quite hard to understand. I copied the default implementation from IntroSelector, then modified it to support multi-select, and select by cumulative weight, not by ordinal. So a lot of the complexity/confusion I can't necessarily speak to. Maybe this would be clearer if in the Javadocs of the class, it called out IntroSelector as the base algorithm?

stefanvodita · 2024-10-19T20:29:25Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+    if ((size = to - from) > 3) {
+
+      if (--maxDepth == -1) {
+        // Max recursion depth exceeded: shuffle (only once) and continue.


Can we also say why?

This is from IntroSelector, but basically I think it's saying if I've done enough recursions in QuickSelect, that means that our data has a really bad distribution? So just randomize it a bit and continue. I don't have an opinion as I haven't studied it, but hopefully there was research put into the idea?

HoustonPutman

Isn't there a risk with this PR that we would have a heavily weighted item at the end of a range that would make it so the next range is empty or almost empty?

Yes, that would be a risk. But in the existing implementation, the last range would be almost empty instead. Either way the heavily weighted item has to take space from some group. So in my mind, it's easier to understand that the groups that you are given back better represent the actual quantiles, versus leaving the small group for the end. Users might actually be interested in that last quantile the most.

HoustonPutman · 2024-10-22T15:26:37Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(


Absolutely. Was going to go through and add docs, just wanted to make sure it was a good direction to go in first. Probably worth doing the benchmarking first 🥹

HoustonPutman · 2024-10-22T15:28:47Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+      long beforeTotalValue,
+      long rangeWeight,
+      long beforeWeight,
+      double[] kWeights) {


That's a very fair point. I struggled naming this. Basically the k prefix is for choosing where to select. So kWeights is the weight-cutoffs that you want to select. if you have a total weight of 100 and want to group into 5, then kWeights would be [20,40,60,80,100]. Very open to better naming anywhere!

HoustonPutman · 2024-10-22T15:31:30Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+  }
+
+  // Visible for testing.
+  void select(


Yeah, the 3-way partitioning was also quite confusing to me until I looked it up. And even then, the code is still quite hard to understand. I copied the default implementation from IntroSelector, then modified it to support multi-select, and select by cumulative weight, not by ordinal. So a lot of the complexity/confusion I can't necessarily speak to. Maybe this would be clearer if in the Javadocs of the class, it called out IntroSelector as the base algorithm?

HoustonPutman · 2024-10-22T15:33:03Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+    if ((size = to - from) > 3) {
+
+      if (--maxDepth == -1) {
+        // Max recursion depth exceeded: shuffle (only once) and continue.


This is from IntroSelector, but basically I think it's saying if I've done enough recursions in QuickSelect, that means that our data has a really bad distribution? So just randomize it a bit and continue. I don't have an opinion as I haven't studied it, but hopefully there was research put into the idea?

HoustonPutman · 2024-10-22T15:38:10Z

lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java

+    double rangeWeightTarget = (double) totalWeight / topN;
+    double[] kWeights = new double[topN];
+    for (int i = 0; i < topN; i++) {
+      kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;


Wow yeah, both are better (though I like the first). This is the beauty of PR reviews haha. When you are 500 lines into a change, who knows what dumb things you will write...

HoustonPutman · 2024-10-22T16:03:45Z

lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java

+    }
+    SplittableRandom random = this.random;
+    for (int i = to - 1; i > from; i--) {
+      swap(i, random.nextInt(from, i + 1));


I haven't even looked at this method. It was straight copied from IntroSelector.

After doing some research, this seems to be the right way of doing it according to the algorithm they specified: https://en.wikipedia.org/wiki/Fisher–Yates_shuffle#The_modern_algorithm

github-actions · 2024-11-10T00:24:01Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

houserjohn · 2025-02-06T03:35:52Z

This is a great improvement for Dynamic Ranges @HoustonPutman! After looking into some more test cases, I believe there may be a bug for some unsorted value lists. Consider this unit test:

public void testComputeDynamicNumericRangesWithMisplacedValue() {
    List<DynamicRangeUtil.DynamicRangeInfo> expectedRangeInfoList = new ArrayList<>();
    long[] values =
        new long[] {
          1, 2, 11, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 12, 111, 112, 113, 114, 115
        };
    long[] weights =
        new long[] {
          2, 3, 12, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 13, 112, 113, 114, 115, 116
        };

    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(8, 444, 1L, 104L, 54.5D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 430, 105L, 108L, 106.5D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(4, 446, 109L, 112L, 110.5D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(3, 345, 113L, 115L, 114.0D));
    assertDynamicNumericRangeResults(values, weights, 4, 1646, 1665, expectedRangeInfoList);
  }

With the following error (Notice values marked with **):

java.lang.AssertionError: expected:<[DynamicRangeInfo[count=8, weight=444, min=**1**, max=104, centroid=54.5], DynamicRangeInfo[count=4, weight=430, min=105, max=108, centroid=106.5], DynamicRangeInfo[count=4, weight=446, min=109, max=112, centroid=110.5], DynamicRangeInfo[count=3, weight=345, min=113, max=115, centroid=114.0]]> but was:<[DynamicRangeInfo[count=8, weight=444, min=**12**, max=104, centroid=54.5], DynamicRangeInfo[count=4, weight=430, min=105, max=108, centroid=106.5], DynamicRangeInfo[count=4, weight=446, min=109, max=112, centroid=110.5], DynamicRangeInfo[count=3, weight=345, min=113, max=115, centroid=114.0]]>

I have also posted a fix in the review, but there may be better solutions.

houserjohn · 2025-02-06T03:38:50Z

lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java

+    int lastIdx = -1;
+    long lastTotalValue = 0;
+    long lastTotalWeight = 0;
+    for (int kIdx = 0; kIdx < topN; kIdx++) {


It appears that because the partitions are not sorted, you may need to search for the min and max of a range in a different way like:

WeightedSelector.WeightRangeInfo weightRangeInfo = kIndexResults[kIdx]; if (weightRangeInfo.index() > -1) { int count = weightRangeInfo.index() - lastIdx; + long min = values[lastIdx + 1]; + long max = values[lastIdx + 1]; + for (int i = lastIdx + 2; i < weightRangeInfo.index() + 1; i++) { + min = Math.min(min, values[i]); + max = Math.max(max, values[i]); + } dynamicRangeResult.add( new DynamicRangeInfo( count, (weightRangeInfo.runningWeight() - lastTotalWeight), - values[lastIdx + 1], - values[weightRangeInfo.index()], + min, + max, (double) (weightRangeInfo.runningValueSum() - lastTotalValue) / count)); lastIdx = weightRangeInfo.index(); lastTotalValue = weightRangeInfo.runningValueSum();

Thank you for highlighting this @houserjohn! I wonder if we're negating the performance improvement by iterating through all those values. @HoustonPutman - what do you think?

Yeah, so we should be finding either the min or the max here, given the algorithm needs to "select" given points and those points have to be one or the other. Your test tells me that this is the max, since the min values are incorrect.

Another way of doing this is to also "select" the min, so modify the algorithm to require two matches instead of one for each bucket... I'd have to take some time to see what this would take, but I think it should definitely be faster.

Note, given this change, we would be iterating through each bucket, this would just add O(n) time (O(n₁) + O(n₂) + ...), to the already ~O(n) quick-select algorithm. Still quicker than the O(nlogn) sort option. But I'd like to see if we could bake this into the algorithm in a smarter way

So it's actually simpler than I thought. Basically we already have the next minimum value if the quantile is found in the last value of the bottom group, because it is our pivot value. Same for the top group, if the last value of it is our quantile maximum, then either it is the last value in the list (at which point there is no next-minimum to find), or it is under some other pivot value we have found (so no need to find it, the pivot is already sorted correctly).

The only time we need to do something is when the last pivot value is the end of a quantile, then we need to find the bottom of the top group. So we pass to select() that we want to select the minimum in that range.

In the select() we only need to find the minimum for the bottom group after pivoting. If this bottom group contains a quantile-end, then just recursively tell it to find the minimum. If the bottom group does not contain a quantile-end, then we won't be touching these values again. So go through them and find the minimum, swapping it into the from position.

So just a little more work for the algorithm, adding maybe O(log(n) * k) time. That's just a guess though.

Wait it's even simpler than that. In the pivoting, we are already comparing all of the values. If the range contains a quantile minimum (logic still the same as described above), then keep track of the minimum in the bottom group, and swap it into the from position at the end of pivoting...

Hmm maybe this would end up with more comparisons though... Since in the logic above, you are only adding comparisons for the remaining elements, where in this logic you are searching for the minimum across a much larger range (the initial bottom group)... Will think about this.

Nice fix @HoustonPutman, I can also confirm that the latest commit did fix the minimum value in a range bug for testComputeDynamicNumericRangesWithMisplacedValue.

Additionally, I also thought about tracking the range minimum while doing the pivot comparisons. It's interesting to compare that to the former method you described. I am also working on a benchmark for Dynamic Range Faceting in luceneutil. Maybe when that is finished, we can run both implementations and empirically determine which version is more optimal?

Sounds good! It's pretty trivial to switch over to the other implementation, so happy to test that out when the benchmark is available!

…elect

houserjohn · 2025-02-14T02:07:59Z

Hey @HoustonPutman, I just published GH#14238 which contains all of the unit tests that I've created so far. Note that there was a slight API change between the main branch and this PR, so I included some unit tests that work for this PR below. While some of these tests are not considering the caveat (the change in behavior) you mentioned, I believe there are a few unit tests that capture a few existing issues. For instance:

public void testComputeDynamicNumericRangesWithLargeTopN() {
    List<DynamicRangeUtil.DynamicRangeInfo> expectedRangeInfoList = new ArrayList<>();
    long[] values = new long[] {487, 439, 794, 277};
    long[] weights = new long[] {59, 508, 736, 560};

    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 560L, 277L, 277L, 277D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 508L, 439L, 439L, 439D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(2, 795L, 487L, 794L, 640.5D));
    assertDynamicNumericRangeResults(values, weights, 42, 1997L, 1863L, expectedRangeInfoList);
  }

Gives the exception:

java.lang.IllegalArgumentException: All kWeights must be < beforeWeight + rangeWeight
    at __randomizedtesting.SeedInfo.seed([913AAD1D60B9263B:FFD4EE9EA025DBF]:0)
    at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.util.WeightedSelector.checkArgs(WeightedSelector.java:82)
    at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.util.WeightedSelector.select(WeightedSelector.java:57)
    at org.apache.lucene.facet.range.DynamicRangeUtil.computeDynamicNumericRanges(DynamicRangeUtil.java:266)
    at org.apache.lucene.facet.range.TestDynamicRangeUtil.testComputeDynamicNumericRangesWithLargeTopN(TestDynamicRangeUtil.java:169)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)

I've tried to track down this bug, and I haven't quite fixed it, but I believe the fix is related to these lines:

--- a/lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java
+++ b/lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java
@@ -216,7 +216,7 @@ public final class DynamicRangeUtil {
       return dynamicRangeResult;
     }
 
-    double rangeWeightTarget = (double) totalWeight / topN;
+    double rangeWeightTarget = (double) totalWeight / Math.min(topN, len);
     double[] kWeights = new double[topN];
     for (int i = 0; i < topN; i++) {
       kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;
--

Additionally:

public void testComputeDynamicNumericRangesWithSameWeights() {
    List<DynamicRangeUtil.DynamicRangeInfo> expectedRangeInfoList = new ArrayList<>();
    long totalValue = 0;
    long[] values = new long[100];
    long[] weights = new long[100];
    for (int i = 0; i < 100; i++) {
      values[i] = i;
      weights[i] = 50;
      totalValue += i;
    }

    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 0L, 24L, 12.0D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 25L, 49L, 37.0D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 50L, 74L, 62.0D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 75L, 99L, 87.0D));
    assertDynamicNumericRangeResults(values, weights, 4, totalValue, 5000L, expectedRangeInfoList);
  }

Gives (Important values marked with **):

    java.lang.AssertionError: expected:<[DynamicRangeInfo[count=**25**, weight=1250, min=0, max=24, centroid=12.0], DynamicRangeInfo[count=25, weight=1250, min=25, max=49, centroid=37.0], DynamicRangeInfo[count=25, weight=1250, min=50, max=74, centroid=62.0], DynamicRangeInfo[count=25, weight=1250, min=75, max=99, centroid=87.0]]> but was:<[DynamicRangeInfo[count=**26**, weight=1300, min=0, max=25, centroid=12.5], DynamicRangeInfo[count=25, weight=1250, min=26, max=50, centroid=38.0], DynamicRangeInfo[count=25, weight=1250, min=51, max=75, centroid=63.0], DynamicRangeInfo[count=24, weight=1200, min=76, max=99, centroid=87.5]]>
        at __randomizedtesting.SeedInfo.seed([DA6EB4C0C0CA4022:DBA0A4538AB03899]:0)
        at junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
        at junit@4.13.1/org.junit.Assert.failNotEquals(Assert.java:835)
        at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:120)
        at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:146)
        at org.apache.lucene.facet.range.TestDynamicRangeUtil.compareDynamicRangeResult(TestDynamicRangeUtil.java:361)
        at org.apache.lucene.facet.range.TestDynamicRangeUtil.assertDynamicNumericRangeResults(TestDynamicRangeUtil.java:351)
        at org.apache.lucene.facet.range.TestDynamicRangeUtil.testComputeDynamicNumericRangesWithSameWeights(TestDynamicRangeUtil.java:156)

I know you mentioned there is a change in behavior in the caveat, but I do believe that this example should probably return ranges with equal counts.

HoustonPutman · 2025-02-17T18:12:09Z

I know you mentioned there is a change in behavior in the caveat, but I do believe that this example should probably return ranges with equal counts.

This one was a <= that should have been a <. I've fixed it and the tests pass.

I've tried to track down this bug, and I haven't quite fixed it, but I believe the fix is related to these lines:

Yeah this one was an issue with floating point math. I've set it such that the last quantile will always be the total, no need to do math for that.

The test still returns different results, but at least it fails without an exception.

houserjohn · 2025-02-18T19:37:02Z

@HoustonPutman I can confirm that the latest commits fixed the exception in testComputeDynamicNumericRangesWithLargeTopN and the issue in testComputeDynamicNumericRangesWithSameWeights.

Some of the randomized testing included in GH#14238 revealed some more bugs (they also revealed some of my own bugs in GH#14238):

public void testComputeDynamicNumericRangesWithSameWeightsOutOfOrder() {
    List<DynamicRangeUtil.DynamicRangeInfo> expectedRangeInfoList = new ArrayList<>();
    long[] values =
        new long[] {
          20, 15, 59, 49, 13, 93, 72, 21, 36, 81, 57, 1, 90, 79, 16, 51, 7, 17, 25, 63, 12, 5, 83,
          66, 48, 43, 55, 78, 64, 77, 65, 73, 80, 37, 54, 50, 95, 31, 97, 3, 82, 29, 70, 26, 4, 46,
          34, 67, 87, 0, 30, 19, 41, 85, 84, 89, 8, 10, 22, 28, 6, 23, 88, 40, 33, 44, 18, 27, 69,
          38, 91, 98, 62, 14, 35, 2, 92, 47, 94, 75, 32, 99, 86, 71, 74, 24, 52, 96, 9, 58, 39, 76,
          56, 11, 53, 61, 42, 68, 60, 45
        };
    long[] weights =
        new long[] {
          50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
          50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
          50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
          50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
          50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50
        };

    // This is testComputeDynamicNumericRangesWithSameWeightsShuffled with seed
    // 9AE79D72C8DD56D8
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 0L, 24L, 12.0D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 25L, 49L, 37.0D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 50L, 74L, 62.0D));
    expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 75L, 99L, 87.0D));
    assertDynamicNumericRangeResults(values, weights, 4, 4950L, 5000L, expectedRangeInfoList);
}

Gives (look at **):

   >     java.lang.AssertionError: expected:<[DynamicRangeInfo[count=25, weight=1250, min=0, max=24, centroid=12.0], DynamicRangeInfo[count=25, weight=1250, **min=25**, max=49, centroid=37.0], DynamicRangeInfo[count=25, weight=1250, min=50, max=74, centroid=62.0], DynamicRangeInfo[count=25, weight=1250, min=75, max=99, centroid=87.0]]> but was:<[DynamicRangeInfo[count=25, weight=1250, min=0, max=24, centroid=12.0], DynamicRangeInfo[count=25, weight=1250, **min=43**, max=49, centroid=37.0], DynamicRangeInfo[count=25, weight=1250, min=50, max=74, centroid=62.0], DynamicRangeInfo[count=25, weight=1250, min=75, max=99, centroid=87.0]]>
   >         at __randomizedtesting.SeedInfo.seed([1E722DB63C0BE5AA:F241CBDDDDB3FF8C]:0)
   >         at junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
   >         at junit@4.13.1/org.junit.Assert.failNotEquals(Assert.java:835)
   >         at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:120)
   >         at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:146)
   >         at org.apache.lucene.facet.range.TestDynamicRangeUtil.assertDynamicNumericRangeResults(TestDynamicRangeUtil.java:389)
   >         at org.apache.lucene.facet.range.TestDynamicRangeUtil.testComputeDynamicNumericRangesWithSameWeightsOutOfOrder(TestDynamicRangeUtil.java:172)

I believe that this is still related to the minimum in a range bug. Note that this result is with the latest commits you added. Additionally, I think it might be helpful if you run some of these randomized tests overnight to reduce some of the back and forth. I'll post another comment later today with a modification of those randomization tests that you should be able to run.

houserjohn · 2025-02-19T07:29:34Z

Here are the promised modified randomized unit tests. These should work with your API change, but you might need to modify them to suit the caveat you mentioned. Of course, add the correct imports:

public void testComputeDynamicNumericRangesWithSameWeightsShuffled() {
  List<DynamicRangeUtil.DynamicRangeInfo> expectedRangeInfoList = new ArrayList<>();
  long[] values = new long[100];
  long[] weights = new long[100];
  for (int i = 0; i < 100; i++) {
    values[i] = i;
    weights[i] = 50;
  }

  // Shuffling the values and weights should not change the answer between runs
  // We expect that returned ranges should come in a strict, deterministic order
  // with the same values and weights
  shuffleValuesWeights(values, weights);
  expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 0L, 24L, 12.0D));
  expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 25L, 49L, 37.0D));
  expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 50L, 74L, 62.0D));
  expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(25, 1250L, 75L, 99L, 87.0D));
  assertDynamicNumericRangeResults(values, weights, 4, 4950L, 5000L, expectedRangeInfoList);
}
  
public void testComputeDynamicNumericRangesWithSameValuesShuffled() {
  List<DynamicRangeUtil.DynamicRangeInfo> expectedRangeInfoList = new ArrayList<>();
  long totalWeight = 0;
  long[] values = new long[100];
  long[] weights = new long[100];
  for (int i = 0; i < 100; i++) {
    values[i] = 50;
    weights[i] = i;
    totalWeight += i;
  }

  // Shuffling the values and weights should not change the answer between runs
  // We expect that returned ranges should come in a strict, deterministic order
  // with the same values and weights
  shuffleValuesWeights(values, weights);
  expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(51, 1275L, 50L, 50L, 50D));
  expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(21, 1281L, 50L, 50L, 50D));
  expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(16, 1272L, 50L, 50L, 50D));
  expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(12, 1122L, 50L, 50L, 50D));

  assertDynamicNumericRangeResults(values, weights, 4, 5000L, totalWeight, expectedRangeInfoList);
}

public void testComputeDynamicNumericRangesWithRandomValues() {
  int arraySize = random().nextInt(100);
  long[] values = new long[arraySize];
  long[] weights = new long[arraySize];

  for (int i = 0; i < arraySize; i++) {
    values[i] = random().nextLong(1000);
    weights[i] = random().nextLong(1000);
  }

  int topN = random().nextInt(100);

  long totalWeight = 0;
  long totalValue = 0;
  for (int i = 0; i < arraySize; i++) {
    totalWeight += weights[i];
    totalValue += values[i];
  }

  assertDynamicNumericRangeValidProperties(values, weights, topN, totalValue, totalWeight);
}
  
private static void assertDynamicNumericRangeValidProperties(
    long[] values, long[] weights, int topN, long totalValue, long totalWeight) {

  List<WeightedPair> sortedPairs = new ArrayList<>();
  for (int i = 0; i < values.length; i++) {
    long value = values[i];
    long weight = weights[i];
    WeightedPair pair = new WeightedPair(value, weight);
    sortedPairs.add(pair);
  }

  sortedPairs.sort(
      Comparator.comparingLong(WeightedPair::value).thenComparingLong(WeightedPair::weight));

  int len = values.length;

  double rangeWeightTarget = (double) totalWeight / Math.min(topN, len);

  List<DynamicRangeUtil.DynamicRangeInfo> mockDynamicRangeResult =
      DynamicRangeUtil.computeDynamicNumericRanges(
          values, weights, values.length, totalValue, totalWeight, topN);

  // Zero requested ranges (TopN) should return a empty list of ranges regardless of inputs
  if (topN == 0) {
    assertTrue(mockDynamicRangeResult.size() == 0);
    return; // Early return; do not check anything else
  }

  // Adjacent ranges do not overlap - only adjacent max-min can overlap
  for (int i = 0; i < mockDynamicRangeResult.size() - 1; i++) {
    DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(i);
    DynamicRangeUtil.DynamicRangeInfo nextRangeInfo = mockDynamicRangeResult.get(i + 1);
    assertTrue(rangeInfo.max() <= nextRangeInfo.min());
  }

  // The count of every range sums to the number of values
  int accuCount = 0;
  for (int i = 0; i < mockDynamicRangeResult.size(); i++) {
    DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(i);
    int count = rangeInfo.count();
    accuCount += count;
  }
  assertTrue(accuCount == len);

  // The sum of every range weight equals the total weight
  long accuWeight = 0;
  for (int i = 0; i < mockDynamicRangeResult.size(); i++) {
    DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(i);
    long weight = rangeInfo.weight();
    accuWeight += weight;
  }
  assertTrue(accuWeight == totalWeight);

  // All values appear in atleast one range
  for (int pairOffset = 0, rangeIdx = 0; rangeIdx < mockDynamicRangeResult.size(); rangeIdx++) {
    DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(rangeIdx);
    int count = rangeInfo.count();
    for (int i = pairOffset; i < pairOffset + count; i++) {
      WeightedPair pair = sortedPairs.get(i);
      long value = pair.value();
      assertTrue(rangeInfo.min() <= value && value <= rangeInfo.max());
    }
    pairOffset += count;
  }

  // The minimum/maximum of each range is actually the smallest/largest value
  for (int pairOffset = 0, rangeIdx = 0; rangeIdx < mockDynamicRangeResult.size(); rangeIdx++) {
    DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(rangeIdx);
    int count = rangeInfo.count();
    WeightedPair minPair = sortedPairs.get(pairOffset);
    WeightedPair maxPair = sortedPairs.get(pairOffset + count - 1);
    long min = minPair.value();
    long max = maxPair.value();
    assertTrue(rangeInfo.min() == min);
    assertTrue(rangeInfo.max() == max);
    pairOffset += count;
  }

  // Weights of each range is over the rangeWeightTarget - exclude last range
  for (int i = 0; i < mockDynamicRangeResult.size() - 1; i++) {
    DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(i);
    assertTrue(rangeInfo.weight() >= rangeWeightTarget);
  }

  // Removing the last weight from a range brings it under the rangeWeightTarget - exclude last
  // range
  for (int pairOffset = 0, rangeIdx = 0;
      rangeIdx < mockDynamicRangeResult.size() - 1;
      rangeIdx++) {
    DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(rangeIdx);
    int count = rangeInfo.count();
    WeightedPair lastPair = sortedPairs.get(pairOffset + count - 1);
    long lastWeight = lastPair.weight();
    pairOffset += count;
    assertTrue(rangeInfo.weight() - lastWeight < rangeWeightTarget);
  }

  // Centroids for each range are correct
  for (int pairOffset = 0, rangeIdx = 0; rangeIdx < mockDynamicRangeResult.size(); rangeIdx++) {
    DynamicRangeUtil.DynamicRangeInfo rangeInfo = mockDynamicRangeResult.get(rangeIdx);
    int count = rangeInfo.count();
    long accuValue = 0;
    for (int i = pairOffset; i < pairOffset + count; i++) {
      WeightedPair pair = sortedPairs.get(i);
      long value = pair.value();
      accuValue += value;
    }
    pairOffset += count;
    assertTrue(rangeInfo.centroid() == ((double) accuValue / count));
  }
}
  
 /** Implementation of Durstenfeld's algorithm for shuffling values and weights */
private static void shuffleValuesWeights(long[] values, long[] weights) {
  for (int i = values.length - 1; i > 0; i--) {
    int rdmIdx = random().nextInt(i + 1);
    long tmpValue = values[i];
    long tmpWeight = weights[i];
    values[i] = values[rdmIdx];
    weights[i] = weights[rdmIdx];
    values[rdmIdx] = tmpValue;
    weights[rdmIdx] = tmpWeight;
  }
}

/**
 * Holds parameters of a weighted pair.
 *
 * @param value the value of the pair
 * @param weight the weight of the pair
 */
private record WeightedPair(long value, long weight) {}

Additionally, here is a command that you can run from the command line to search for bugs:
First, ./gradlew clean and ./gradlew build

for i in {1..100}; do; echo $i; ./gradlew check; if [ $? -gt 0 ]; then; cat $path_to_test_output >> "../errors.txt"; fi; done;

$path_to_test_output should be a path ending with OUTPUT-org.apache.lucene.facet.range.TestDynamicRangeUtil.txt. This should be the same location that you go whenever you want to view the output from unit tests (which appears when a unit test fails after a build fails).

After the command finishes, all of the found bugs should be in ../error.txt.

Use multi-select instead of sort for Dynamic Ranges

d197f01

HoustonPutman added the module:facet label Oct 15, 2024

HoustonPutman requested a review from stefanvodita October 15, 2024 00:21

stefanvodita reviewed Oct 19, 2024

View reviewed changes

HoustonPutman commented Oct 25, 2024

View reviewed changes

mikemccand mentioned this pull request Nov 4, 2024

Add benchmark coverage for dynamic numeric range faceting mikemccand/luceneutil#311

Open

github-actions bot added the Stale label Nov 10, 2024

houserjohn reviewed Feb 6, 2025

View reviewed changes

github-actions bot removed the Stale label Feb 7, 2025

HoustonPutman added 2 commits February 12, 2025 11:37

Merge remote-tracking branch 'apache/main' into dynamic-range-multi-s…

5d4e273

…elect

Fix min bug, add test

af78c39

github-actions bot added the module:core/other label Feb 12, 2025

houserjohn mentioned this pull request Feb 14, 2025

[Unit] Increase Dynamic Range Faceting coverage by adding previously nonexistent unit tests #14238

Merged

HoustonPutman added 2 commits February 17, 2025 11:58

Fix equal buckets bug.

eb03663

Fix floating point bug when computing quantiles

93ef026

houserjohn mentioned this pull request Feb 19, 2025

[Unit] Increase Dynamic Range Faceting coverage and address edge cases #14258

Merged


		protected abstract long getValue(int i);

		public final WeightRangeInfo[] select(

Use multi-select instead of a full sort for DynamicRange creation #13914

Are you sure you want to change the base?

Use multi-select instead of a full sort for DynamicRange creation #13914

Conversation

HoustonPutman commented Oct 15, 2024 • edited Loading

Description

To-Do

Caveat

mikemccand commented Oct 15, 2024

stefanvodita left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HoustonPutman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Nov 10, 2024

houserjohn commented Feb 6, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

houserjohn commented Feb 14, 2025

HoustonPutman commented Feb 17, 2025

houserjohn commented Feb 18, 2025

houserjohn commented Feb 19, 2025

HoustonPutman commented Oct 15, 2024 •

edited

Loading

houserjohn commented Feb 6, 2025 •

edited

Loading