Add benchmark coverage for dynamic numeric range faceting #311

mikemccand · 2024-11-04T14:54:47Z

Lucene's dynamic numeric range faceting is a cool auto-ranging feature: it looks at the distribution of a numeric field's values across all collected results and picks "good" ranges so that another quantity (relevance, counts) is spread roughly evenly across the requested N ranges.
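
To make the idea concrete, here is a minimal Python sketch of that kind of auto-ranging. It is only an illustration under stated assumptions, not Lucene's actual implementation: the function name `dynamic_ranges`, the `(value, weight)` input shape, and the greedy cut rule are all hypothetical, and a real implementation would avoid a full sort and handle ties more carefully.

```python
from typing import List, Tuple


def dynamic_ranges(points: List[Tuple[int, int]], top_n: int) -> List[Tuple[int, int, int]]:
    """Split (value, weight) pairs into at most top_n contiguous value ranges.

    Hypothetical helper, not a Lucene API. Each returned tuple is
    (min_value, max_value, total_weight). The weight stands in for the
    "other field" (for example a count or a relevance score) that should
    end up roughly evenly spread across the ranges.
    """
    if top_n <= 0 or not points:
        return []

    ordered = sorted(points)                # order by the numeric field's value
    total = sum(weight for _, weight in ordered)
    target = total / top_n                  # ideal weight per range

    ranges: List[Tuple[int, int, int]] = []
    low = None                              # lower bound of the range being built
    acc = 0                                 # weight accumulated in that range

    for i, (value, weight) in enumerate(ordered):
        if low is None:
            low = value
        acc += weight
        is_last = i == len(ordered) - 1
        # Close the range once it reaches the target weight, but keep the
        # final allowed range open so it absorbs whatever is left over.
        if is_last or (acc >= target and len(ranges) < top_n - 1):
            ranges.append((low, value, acc))
            low = None
            acc = 0

    return ranges


if __name__ == "__main__":
    # Four values, heavily weighted toward the smallest one.
    print(dynamic_ranges([(10, 5), (20, 1), (30, 1), (40, 3)], top_n=2))
```

With the sample input, the heavily weighted value 10 gets a range of its own and the remaining three values share the second range, so both ranges carry the same total weight. A benchmark task would exercise the same behavior through Lucene's own faceting code rather than a sketch like this.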

There have been exciting optimizations to it recently: apache/lucene#13914

Let's get some coverage in our benchmarks, and maybe in the nightly benchmarks too?

houserjohn · 2025-02-06T04:07:47Z

Adding a summary of an offline discussion:

To add comprehensive benchmarks for dynamic numeric faceting, we would also need a corpus that has "many numbers." Options include the wikipedia line files (day_of_year, etc.), the NYC taxis corpus, or even the OpenStreetMap corpus (possible numeric fields). Random/synthetic datasets are discouraged because they are more likely to lead to random/synthetic conclusions.

Related work: GH#325 and GH#160 both add related datasets for benchmarks.

houserjohn · 2025-02-11T22:24:14Z

After looking into the wikipedia line files, the NYC taxis corpus, and the OpenStreetMap dataset, I believe that the NYC taxis corpus would be a great dataset for benchmarking dynamic range faceting in luceneutil:

The NYC corpus has many meaningful numbers: total_amt, tip_amt, rate_code, vendor_id. In comparison, from what I could find, the wikipedia line files have fewer meaningful numbers (in the context of dynamic range faceting) and instead focus on dates and the frequency of words. Additionally, the OpenStreetMap dataset, from what I saw, is primarily a series of latitude and longitude numbers, each of which has significantly less meaning when presented without the other (which is what dynamic range faceting would likely do). Working with data that is easier to reason about would likely make any benchmark easier to understand and contribute to.

stefanvodita mentioned this issue Feb 4, 2025

Task for dynamic ranges #334

Closed
