Add benchmark coverage for dynamic numeric range faceting #311

Open
mikemccand opened this issue Nov 4, 2024 · 2 comments

Comments

@mikemccand
Owner

Lucene's dynamic numeric range faceting is a cool auto-ranging feature that looks at the distribution of values for a numeric field among all collected results and picks "good" ranges by roughly evenly distributing another field (relevance, counts) across the requested N ranges.
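To make the idea concrete, here is a minimal sketch of weight-balanced range picking in plain Python. It is not Lucene's implementation or API: the function name `dynamic_ranges` and its return shape are made up for illustration, and it assumes the simplest possible approach (sort by value, then cut greedily whenever a range has accumulated about 1/N of the total weight).

```python
from typing import List, Sequence, Tuple

# Hypothetical helper for illustration only; not part of Lucene or luceneutil.
def dynamic_ranges(
    values: Sequence[int], weights: Sequence[int], n: int
) -> List[Tuple[int, int, int, int]]:
    """Split (value, weight) pairs into at most n contiguous value ranges whose
    total weights are roughly equal.

    Returns a list of (min_value, max_value, count, weight_sum) tuples.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")

    pairs = sorted(zip(values, weights))
    if not pairs:
        return []

    # Each range aims for an equal share of the total weight.
    target = sum(w for _, w in pairs) / n

    ranges = []
    lo = pairs[0][0]
    acc = 0
    count = 0
    for i, (v, w) in enumerate(pairs):
        acc += w
        count += 1
        is_last = i == len(pairs) - 1
        # Close the range once it reaches its share; the final range
        # absorbs whatever is left over.
        if is_last or (acc >= target and len(ranges) < n - 1):
            ranges.append((lo, v, count, acc))
            if not is_last:
                lo = pairs[i + 1][0]
            acc = 0
            count = 0
    return ranges
```

With uniform weights this reduces to equal-count buckets; with skewed weights (say, relevance scores) a single heavy hit can fill a range by itself. The sketch ignores details a real implementation has to handle, such as equal values straddling a range boundary.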

It has been getting some exciting optimizations recently: apache/lucene#13914

Let's get some coverage in our benchmarks, and maybe nightly benchmarks?

@houserjohn

Adding a summary of an offline discussion:

To add comprehensive benchmarks for dynamic numeric range faceting, we would also need a corpus that has many numeric fields. Options include the Wikipedia line files (day_of_year, etc.), the NYC taxis corpus, or even the OpenStreetMap corpus (which may have usable numeric fields). Random/synthetic datasets are discouraged because their value distributions do not resemble real data, so any conclusions drawn from them are likely to be just as synthetic.

Related work: GH#325 and GH#160 both add related datasets for benchmarks.

@houserjohn

After looking into the Wikipedia line files, the NYC taxis corpus, and the OpenStreetMap dataset, I believe that the NYC taxis corpus would be a great dataset for benchmarking dynamic range faceting in luceneutil:

  • The NYC taxis corpus has many meaningful numbers: total_amt, tip_amt, rate_code, vendor_id. In comparison, from what I could find, the Wikipedia line files have fewer meaningful numbers (in the context of dynamic range faceting) and instead focus on dates and word frequencies. Additionally, the OpenStreetMap dataset, from what I saw, is primarily a series of latitude and longitude values, which carry much less meaning when one is presented without the other (as would likely happen in dynamic range faceting, since it ranges over a single field). Working with data that is easier to reason about would likely make any benchmark easier to understand and contribute to.
