Lucene's dynamic numeric range faceting is a cool auto-ranging feature: it looks at the distribution of values for a numeric field among all collected results and picks "good" ranges so that another quantity (e.g. relevance or counts) is spread roughly evenly across the requested N ranges.
It has recently seen some exciting optimizations: apache/lucene#13914
Let's get some coverage in our benchmarks, and maybe nightly benchmarks?
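To make the range-picking idea above concrete, here is a minimal Python sketch. It is not Lucene's implementation; the function name `dynamic_ranges`, the `(value, weight)` input shape, and the greedy cut rule are all assumptions made for illustration. It sorts the collected hits by value and closes a range each time the accumulated weight reaches `total / top_n`.

```python
from typing import Dict, List, Tuple


def dynamic_ranges(pairs: List[Tuple[float, float]], top_n: int) -> List[Dict[str, float]]:
    """Split (value, weight) pairs into at most top_n ranges of roughly equal weight.

    Hypothetical helper for illustration only; not Lucene's algorithm or API.
    """
    if top_n <= 0 or not pairs:
        return []

    ordered = sorted(pairs, key=lambda p: p[0])  # sort hits by numeric value
    total = sum(weight for _, weight in ordered)
    target = total / top_n                       # weight each range should hold

    ranges: List[Dict[str, float]] = []
    lo = ordered[0][0]
    count = 0
    acc = 0.0
    for i, (value, weight) in enumerate(ordered):
        count += 1
        acc += weight
        is_last = i == len(ordered) - 1
        # Close the range once it holds enough weight, but never split equal
        # values across two ranges, and keep the final range open to the end.
        can_cut = (
            not is_last
            and acc >= target
            and len(ranges) < top_n - 1
            and ordered[i + 1][0] != value
        )
        if is_last or can_cut:
            ranges.append({"min": lo, "max": value, "count": count, "weight": acc})
            if not is_last:
                lo = ordered[i + 1][0]
            count = 0
            acc = 0.0
    return ranges
```

With a weight of 1 per hit this degenerates to equal-count buckets; with relevance scores as weights, high-scoring regions get narrower ranges. A benchmark task would exercise the real Lucene code path, so the sketch is only meant to show what kind of output the feature produces.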
To add comprehensive benchmarks for dynamic numeric faceting, we would also need a corpus that has "many numbers." Options include the Wikipedia line files (day_of_year, etc.), the NYC taxis corpus, or even the OpenStreetMap corpus (possible numeric fields). Random/synthetic datasets are discouraged because they are more likely to lead to random/synthetic conclusions.
Related work: GH#325 and GH#160 both add related datasets for benchmarks.
After looking into the Wikipedia line files, the NYC taxis corpus, and the OpenStreetMap dataset, I believe that the NYC taxis corpus would be a great dataset for benchmarking dynamic range faceting in luceneutil:
The NYC corpus has many meaningful numbers: total_amt, tip_amt, rate_code, vendor_id. In comparison, from what I could find, the Wikipedia line files have fewer meaningful numbers (in the context of dynamic range faceting) and instead focus on dates and word frequencies. Additionally, the OpenStreetMap dataset, from what I saw, is primarily a series of latitude and longitude values, which carry much less meaning when one is presented without the other (as would likely happen in dynamic range faceting, which ranges over a single field). Working with data that is easier to reason about would likely make any benchmark easier to understand and contribute to.