Index Estimators support For Cost Base evaluation #128

YANG-DB · 2021-09-29T09:42:30Z

YANG-DB
Sep 29, 2021
Maintainer

Index Estimators For cost based execution plan

This is a suggestion to enhance the capabilities of the underlying cost based analysis engine to support the usage of sketch estimators.

We would like to be able to use fast and accurate estimators that are pre-calculated in advanced for every index so that we can better (and faster) estimate the cost of the execution path based on a given query.

Currently the lucene file structure does not contain a specific location for the sketch estimators - the first thing that is needed will be adding a '/stats' folder to each index structure that will hold all the materialized sketch files.

The sketch estimators:

We can use https://datasketches.apache.org/ opensource sketch library, this will allow us to take advantage of the following algorithms:

Distinct Counting
- HLL
- Theta
- Theta Tuple
Most Frequent
- Frequent Items
- Frequent distinct (tuple) Items
Quantiles And Histograms
- KLL

Use cases

Given a query we would like to estimate fast and reliable each step in the query and rearrange the execution plan accordingly .

Example

Lets consider the following query:

match  (p1:person)-[k:know]-(p2:person)
where p1.age between (25,45] AND p2.age < 55 AND
    p2.birthLocation near NYC AND k.since > 1/1/2000
return distinct(p1), k , p2

The query will be translated into a series of steps, each step will be translated to the specific query DSL and pushed down into the underlying search engine.

This query requested finding people over 25 that know people under 55 and their connection period started after 1/1/2000
The following indices are relevant to this query :

person
knows

Person index has the following fields:

name (keyword)
age (int)
birth date (date)
birth location (keyword)

Steps estimation
- person age between (25,45]
  - Frequency Histogram: estimated number of original values in (25,45] bin
  - We need to union the Theta to estimate cardinality of people
- person age lower than 55 AND p2.birthLocation near NYC
  - Frequency Histogram: estimated number of original values in (55,inf] bin
  - Tuple Sketches for summarizing birth location attributes as int enumeration
  - We need to union the sketches to estimate join cardinality of people in the filter
- knows connection since 1/1/2000
  - Tuple Sketches for estimating direction

Structure

Since each index is composed of shards and every shard is an in-depended lucene index, we will add a stats folder which is located in the metadata level of the shard and will store materialized statistics regarding the estimation of the different fields in each shard.

├── [ 102]  _state
│   └── [  81]  state-0.st
├── [ 170]  index
│   ├── [  36]  segments.gen
│   ├── [  79]  segments_1
│   └── [   0]  write.lock
└── [ 102]  translog
    └── [  17]  translog-1429697028120
└── [ 102]  sketchs
    └── [  17]  estimator-1
    └── [  17]  estimator-2
    └── [  17]  estimator-3

Taking into consideration that many of the sketched are Set Operations capable (union / intersection / difference) we can use this capability to consolidate estimation information arriving form different shards into once combined estimation.

Sketch Definitions

Once an index mapping is being created (template) the mapping format must be enhanced to support the sketch estimators definition

Sketch Queries

A Sketch API must also be also formulated to support queries of the different sketches

####Sketch accuracy
Sketches are only estimators with a pre-defined error, its accuracy is usually measured in terms of Relative Error (RE = Measured/Truth -1). Sketches are stochastic processes and the estimates produced are random variables that have a probability distribution that is close to the familiar Gaussian

####Sketch implementation issues

Sketches are basically cached in every shard memory space and materialized occasionally to file (could be in sync with the refresh rate)
Sketches are not intended to be calculated on the fly and therefor are currently not intended to be used by the aggregation API
Sketches are build on the fly during the indexing of the documents, this can cause some ingestion rate degradation and must be benchmarked for proper integration

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Index Estimators support For Cost Base evaluation #128

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Index Estimators support For Cost Base evaluation #128

YANG-DB Sep 29, 2021 Maintainer

Index Estimators For cost based execution plan

The sketch estimators:

Use cases

Example

Structure

Sketch Definitions

Sketch Queries

Replies: 0 comments

YANG-DB
Sep 29, 2021
Maintainer