You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is a suggestion to enhance the capabilities of the underlying cost based analysis engine to support the usage of sketch estimators.
We would like to be able to use fast and accurate estimators that are pre-calculated in advanced for every index so that we can better (and faster) estimate the cost of the execution path based on a given query.
Currently the lucene file structure does not contain a specific location for the sketch estimators - the first thing that is needed will be adding a '/stats' folder to each index structure that will hold all the materialized sketch files.
The sketch estimators:
We can use https://datasketches.apache.org/ opensource sketch library, this will allow us to take advantage of the following algorithms:
Distinct Counting
HLL
Theta
Theta Tuple
Most Frequent
Frequent Items
Frequent distinct (tuple) Items
Quantiles And Histograms
KLL
Use cases
Given a query we would like to estimate fast and reliable each step in the query and rearrange the execution plan accordingly .
Example
Lets consider the following query:
match (p1:person)-[k:know]-(p2:person)
where p1.age between (25,45] AND p2.age < 55 AND
p2.birthLocation near NYC AND k.since > 1/1/2000
return distinct(p1), k , p2
The query will be translated into a series of steps, each step will be translated to the specific query DSL and pushed down into the underlying search engine.
This query requested finding people over 25 that know people under 55 and their connection period started after 1/1/2000
The following indices are relevant to this query :
person
knows
Person index has the following fields:
name (keyword)
age (int)
birth date (date)
birth location (keyword)
Steps estimation
person age between (25,45]
Frequency Histogram: estimated number of original values in (25,45] bin
We need to union the Theta to estimate cardinality of people
person age lower than 55 AND p2.birthLocation near NYC
Frequency Histogram: estimated number of original values in (55,inf] bin
Tuple Sketches for summarizing birth location attributes as int enumeration
We need to union the sketches to estimate join cardinality of people in the filter
knows connection since 1/1/2000
Tuple Sketches for estimating direction
Structure
Since each index is composed of shards and every shard is an in-depended lucene index, we will add a stats folder which is located in the metadata level of the shard and will store materialized statistics regarding the estimation of the different fields in each shard.
Taking into consideration that many of the sketched are Set Operations capable (union / intersection / difference) we can use this capability to consolidate estimation information arriving form different shards into once combined estimation.
Sketch Definitions
Once an index mapping is being created (template) the mapping format must be enhanced to support the sketch estimators definition
Sketch Queries
A Sketch API must also be also formulated to support queries of the different sketches
####Sketch accuracy
Sketches are only estimators with a pre-defined error, its accuracy is usually measured in terms of Relative Error (RE = Measured/Truth -1). Sketches are stochastic processes and the estimates produced are random variables that have a probability distribution that is close to the familiar Gaussian
####Sketch implementation issues
Sketches are basically cached in every shard memory space and materialized occasionally to file (could be in sync with the refresh rate)
Sketches are not intended to be calculated on the fly and therefor are currently not intended to be used by the aggregation API
Sketches are build on the fly during the indexing of the documents, this can cause some ingestion rate degradation and must be benchmarked for proper integration
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
-
Index Estimators For cost based execution plan
This is a suggestion to enhance the capabilities of the underlying cost based analysis engine to support the usage of sketch estimators.
We would like to be able to use fast and accurate estimators that are pre-calculated in advanced for every index so that we can better (and faster) estimate the cost of the execution path based on a given query.
Currently the lucene file structure does not contain a specific location for the sketch estimators - the first thing that is needed will be adding a '/stats' folder to each index structure that will hold all the materialized sketch files.
The sketch estimators:
We can use https://datasketches.apache.org/ opensource sketch library, this will allow us to take advantage of the following algorithms:
Distinct Counting
Most Frequent
Quantiles And Histograms
Use cases
Given a query we would like to estimate fast and reliable each step in the query and rearrange the execution plan accordingly .
Example
Lets consider the following query:
The query will be translated into a series of steps, each step will be translated to the specific query DSL and pushed down into the underlying search engine.
This query requested finding people over 25 that know people under 55 and their connection period started after 1/1/2000
The following indices are relevant to this query :
Person index has the following fields:
Steps estimation
person age between (25,45]
person age lower than 55 AND p2.birthLocation near NYC
knows connection since 1/1/2000
Structure
Since each index is composed of shards and every shard is an in-depended lucene index, we will add a stats folder which is located in the metadata level of the shard and will store materialized statistics regarding the estimation of the different fields in each shard.
Taking into consideration that many of the sketched are Set Operations capable (union / intersection / difference) we can use this capability to consolidate estimation information arriving form different shards into once combined estimation.
Sketch Definitions
Once an index mapping is being created (template) the mapping format must be enhanced to support the sketch estimators definition
Sketch Queries
A Sketch API must also be also formulated to support queries of the different sketches
####Sketch accuracy
Sketches are only estimators with a pre-defined error, its accuracy is usually measured in terms of Relative Error (RE = Measured/Truth -1). Sketches are stochastic processes and the estimates produced are random variables that have a probability distribution that is close to the familiar Gaussian
####Sketch implementation issues
Beta Was this translation helpful? Give feedback.
All reactions