Support elasticsearch field capabilities endpoint #3527

Closed · 6 tasks done · Tracked by #2653
fmassot opened this issue Jun 7, 2023 · 6 comments

@fmassot (Collaborator) commented Jun 7, 2023

Elasticsearch doc ref

Like we do for search, we want a gRPC endpoint that provides JSON information that makes sense in Quickwit (using Quickwit lingo). The gRPC endpoint does NOT offer a field filter: it returns the data for everything.
The Elasticsearch endpoint will then be a translation of the field capabilities protobuf into the Elasticsearch response.

Implementation-wise, we want to store all of the necessary information in an extra file of the bundle (probably in protobuf format, compressed). If that is judged not efficient or well-compressed enough, SSTable -> protobuf is also a possibility.
The packager should be in charge of populating that data.

The field capabilities endpoint would then fetch the data for all of the splits and merge it, as in the sketch below.
Finally, we want to cache this data locally, a bit like we cache the hotcache today.

We don't need to bother reducing the cost of the merge for the moment.
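
A minimal sketch of what that fetch-and-merge step could look like. The FieldCaps flags and the merge rule (OR-ing the flags per field across splits) are assumptions for illustration, not the actual Quickwit types:

use std::collections::HashMap;

/// Hypothetical per-field capability flags, as a split could record them.
#[derive(Clone, Copy, Default)]
struct FieldCaps {
    searchable: bool,   // the field is indexed
    aggregatable: bool, // the field is a fast field
}

/// Hypothetical field list extracted from one split's bundle.
type SplitFieldCaps = HashMap<String, FieldCaps>;

/// Merge the field lists of all splits: a field is searchable or
/// aggregatable if it is so in at least one split.
fn merge_field_caps(splits: &[SplitFieldCaps]) -> SplitFieldCaps {
    let mut merged = SplitFieldCaps::new();
    for split in splits {
        for (field, caps) in split {
            let entry = merged.entry(field.clone()).or_default();
            entry.searchable |= caps.searchable;
            entry.aggregatable |= caps.aggregatable;
        }
    }
    merged
}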

  • spec the data that should be added to the bundle
  • implement the logic to build it in the packager
  • implement the logic to merge the field capability data
  • implement the gRPC endpoint
  • implement the elasticsearch REST endpoint facade
  • add caching at the storage level for the field capability data
@PSeitz (Contributor) commented Sep 13, 2023

The field capabilities API returns information for all indices.
What information does the UI require?

In a scenario with a lot of splits but few fields, the merging could become relatively expensive. (We need some benchmarks or more info about the merge logic.)
In that case, the caching could store some pre-aggregate, which would be invalidated whenever a split covered by the aggregate gets removed.

@fulmicoton (Collaborator) commented

> The field capabilities API returns information for all indices.

Not the GET /<target>/_field_caps?fields=<fields> endpoint.

Let's maybe skip the pre-aggregate for the moment and do something simple.
If you use an sstable, merging should be fast enough.

E.g.

  • We can either precompute some datastruct in the packaging phase OR, if the payload is too large, just rely on the leaf cache (currently called the partial result cache).
  • Only consider the last N splits (N=100), and abort merging and return results right away if it takes too long (see the sketch after this list).
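
A hedged sketch of that cut-off, reusing the hypothetical SplitFieldCaps/FieldCaps types from the merge sketch above; N and the time budget are illustrative values, not decided ones:

use std::time::{Duration, Instant};

/// Merge at most the `max_splits` most recent splits, aborting early and
/// returning the partial result once the time budget is exhausted.
fn merge_with_budget(
    splits_newest_first: &[SplitFieldCaps],
    max_splits: usize, // e.g. N = 100
    budget: Duration,  // illustrative, e.g. a few tens of milliseconds
) -> SplitFieldCaps {
    let start = Instant::now();
    let mut merged = SplitFieldCaps::new();
    for split in splits_newest_first.iter().take(max_splits) {
        if start.elapsed() > budget {
            break; // return whatever has been merged so far
        }
        for (field, caps) in split {
            let entry = merged.entry(field.clone()).or_default();
            entry.searchable |= caps.searchable;
            entry.aggregatable |= caps.aggregatable;
        }
    }
    merged
}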

To be consistent with the rest of Quickwit, we should use the capability as defined in the current doc mapper (not the one of the split).
So if the dynamic field named "age" was not a fast field in split 0 but is a fast field in the current doc mapper, it should be considered a fast field, as in the sketch below.
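
A small sketch of that rule, again with the hypothetical FieldCaps type: whatever the current doc mapper declares for a field wins over what the split recorded at build time.

/// Resolve the effective capability of a field: the current doc mapper's
/// definition takes precedence over what the split recorded when it was
/// built (hypothetical types, for illustration).
fn effective_caps(split_caps: FieldCaps, doc_mapper_caps: Option<FieldCaps>) -> FieldCaps {
    // e.g. "age" was not a fast field in split 0, but the current doc
    // mapper declares it fast => report it as aggregatable.
    doc_mapper_caps.unwrap_or(split_caps)
}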

@PSeitz (Contributor) commented Sep 22, 2023

ES Requirements

To be able to map to ES https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-field-caps.html#search-field-caps-api-response-body we need the following information (a type-mapping sketch follows the list):

  • Name of the field
  • Field type information to map to an ES equivalent
    • Is date type with precision lower than ns equivalent to ES date or date_nanos? (Probably always date_nanos)
    • ES text vs keyword: does text correspond to any non-raw tokenizer?
  • Is field searchable (is indexed)
  • Is field aggregatable (is fast field)

Bundledata

The data added to the split bundle could look like this.
It does not currently include field-specific information such as the tokenizer and the date-time precision.

struct SplitFields {
    /// Path to the field.
    /// May contain duplicates for different ColumnTypes.
    field_names: Vec<String>, // or SSTable
    /// Parallel to field_names: `(field_name, ord) -> configs[config_ords[ord]]`
    config_ords: Vec<u16>, // probably bitpacked
    configs: Vec<FieldConfig>,
}

struct FieldConfig {
    stored: bool,
    fast: bool,
    column_type: ColumnType, // `type` is a reserved keyword in Rust
    indexed: bool,
}
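
To make the indirection concrete, a hypothetical accessor over the struct above: the i-th field name resolves its config through config_ords, so identical FieldConfig values are stored only once.

impl SplitFields {
    /// Resolve the i-th field: `config_ords[ord]` points into `configs`,
    /// so fields sharing the same configuration share one entry.
    fn field(&self, ord: usize) -> Option<(&str, &FieldConfig)> {
        let name = self.field_names.get(ord)?;
        let config_ord = *self.config_ords.get(ord)? as usize;
        let config = self.configs.get(config_ord)?;
        Some((name.as_str(), config))
    }
}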

Serialized Format

TODO (protobuf?)

Questions

Is this planned to be included in the Hotcache? Should we have a size limit, since there could be a lot of fields?

Tantivy

Fast fields and indexed (searchable) fields are handled quite differently in tantivy.

Fast fields

All field names (JSON and regular mixed) in SSTable -> DynamicColumn

Indexed fields

Regular Field (e.g. u64, i64, text) -> InvertedIndex -> TermDict -> [VALUE]
JSON Field -> InvertedIndex -> TermDict -> [JSON_PATH][JSON_PATH_END][VALUETYPE][VALUE]

Extracting all field names for the indexed fields requires a full scan (a packager step). We may consider adding an optional list of the fields next to the TermDict in tantivy for low-cardinality field cases. A scan implemented in the sstable would allow efficient skipping when collecting the fields (not implemented yet). A sketch of such a scan follows.
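
A hedged sketch of extracting the field name from a JSON term dict key during such a scan, assuming a single 0x00 byte as the end-of-path marker; the actual tantivy encoding may differ:

use std::collections::BTreeSet;

/// Extract the JSON path prefix of one term dict key, assuming the layout
/// [JSON_PATH][JSON_PATH_END][VALUETYPE][VALUE] with 0x00 marking the end
/// of the path (an assumption for illustration).
fn json_path_of_term(term_bytes: &[u8]) -> Option<&str> {
    const JSON_PATH_END: u8 = 0u8;
    let end = term_bytes.iter().position(|&b| b == JSON_PATH_END)?;
    std::str::from_utf8(&term_bytes[..end]).ok()
}

/// Full scan over the term dict keys, collecting distinct JSON paths.
/// With sstable skipping, a real implementation could seek past all the
/// terms sharing the current path instead of visiting every single one.
fn collect_json_paths<'a>(terms: impl Iterator<Item = &'a [u8]>) -> BTreeSet<String> {
    let mut paths = BTreeSet::new();
    for term in terms {
        if let Some(path) = json_path_of_term(term) {
            paths.insert(path.to_string());
        }
    }
    paths
}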

@PSeitz (Contributor) commented Nov 29, 2023

Differences

Multiple Types

Different types on one field name in one index are not possible in Elasticsearch. The multi-mapping feature also assigns a new field name, e.g. myfield and myfield.keyword.

The field capabilities endpoint can still return multiple types for one field name, since different indices may assign different types to the same field name.
In contrast, in Quickwit and tantivy one field name in an index can have multiple types.

Field Metadata Example

{
  "indices": [ "index1", "index2", "index3", "index4", "index5" ],
  "fields": {
    "rating": {                                   
      "long": {
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false,
        "indices": [ "index1", "index2" ],
        "non_aggregatable_indices": [ "index1" ]  
      },
      "keyword": {
        "metadata_field": false,
        "searchable": false,
        "aggregatable": true,
        "indices": [ "index3", "index4" ],
        "non_searchable_indices": [ "index4" ]    
      }
    },
    "title": {                                    
      "text": {
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    }
  }
}

@PSeitz (Contributor) commented Dec 11, 2023

Caching of field list

Just writing down some considerations about the current state and the possibilities.
Currently, a file containing the list of fields (zstd-compressed) is created on the split.

Hotcache Variant

  • Add the current list-fields file of the split to the Hotcache.

    • Pro: Easy.
    • Con: The Hotcache is over tantivy files, so the list-fields file would become transient, only living in the Hotcache. Weird special case.
  • Move the list-fields file from the split into tantivy, then add it to the Hotcache.

    • Pro: Faster indexing and merging, since the fields can be collected during indexing and do not require a full scan of the inverted index afterwards. The merge code would not require a full scan either. (Missing data point: how fast is a dict full scan?) Tantivy would know the fields and could use them in queries.
    • Con: Slightly more complex, since it has to cover indexing and merging.

New cache for list fields

  • Pro: Good for the high-cardinality case, where the Hotcache could become relatively big, e.g. 100_000 fields at ~10 bytes each zstd-compressed => 1MB. It could also handle some aggregation caching.
  • Con: Since the fields are not in the Hotcache, populating the cache would require an additional request per split after startup.

Alternatively, it could be a mix of a new cache and the Hotcache.

@PSeitz (Contributor) commented Jan 19, 2024

Closing in favor of #4298

PSeitz closed this as completed Jan 19, 2024