Support elasticsearch field capabilities endpoint #3527

Closed · 6 tasks done · Tracked by #2653
fmassot opened this issue Jun 7, 2023 · 6 comments

@fmassot (Collaborator) commented Jun 7, 2023

Elasticsearch doc ref

Like we do for search, we want a gRPC endpoint that provides JSON information that makes sense in Quickwit (using Quickwit lingo). The gRPC endpoint does NOT offer a field filter: it returns the data for everything.
The Elasticsearch endpoint will then be a translation of the field capabilities protobuf into the Elasticsearch response.

Implementation-wise, we want to store all of the necessary information in an extra file of the bundle (probably in protobuf format, compressed). If that is judged not efficient or well-compressed enough, SSTable -> protobuf is also a possibility.
The packager should be in charge of populating that data.

The field capabilities endpoint would then fetch the data for all of the splits and merge it, as in the sketch below.
Finally, we want to cache this data locally, a bit like we cache the hotcache today.

We don't need to bother reducing the cost of the merge for the moment.
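
A minimal sketch of what that fetch-and-merge step could look like. The FieldCaps flags and the merge rule (OR-ing the flags per field across splits) are assumptions for illustration, not the actual Quickwit types:

use std::collections::HashMap;

/// Hypothetical per-field capability flags, as a split could record them.
#[derive(Clone, Copy, Default)]
struct FieldCaps {
    searchable: bool,   // the field is indexed
    aggregatable: bool, // the field is a fast field
}

/// Hypothetical field list extracted from one split's bundle.
type SplitFieldCaps = HashMap<String, FieldCaps>;

/// Merge the field lists of all splits: a field is searchable or
/// aggregatable if it is so in at least one split.
fn merge_field_caps(splits: &[SplitFieldCaps]) -> SplitFieldCaps {
    let mut merged = SplitFieldCaps::new();
    for split in splits {
        for (field, caps) in split {
            let entry = merged.entry(field.clone()).or_default();
            entry.searchable |= caps.searchable;
            entry.aggregatable |= caps.aggregatable;
        }
    }
    merged
}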

  • spec the data that should be added to the bundle
  • implement the logic to build it in the packager
  • implement the logic to merge the field capability data
  • implement the gRPC endpoint
  • implement the elasticsearch REST endpoint facade
  • add caching at the storage level for the field capability data
@PSeitz (Contributor) commented Sep 13, 2023

The field capabilities API returns information for all indices.
What information does the UI require?

In a scenario with a lot of splits but few fields, the merging could become relatively expensive. (We need some benchmarks or more info about the merge logic.)
In that case, the caching could store some pre-aggregate, which would be invalidated whenever a split covered by the aggregate gets removed.

@fulmicoton (Collaborator) commented

> The field capabilities API returns information for all indices.

Not the GET /<target>/_field_caps?fields=<fields> endpoint.

Let's maybe skip the pre-aggregate for the moment and do something simple.
If you use an sstable, merging should be fast enough.

E.g.

  • We can either precompute some datastruct in the packaging phase OR, if the payload is too large, just rely on the leaf cache (currently called the partial result cache).
  • Only consider the last N splits (N=100), and abort merging and return results right away if it takes too long (see the sketch after this list).
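
A hedged sketch of that cut-off, reusing the hypothetical SplitFieldCaps/FieldCaps types from the merge sketch above; N and the time budget are illustrative values, not decided ones:

use std::time::{Duration, Instant};

/// Merge at most the `max_splits` most recent splits, aborting early and
/// returning the partial result once the time budget is exhausted.
fn merge_with_budget(
    splits_newest_first: &[SplitFieldCaps],
    max_splits: usize, // e.g. N = 100
    budget: Duration,  // illustrative, e.g. a few tens of milliseconds
) -> SplitFieldCaps {
    let start = Instant::now();
    let mut merged = SplitFieldCaps::new();
    for split in splits_newest_first.iter().take(max_splits) {
        if start.elapsed() > budget {
            break; // return whatever has been merged so far
        }
        for (field, caps) in split {
            let entry = merged.entry(field.clone()).or_default();
            entry.searchable |= caps.searchable;
            entry.aggregatable |= caps.aggregatable;
        }
    }
    merged
}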

To be consistent with the rest of Quickwit, we should use the capability as defined in the current doc mapper (not the one of the split).
So if the dynamic field named "age" was not a fast field in split 0 but is a fast field in the current doc mapper, it should be considered a fast field, as in the sketch below.
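
A small sketch of that rule, again with the hypothetical FieldCaps type: whatever the current doc mapper declares for a field wins over what the split recorded at build time.

/// Resolve the effective capability of a field: the current doc mapper's
/// definition takes precedence over what the split recorded when it was
/// built (hypothetical types, for illustration).
fn effective_caps(split_caps: FieldCaps, doc_mapper_caps: Option<FieldCaps>) -> FieldCaps {
    // e.g. "age" was not a fast field in split 0, but the current doc
    // mapper declares it fast => report it as aggregatable.
    doc_mapper_caps.unwrap_or(split_caps)
}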

@PSeitz (Contributor) commented Sep 22, 2023

ES Requirements

To be able to map to ES https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-field-caps.html#search-field-caps-api-response-body we need the following information (a type-mapping sketch follows the list):

  • Name of the field
  • Field type information to map to an ES equivalent
    • Is date type with precision lower than ns equivalent to ES date or date_nanos? (Probably always date_nanos)
    • ES text vs keyword: does text correspond to any non-raw tokenizer?
  • Is field searchable (is indexed)
  • Is field aggregatable (is fast field)

Bundledata

The data added to the split bundle could look like this.
It does not currently include field-specific information such as the tokenizer and the date-time precision.

struct SplitFields {
    /// Path to the field.
    /// May contain duplicates for different ColumnTypes.
    field_names: Vec<String>, // or SSTable
    /// Parallel to field_names: `(field_name, ord) -> configs[config_ords[ord]]`
    config_ords: Vec<u16>, // probably bitpacked
    configs: Vec<FieldConfig>,
}

struct FieldConfig {
    stored: bool,
    fast: bool,
    column_type: ColumnType, // `type` is a reserved keyword in Rust
    indexed: bool,
}
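
To make the indirection concrete, a hypothetical accessor over the struct above: the i-th field name resolves its config through config_ords, so identical FieldConfig values are stored only once.

impl SplitFields {
    /// Resolve the i-th field: `config_ords[ord]` points into `configs`,
    /// so fields sharing the same configuration share one entry.
    fn field(&self, ord: usize) -> Option<(&str, &FieldConfig)> {
        let name = self.field_names.get(ord)?;
        let config_ord = *self.config_ords.get(ord)? as usize;
        let config = self.configs.get(config_ord)?;
        Some((name.as_str(), config))
    }
}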

Serialized Format

TODO (protobuf?)

Questions

Is this planned to be included in the Hotcache? Should we have a size limit, since there could be a lot of fields?

Tantivy

Fast fields and indexed (searchable) fields are handled quite differently in tantivy.

Fast fields

All field names (JSON and regular mixed) in SSTable -> DynamicColumn

Indexed fields

Regular Field (e.g. u64, i64, text) -> InvertedIndex -> TermDict -> [VALUE]
JSON Field -> InvertedIndex -> TermDict -> [JSON_PATH][JSON_PATH_END][VALUETYPE][VALUE]

Extracting all field names for the indexed fields requires a full scan (a packager step). We may consider adding an optional list of the fields next to the TermDict in tantivy for low-cardinality field cases. A scan implemented in the sstable would allow efficient skipping when collecting the fields (not implemented yet). A sketch of such a scan follows.
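
A hedged sketch of extracting the field name from a JSON term dict key during such a scan, assuming a single 0x00 byte as the end-of-path marker; the actual tantivy encoding may differ:

use std::collections::BTreeSet;

/// Extract the JSON path prefix of one term dict key, assuming the layout
/// [JSON_PATH][JSON_PATH_END][VALUETYPE][VALUE] with 0x00 marking the end
/// of the path (an assumption for illustration).
fn json_path_of_term(term_bytes: &[u8]) -> Option<&str> {
    const JSON_PATH_END: u8 = 0u8;
    let end = term_bytes.iter().position(|&b| b == JSON_PATH_END)?;
    std::str::from_utf8(&term_bytes[..end]).ok()
}

/// Full scan over the term dict keys, collecting distinct JSON paths.
/// With sstable skipping, a real implementation could seek past all the
/// terms sharing the current path instead of visiting every single one.
fn collect_json_paths<'a>(terms: impl Iterator<Item = &'a [u8]>) -> BTreeSet<String> {
    let mut paths = BTreeSet::new();
    for term in terms {
        if let Some(path) = json_path_of_term(term) {
            paths.insert(path.to_string());
        }
    }
    paths
}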

@PSeitz (Contributor) commented Nov 29, 2023

Differences

Multiple Types

Different types on one field name in one index are not possible in Elasticsearch. The multi-mapping feature also assigns a new field name, e.g. myfield and myfield.keyword.

The field capabilities endpoint can still return multiple types for one field name, since different indices may assign different types to the same field name.
In contrast, in Quickwit and tantivy one field name in an index can have multiple types.

Field Metadata Example

{
  "indices": [ "index1", "index2", "index3", "index4", "index5" ],
  "fields": {
    "rating": {                                   
      "long": {
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false,
        "indices": [ "index1", "index2" ],
        "non_aggregatable_indices": [ "index1" ]  
      },
      "keyword": {
        "metadata_field": false,
        "searchable": false,
        "aggregatable": true,
        "indices": [ "index3", "index4" ],
        "non_searchable_indices": [ "index4" ]    
      }
    },
    "title": {                                    
      "text": {
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    }
  }
}

@PSeitz (Contributor) commented Dec 11, 2023

Caching of field list

Just writing down some considerations about the current state and the possibilities.
Currently, a file containing the list of fields (zstd-compressed) is created on the split.

Hotcache Variant

  • Add the current list-fields file of the split to the Hotcache.

    • Pro: Easy.
    • Con: The Hotcache is over tantivy files, so the list-fields file would become transient, only living in the Hotcache. Weird special case.
  • Move the list-fields file from the split into tantivy, then add it to the Hotcache.

    • Pro: Faster indexing and merging, since the fields can be collected during indexing and do not require a full scan of the inverted index afterwards. The merge code would not require a full scan either. (Missing data point: how fast is a dict full scan?) Tantivy would know the fields and could use them in queries.
    • Con: Slightly more complex, since it has to cover indexing and merging.

New cache for list fields

  • Pro: Good for the high-cardinality case, where the Hotcache could become relatively big, e.g. 100_000 fields at ~10 bytes each zstd-compressed => 1MB. It could also handle some aggregation caching.
  • Con: Since the fields are not in the Hotcache, populating the cache would require an additional request per split after startup.

Alternatively, it could be a mix of a new cache and the Hotcache.

@PSeitz (Contributor) commented Jan 19, 2024

Closing in favor of #4298

PSeitz closed this as completed Jan 19, 2024