Support elasticsearch field capabilities endpoint #3527
Comments
The field capabilities API returns information for all indices. In a scenario with a lot of splits but few fields, the merging could become relatively expensive. (Need some benchmarks or more info about the merge logic.)
Let's skip the pre-aggregation for the moment, maybe, and do something simple.
To be consistent with the rest of Quickwit, we should use the capability as defined in the current docmapper (not the one of the split).
ES Requirements
To be able to map to the ES response body (https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-field-caps.html#search-field-caps-api-response-body), we need the following information:
Bundle data
The data added to the split bundle could look like this.
struct SplitFields {
    /// Path to the field.
    /// May contain duplicates for different ColumnTypes.
    field_names: Vec<String>, // or SSTable
    /// Parallel to field_names: `(field_name, ord) -> configs[config_ords[ord]]`
    config_ords: Vec<u16>, // probably bitpacked
    configs: Vec<FieldConfig>,
}
// `type` is a reserved keyword in Rust, hence `column_type`.
struct FieldConfig { stored: bool, fast: bool, column_type: ColumnType, indexed: bool }
Serialized Format
TODO (protobuf?)
Questions
Is this planned to be included in the Hotcache? Should we have a size limit, since there could be a lot of fields?
Tantivy
Fast fields and indexed (searchable) fields are handled quite differently in tantivy.
Fast fields
All field names (JSON and regular, mixed) in SSTable -> DynamicColumn
Indexed fields
Regular
Extracting all field names for the indexed fields requires a full scan (packager step). We may consider adding an optional list of the fields next to the TermDict in tantivy for low-cardinality field cases. A scan implemented in the sstable would allow efficient skipping when collecting the fields (not implemented yet).
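The `SplitFields` layout proposed above stores one config ordinal per field name and deduplicates the configs themselves. A minimal sketch of how that lookup could work follows. The `ColumnType` enum here is a hypothetical stand-in with only a few variants, `column_type` replaces `type` (a reserved keyword in Rust), and the `push`, `configs_for` and `num_distinct_configs` helpers are illustrative names, not part of Quickwit or tantivy.

```rust
/// Hypothetical stand-in for the real column type enum (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    I64,
    F64,
    Str,
    Bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FieldConfig {
    pub stored: bool,
    pub fast: bool,
    pub column_type: ColumnType,
    pub indexed: bool,
}

#[derive(Debug, Default)]
pub struct SplitFields {
    /// May contain the same name several times, once per column type.
    field_names: Vec<String>,
    /// `config_ords[i]` is the index into `configs` for `field_names[i]`.
    config_ords: Vec<u16>,
    /// Deduplicated configs shared by all fields.
    configs: Vec<FieldConfig>,
}

impl SplitFields {
    /// Appends a field, reusing an existing config slot when an identical one exists.
    pub fn push(&mut self, field_name: &str, config: FieldConfig) {
        let pos = match self.configs.iter().position(|c| *c == config) {
            Some(pos) => pos,
            None => {
                self.configs.push(config);
                self.configs.len() - 1
            }
        };
        let ord = u16::try_from(pos).expect("more than u16::MAX distinct configs");
        self.field_names.push(field_name.to_string());
        self.config_ords.push(ord);
    }

    /// Returns every config recorded for `field_name`, in insertion order.
    pub fn configs_for(&self, field_name: &str) -> Vec<FieldConfig> {
        self.field_names
            .iter()
            .zip(&self.config_ords)
            .filter(|(name, _)| name.as_str() == field_name)
            .map(|(_, &ord)| self.configs[usize::from(ord)])
            .collect()
    }

    /// Number of distinct configs actually stored.
    pub fn num_distinct_configs(&self) -> usize {
        self.configs.len()
    }
}
```

With few distinct configs and many field names, the per-field cost is one small ordinal, which is what makes bitpacking `config_ords` attractive.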
Differences
Multiple Types
Different types on one field name within one index are not possible in Elasticsearch. The multi mapping feature also assigns a new field name. The field capabilities endpoint still has multiple types assigned to one field name, since different indices may have different types assigned to one field name.
Field Metadata Example
{
"indices": [ "index1", "index2", "index3", "index4", "index5" ],
"fields": {
"rating": {
"long": {
"metadata_field": false,
"searchable": true,
"aggregatable": false,
"indices": [ "index1", "index2" ],
"non_aggregatable_indices": [ "index1" ]
},
"keyword": {
"metadata_field": false,
"searchable": false,
"aggregatable": true,
"indices": [ "index3", "index4" ],
"non_searchable_indices": [ "index4" ]
}
},
"title": {
"text": {
"metadata_field": false,
"searchable": true,
"aggregatable": false
}
}
}
}
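The example response above suggests how a single `(field, type)` entry is derived from per-index capabilities: the top-level flag is true only when every index agrees, and the `non_*_indices` lists name the exceptions when indices disagree. The sketch below encodes that reading. It is an interpretation of the example, not a transcription of the Elasticsearch implementation, and `IndexCaps`, `FieldTypeEntry` and `build_entry` are hypothetical names.

```rust
/// Capabilities of one field/type pair in one index (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct IndexCaps {
    pub index: String,
    pub searchable: bool,
    pub aggregatable: bool,
}

/// One entry such as `fields.rating.long` in the response (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct FieldTypeEntry {
    pub searchable: bool,
    pub aggregatable: bool,
    pub indices: Vec<String>,
    /// Empty when all indices agree; an empty list would be omitted from the JSON.
    pub non_searchable_indices: Vec<String>,
    pub non_aggregatable_indices: Vec<String>,
}

/// Lists the indices where `flag` is false, but only when the indices disagree.
fn exceptions(per_index: &[IndexCaps], flag: impl Fn(&IndexCaps) -> bool) -> Vec<String> {
    let num_true = per_index.iter().filter(|c| flag(*c)).count();
    if num_true == 0 || num_true == per_index.len() {
        return Vec::new();
    }
    per_index
        .iter()
        .filter(|c| !flag(*c))
        .map(|c| c.index.clone())
        .collect()
}

pub fn build_entry(per_index: &[IndexCaps]) -> FieldTypeEntry {
    FieldTypeEntry {
        // A capability holds for the entry only if it holds in every index.
        searchable: per_index.iter().all(|c| c.searchable),
        aggregatable: per_index.iter().all(|c| c.aggregatable),
        indices: per_index.iter().map(|c| c.index.clone()).collect(),
        non_searchable_indices: exceptions(per_index, |c| c.searchable),
        non_aggregatable_indices: exceptions(per_index, |c| c.aggregatable),
    }
}
```

Feeding it `index1` (searchable, not aggregatable) and `index2` (searchable and aggregatable) reproduces the `rating.long` entry of the example: `aggregatable` is false and `non_aggregatable_indices` contains only `index1`.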
Caching of field list
Just writing down some considerations about the current state and possibilities.
Hotcache Variant
New cache for list fields
Alternatively, it could be a mix of a new cache and the Hotcache.
Closing in favor of #4298
Elasticsearch doc ref
Like we do for search, we want to have a gRPC endpoint that provides JSON information that makes sense in Quickwit (using Quickwit lingua). The gRPC endpoint does NOT offer a field filter. It returns the data for everything.
The Elasticsearch endpoint will then be a transcription of the field capabilities protobuf into the Elasticsearch response.
Implementation-wise, we want all of the necessary information in an extra file of the bundle (probably in protobuf format + compressed). If that is judged not efficient and well compressed enough, SSTable -> Protobuf is also a possibility.
The packager should be in charge of populating that data.
The field capabilities endpoint would then fetch the data for all of the splits and merge it.
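A minimal sketch of the fetch-and-merge step described above, assuming each split yields a flat list of field entries. The merge rule used here (a field/type pair is searchable or aggregatable if it is so in at least one split) is an assumption made for illustration; the issue does not specify it. `FieldEntry`, `ColumnType` and `merge_splits` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the real column type enum (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ColumnType {
    I64,
    Str,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FieldEntry {
    pub field_name: String,
    pub column_type: ColumnType,
    pub searchable: bool,
    pub aggregatable: bool,
}

/// Merges the field lists of several splits into one sorted, deduplicated list.
/// Entries are keyed by (field name, column type), so one name may appear
/// several times with different types.
pub fn merge_splits(splits: &[Vec<FieldEntry>]) -> Vec<FieldEntry> {
    let mut merged: BTreeMap<(String, ColumnType), (bool, bool)> = BTreeMap::new();
    for entry in splits.iter().flatten() {
        let flags = merged
            .entry((entry.field_name.clone(), entry.column_type))
            .or_insert((false, false));
        // Assumed rule: OR the capabilities across splits.
        flags.0 |= entry.searchable;
        flags.1 |= entry.aggregatable;
    }
    merged
        .into_iter()
        .map(|((field_name, column_type), (searchable, aggregatable))| FieldEntry {
            field_name,
            column_type,
            searchable,
            aggregatable,
        })
        .collect()
}
```

The cost is linear in the total number of entries across splits, plus the map lookups, which matches the concern that many splits with few fields repeat the same work.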
Finally, we want to cache this data locally, a bit like we cache the hotcache today.
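One possible shape for such a local cache is an in-memory map keyed by split id, sketched below. It assumes the per-split field data never changes once the split is published, so entries need no invalidation. `FieldListCache` and `get_or_load` are hypothetical names, and the cached value is simplified to a list of field names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// In-memory cache of per-split field lists, keyed by split id (hypothetical type).
pub struct FieldListCache {
    entries: Mutex<HashMap<String, Arc<Vec<String>>>>,
}

impl FieldListCache {
    pub fn new() -> Self {
        FieldListCache {
            entries: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the cached field list for `split_id`, calling `load` only on a miss.
    /// For simplicity the lock is held while loading; a real implementation
    /// would likely avoid blocking other readers during the fetch.
    pub fn get_or_load<F>(&self, split_id: &str, load: F) -> Arc<Vec<String>>
    where
        F: FnOnce() -> Vec<String>,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(split_id) {
            return Arc::clone(hit);
        }
        let loaded = Arc::new(load());
        entries.insert(split_id.to_string(), Arc::clone(&loaded));
        loaded
    }
}
```

Returning an `Arc` lets several concurrent requests share one copy of the field list without cloning it.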
We don't need to bother reducing the cost of merge for the moment.