-
Notifications
You must be signed in to change notification settings - Fork 427
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Add multi indices search to elasticsearch + quickwit search endpoints. #3734
Conversation
0dd67c2
to
f18e542
Compare
96d79cf
to
4ea81ea
Compare
@fmassot is it still draft ? |
quickwit/quickwit-metastore/src/metastore/file_backed_metastore/mod.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-metastore/src/metastore/file_backed_metastore/mod.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-metastore/src/metastore/postgresql_metastore.rs
Outdated
Show resolved
Hide resolved
// If no splits were returned, maybe the index does not exist in the first place? | ||
if pg_splits.is_empty() | ||
&& index_opt_for_uid(&self.connection_pool, query.index_uid.clone()) | ||
// If no splits were returned, maybe some indexes do not exist in the first place? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this check is a pain in the butt to implement, especially in the multi-index case, but there are some places (REST API, Janitor) where it could prove helpful. @fulmicoton, thoughts on dropping it?
quickwit/quickwit-metastore/src/metastore/postgresql_metastore.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-metastore/src/metastore/instrumented_metastore.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-metastore/src/metastore/file_backed_metastore/mod.rs
Outdated
Show resolved
Hide resolved
// We convert the error to return a 400 to the user (and not a 500). | ||
.map_err(|err| SearchError::InvalidQuery(err.to_string()))?; | ||
|
||
// Validate uniqueness of resolved query AST. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had three possible specs here:
a) pick the query_ast_resolved
of a given index.. Maybe the freshest one, or the first defined in the list.
b) use a different query_ast_resolved
for each indexed.
c) return an error when facing an inconsistency (that's what you picked)
Obviously this is a trade-off, but I strongly prefer b).
I don't think there is a strong case to be defensive here. This is actually an excellent use case for the default field.
I have several indexes for a security use case. One containing logs, one containing transactions, etc. I would like to search into all of them to gather all docs containing a given reference number. The default_fields in the different index makes the search experience possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, I think we should merge this BUT I think we should have gone with one query ast resolved per index.
This also had some wrangling on the models of the LeafRequest.
A dead simple implementation could have been something:
phase1: keep the leaf request unchanged (as in like it is in main) and do emit several gRPC
to each node, apply the existing merging logic on the root side.
phase2: add a batch leaf_request GRPC endpoint, that computes the different leaf request and pre-merge their results before sending the result back to the root.
If you agree, can we open a ticket to revisit this?
8354b2b
to
47f12e7
Compare
47f12e7
to
afd797c
Compare
ListIndexesQuery::All => build_regex_set_from_patterns(vec!["*".to_string()]), | ||
}; | ||
let index_matcher = | ||
index_matcher_result.map_err(|error| MetastoreError::InternalError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I worked on retries yesterday, I noticed we retry on internal errors. However, we often use internal errors for invalid argument errors that shouldn't be retried. We should introduce a MetastoreError::InvalidArgument
that is not retryable. I can take care of that as part of #3752 where I've already started refactoring metastore errors.
indexes_metadatas: &[IndexMetadata], | ||
search_request: &SearchRequest, | ||
) -> crate::Result<(TimestampFieldOpt, QueryAst, IndexesMetasForLeafSearch)> { | ||
let mut metadatas_for_leaf: HashMap<IndexUid, IndexMetasForLeafSearch> = HashMap::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let mut metadatas_for_leaf: HashMap<IndexUid, IndexMetasForLeafSearch> = HashMap::new(); | |
let mut metadatas_for_leaf: HashMap<IndexUid, IndexMetasForLeafSearch> = HashMap::with_capacity( indexes_metadatas.len()); |
Fix #2770.
This is a partial implementation of the multi-index syntax of elasticsearch, which is defined here: only comma-separated and wildcard syntax are supported in this PR.
Examples:
/index-*/search
/index-1,index-2/search
/_elastic/*/_search
Limitations
NB: those limitations can be lifted, but it adds complexity due to managing multiple
QueryAST
,TagFilterAst
, andtimestamp_field
.My opinion on this is that the limitations are reasonable from the user's point of view.
Queries on multiple indexes that lead to different
QueryAst
per index are not supported.Example:
field1
andfield2
, with the default field onfield1
.field1
andfield2
, with the default field onfield2
.A query like
test
will be resolved as:field1:test
on index 1.field2:test
on index 2.Queries on multiple indexes with different timestamp fields per index are not supported. Note that if one index has no timestamp field, but the others have, this does not raise any issue.
Supporting different timestamp fields will require to:
refine_start_end_timestamp_from_ast
for each timestamp field of each index and extract the start/end timestamp from the queryAlso, it can lead to side effects in query performance on corner cases (start and end timestamps can be very narrowed on all indexes but one, the last one has no refinements on start/end timestamps, and we end up with no refinement. So adding just one index to the query will lead to a very different performance).
I implemented a version of it, but I decided not to add it to this PR. I would rather add it if users need this.
Follow-up in next PRs
IndexesDoNotExistError
. In the PostgreSQL metastore, we only check the existence of index IDs if no splits are returned. I suggest not checking for index existence in thelist_split
method and fixing that in a follow-up PR.TODO
list_all_indexes_metadatas
as it is redundant withlist_indexes_metadatas