
Bulk API return document level 500 when new shards are being allocated #5565

Closed
esatterwhite opened this issue Nov 26, 2024 · 4 comments · Fixed by #5566
Labels
bug Something isn't working

Comments

@esatterwhite
Collaborator

Describe the bug
Under certain conditions when using the elasticsearch _bulk api, particularly during spikes in ingestion traffic or when an index is initially created, quickwit will reject documents with a document-level status of 500 (internal_exception) and the reason no shards available. This tends to indicate that something has gone wrong on the server and the document cannot be retried.

{
  status: 500
, error: {
    type: 'internal_exception'
  , reason: 'no shards available'
  }
}

This can cause problems when using existing elasticsearch client libraries. Many of them have logic implemented for handling retries and document-level errors from the bulk api. However, that retry/backoff handling generally only kicks in when the document-level status is a 429. This can be problematic for existing applications where the retry logic is leveraged. With the current quickwit behavior, documents will generally be dropped on the assumption that the error is terminal, when it's really a transient warmup problem.
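
For illustration, here is a minimal sketch (TypeScript, hypothetical names; the item shape mirrors the Elasticsearch bulk response format) of the per-document handling many clients implement, where only 429s are queued for retry and everything else is treated as terminal:

// Minimal sketch of typical client-side bulk error handling; names are
// illustrative and not taken from any specific Elasticsearch client.
interface BulkItemError { type: string; reason: string }
interface BulkItem { status: number; error?: BulkItemError }

function partitionForRetry(items: BulkItem[], docs: unknown[]) {
  const retry: unknown[] = [];
  const dropped: unknown[] = [];
  items.forEach((item, i) => {
    if (!item.error) return;                      // indexed successfully
    if (item.status === 429) retry.push(docs[i]); // back off, then resend
    else dropped.push(docs[i]);                   // e.g. 500: treated as terminal
  });
  return { retry, dropped };
}

With the current behavior, the "no shards available" 500s land in the dropped bucket even though resending after a short delay would likely succeed.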

Expected behavior
The bulk api document errors should be a 429 when there are no shards available. It may also be helpful to return an error code that is more indicative of the problem rather than an `internal_exception`.

{
  status: 429
, error: {
    type: 'no_shard_available_action_exception' // elasticsearch has this error code, but it may mean something else in that context.
  , reason: 'no shards available'
  }
}
@esatterwhite esatterwhite added the bug Something isn't working label Nov 26, 2024
@fulmicoton
Collaborator

One trouble is that we actually don't want you to retry right away in that case. Maybe we should set an informative retry_after header? (500ms maybe)

@esatterwhite
Collaborator Author

esatterwhite commented Nov 27, 2024

One trouble is that we actually don't want you to retry right away in that case. Maybe we should set an informative retry_after header? (500ms maybe)

ES clients don't always expose response headers, but it's not a bad idea. Maybe put it in the metadata returned in the response body, as well as the header.

Retry-After is the standard header for rate limiting (note its value is whole seconds or an HTTP date, so a sub-second hint like 500 ms would need another mechanism).
We do have delays on the retries, though.

@esatterwhite
Collaborator Author

I will try to trace down how the error metadata is constructed to see if it can be influenced.

@esatterwhite
Collaborator Author

Yeah, I don't see anything that relays response headers through the elasticsearch client libs. That's not to say it isn't a good idea. If that kind of info can be transmitted back, it should be.

I think in the case of the bulk api, it would probably need to be included in the document-level responses as well though. That is the expected pattern for that endpoint, since it's kind of a multi-response situation.
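
A hypothetical per-item payload (field names are illustrative only, not an existing Quickwit or Elasticsearch format) combining the 429 status with a retry hint could look like:

{
  status: 429
, error: {
    type: 'no_shard_available_action_exception'
  , reason: 'no shards available'
  }
, retry_after_ms: 500 // hypothetical per-document hint mirroring a Retry-After header
}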
