
Partition stuck and does not receive any message. fetch_state:validate-epoch-wait #4492

Closed
Phil1972 opened this issue Oct 31, 2023 Discussed in #4484 · 7 comments

@Phil1972

Phil1972 commented Oct 31, 2023

Discussed in #4484

Originally posted by Phil1972 October 25, 2023
Hi,

We do not use transactions for our message producers in our implementation. We have many producers and, in this simple case, only one consumer process (in a Kubernetes pod) which processes all 60 partitions. Messages are spread more or less evenly across producers/partitions. I do not see any evident issues on the producer side.

At some point, one (or more) partitions no longer receive messages and I can observe a consumer lag on Confluent's cluster.

We implemented a partition monitor that checks whether we have received no messages after a certain period of time. We recently got into this situation where we stopped getting messages from partition 35 (out of 60), and this is what we have in the stats for that partition:

"35": { "partition": 35, "broker": 21, "leader": 21, "desired": true, "unknown": false, "msgq_cnt": 0, "msgq_bytes": 0, "xmit_msgq_cnt": 0, "xmit_msgq_bytes": 0, "fetchq_cnt": 0, "fetchq_size": 0, "fetch_state": "validate-epoch-wait", "query_offset": -1001, "next_offset": 1245176, "app_offset": 1245176, "stored_offset": -1001, "stored_leader_epoch": -1, "commited_offset": 1245176, "committed_offset": 1245176, "committed_leader_epoch": 120, "eof_offset": 1245176, "lo_offset": -1, "hi_offset": -1, "ls_offset": -1, "consumer_lag": -1, "consumer_lag_stored": -1, "leader_epoch": 121, "txmsgs": 0, "txbytes": 0, "rxmsgs": 615, "rxbytes": 710580, "msgs": 615, "rx_ver_drops": 0, "msgs_inflight": 0, "next_ack_seq": 0, "next_err_seq": 0, "acked_msgid": 0 },

What is the meaning of `validate-epoch-wait`?

We plan to restart the pod whenever we encounter this type of issue, since we do not know how to recover from it, nor do we know why it happens.

Any insight or help on this issue would be appreciated,

Thanks,

Confluent.Kafka NuGet version 2.2.0

@Phil1972
Author

Phil1972 commented Oct 31, 2023

We have had many occurrences since Saturday, and today I noticed that there are some typos in the stats as well, generating too many entries:
"50": { "partition": 50, "broker": 16, "leader": 16, "desired": true, "unknown": false, "msgq_cnt": 0, "msgq_bytes": 0, "xmit_msgq_cnt": 0, "xmit_msgq_bytes": 0, "fetchq_cnt": 0, "fetchq_size": 0, "fetch_state": "validate-epoch-wait", "query_offset": -1001, "next_offset": 4722417, "app_offset": 4722417, "stored_offset": -1001, "stored_leader_epoch": -1, "commited_offset": 4722417, "committed_offset": 4722417, "committed_leader_epoch": 178, "eof_offset": 4722416, "lo_offset": -1, "hi_offset": -1, "ls_offset": -1, "consumer_lag": -1, "consumer_lag_stored": -1, "leader_epoch": 179, "txmsgs": 0, "txbytes": 0, "rxmsgs": 3128, "rxbytes": 3913726, "msgs": 3128, "rx_ver_drops": 0, "msgs_inflight": 0, "next_ack_seq": 0, "next_err_seq": 0, "acked_msgid": 0 },

I think there are typos that generate redundant info in the JSON:

```
"commited_offset": 4722417,
"committed_offset": 4722417,
```

@Phil1972 Phil1972 changed the title from "Partition stuck and do not receive any message. fetch_state:validate-epoch-wait" to "Partition stuck and does not receive any message. fetch_state:validate-epoch-wait" on Oct 31, 2023
@Phil1972
Author

Phil1972 commented Oct 31, 2023

We noticed some Kafka errors, flagged as non-fatal, prior to the issue (most of the time, but not always):

`"{"@t":"2023-10-31T05:26:33.5182985Z","@mt":"Kafka logs: [thrd:sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16]: sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16: Timed out 1337 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests","@l":"Warning"}"
"{"@t":"2023-10-31T05:26:33.5188196Z","@mt":"Kafka logs: [thrd:sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16]: sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16: 1337 request(s) timed out: disconnect (average rtt 2373.701ms) (after 82773ms in state UP)","@l":"Error"}"

"{"@t":"2023-10-31T05:26:33.5212463Z","@mt":"Kafka consumer error: {ErrorCode} because {Reason} (IsFatal: {IsFatal}, broker error: {IsBrokerError}, local error : {IsLocalError})","@l":"Fatal","ErrorCode":"Local_TimedOut","Reason":"sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16: 1337 request(s) timed out: disconnect (average rtt 2373.701ms) (after 82773ms in state UP)","IsFatal":false,"IsBrokerError":false,"IsLocalError":true}"

...

And then our partition monitoring mechanism kicks in:

"{""@t"":""2023-10-31T05:30:02.1151066Z"",""@mt"":""No message received for assigned partitions [Topid [[5]],Topic [[14]]] since 00:05:20.8702035(Threshold: 00:05:00). Partitions jsons: {""partition"":5,""broker"":16,""leader"":16,""desired"":true,""unknown"":false,""msgq_cnt"":0,""msgq_bytes"":0,""xmit_msgq_cnt"":0,""xmit_msgq_bytes"":0,""fetchq_cnt"":0,""fetchq_size"":0,""fetch_state"":""validate-epoch-wait"",""query_offset"":-1001,""next_offset"":4727048,""app_offset"":4727048,""stored_offset"":-1001,""stored_leader_epoch"":-1,""commited_offset"":4727048,""committed_offset"":4727048,""committed_leader_epoch"":177,""eof_offset"":4727048,""lo_offset"":-1,""hi_offset"":-1,""ls_offset"":-1,""consumer_lag"":-1,""consumer_lag_stored"":-1,""leader_epoch"":178,""txmsgs"":0,""txbytes"":0,""rxmsgs"":3269,""rxbytes"":3965587,""msgs"":3269,""rx_ver_drops"":0,""msgs_inflight"":0,""next_ack_seq"":0,""next_err_seq"":0,""acked_msgid"":0}\n{""partition"":14,""broker"":16,""leader"":16,""desired"":true,""unknown"":false,""msgq_cnt"":0,""msgq_bytes"":0,""xmit_msgq_cnt"":0,""xmit_msgq_bytes"":0,""fetchq_cnt"":0,""fetchq_size"":0,""fetch_state"":""validate-epoch-wait"",""query_offset"":-1001,""next_offset"":4738019,""app_offset"":4738019,""stored_offset"":-1001,""stored_leader_epoch"":-1,""commited_offset"":4738019,""committed_offset"":4738019,""committed_leader_epoch"":164,""eof_offset"":4738019,""lo_offset"":-1,""hi_offset"":-1,""ls_offset"":-1,""consumer_lag"":-1,""consumer_lag_stored"":-1,""leader_epoch"":165,""txmsgs"":0,""txbytes"":0,""rxmsgs"":3308,""rxbytes"":4052406,""msgs"":3308,""rx_ver_drops"":0,""msgs_inflight"":0,""next_ack_seq"":0,""next_err_seq"":0,""acked_msgid"":0}"
`
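For reference, a minimal sketch of how such error and log callbacks can be attached with Confluent.Kafka (a hypothetical sketch, not the actual logging code; `Console.WriteLine` stands in for the real log sink):

```csharp
// Illustrative sketch: attach error and log handlers similar to the ones that
// produced the entries above.
using System;
using Confluent.Kafka;

static class LoggingConsumerFactory
{
    public static IConsumer<string, string> Build(ConsumerConfig config) =>
        new ConsumerBuilder<string, string>(config)
            .SetErrorHandler((_, e) =>
                // Local_TimedOut above is reported with IsFatal == false: the client
                // keeps retrying on its own and the consumer is not recreated.
                Console.WriteLine(
                    $"Kafka consumer error: {e.Code} because {e.Reason} " +
                    $"(IsFatal: {e.IsFatal}, broker error: {e.IsBrokerError}, " +
                    $"local error: {e.IsLocalError})"))
            .SetLogHandler((_, log) =>
                Console.WriteLine($"Kafka logs: [{log.Facility}] {log.Message} ({log.Level})"))
            .Build();
}
```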

@emasab
Contributor

emasab commented Oct 31, 2023

@Phil1972 it seems like you're hitting this bug: #4433.
It's present in 2.1.x and 2.2.0. Could you upgrade to 2.3.0 and check if it solves your issue?
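One way to do that upgrade, assuming the project manages packages with the .NET CLI:

```
dotnet add package Confluent.Kafka --version 2.3.0
```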

@Phil1972
Author

Phil1972 commented Oct 31, 2023 via email

@Phil1972
Author

> @Phil1972 it seems like you're hitting this bug: #4433. It's present in 2.1.x and 2.2.0. Could you upgrade to 2.3.0 and check if it solves your issue?

Also, is there any information on this undocumented `fetch_state` value? What does it mean when a partition is in that state?

i.e. `validate-epoch-wait`

Thanks,

@emasab
Contributor

emasab commented Nov 1, 2023

@Phil1972 that is an internal partition state meaning that offset validation is in progress. It was implemented as part of KIP-320 to prevent offset resets in case of unclean leader elections. The bug could happen with two subsequent partition leader changes (broker side), when the first validation gets an error and keeps retrying. It can happen with a single consumer too.

@emasab
Contributor

emasab commented Nov 16, 2023

Closing this: upgrading to v2.3.0 should solve the issue. Reopen it otherwise.

@emasab emasab closed this as completed Nov 16, 2023