Partition stuck and does not receive any message. fetch_state:validate-epoch-wait #4492
We have had many occurrences since Saturday. Today I also noticed some typos in the stats that generate redundant entries in the JSON: |
We noticed some Kafka errors that are flagged as non-fatal prior to the issue (most of the time, but not always):
"{"@t":"2023-10-31T05:26:33.5182985Z","@mt":"Kafka logs: [thrd:sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16]: sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16: Timed out 1337 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests","@l":"Warning"}"
"{"@t":"2023-10-31T05:26:33.5212463Z","@mt":"Kafka consumer error: {ErrorCode} because {Reason} (IsFatal: {IsFatal}, broker error: {IsBrokerError}, local error : {IsLocalError})","@l":"Fatal","ErrorCode":"Local_TimedOut","Reason":"sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16: 1337 request(s) timed out: disconnect (average rtt 2373.701ms) (after 82773ms in state UP)","IsFatal":false,"IsBrokerError":false,"IsLocalError":true}"
... And then our partition monitoring mechanism kicks in:
"{"@t":"2023-10-31T05:30:02.1151066Z","@mt":"No message received for assigned partitions [Topid [[5]],Topic [[14]]] since 00:05:20.8702035(Threshold: 00:05:00). Partitions jsons: {"partition":5,"broker":16,"leader":16,"desired":true,"unknown":false,"msgq_cnt":0,"msgq_bytes":0,"xmit_msgq_cnt":0,"xmit_msgq_bytes":0,"fetchq_cnt":0,"fetchq_size":0,"fetch_state":"validate-epoch-wait","query_offset":-1001,"next_offset":4727048,"app_offset":4727048,"stored_offset":-1001,"stored_leader_epoch":-1,"commited_offset":4727048,"committed_offset":4727048,"committed_leader_epoch":177,"eof_offset":4727048,"lo_offset":-1,"hi_offset":-1,"ls_offset":-1,"consumer_lag":-1,"consumer_lag_stored":-1,"leader_epoch":178,"txmsgs":0,"txbytes":0,"rxmsgs":3269,"rxbytes":3965587,"msgs":3269,"rx_ver_drops":0,"msgs_inflight":0,"next_ack_seq":0,"next_err_seq":0,"acked_msgid":0}\n{"partition":14,"broker":16,"leader":16,"desired":true,"unknown":false,"msgq_cnt":0,"msgq_bytes":0,"xmit_msgq_cnt":0,"xmit_msgq_bytes":0,"fetchq_cnt":0,"fetchq_size":0,"fetch_state":"validate-epoch-wait","query_offset":-1001,"next_offset":4738019,"app_offset":4738019,"stored_offset":-1001,"stored_leader_epoch":-1,"commited_offset":4738019,"committed_offset":4738019,"committed_leader_epoch":164,"eof_offset":4738019,"lo_offset":-1,"hi_offset":-1,"ls_offset":-1,"consumer_lag":-1,"consumer_lag_stored":-1,"leader_epoch":165,"txmsgs":0,"txbytes":0,"rxmsgs":3308,"rxbytes":4052406,"msgs":3308,"rx_ver_drops":0,"msgs_inflight":0,"next_ack_seq":0,"next_err_seq":0,"acked_msgid":0}" |
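For context on how such errors surface in the client, here is a minimal sketch of an error handler with Confluent.Kafka that logs the same fields that appear in the log lines above. The broker address, group id, and topic name are placeholders, not the actual configuration from this thread.

```csharp
using System;
using Confluent.Kafka;

class ErrorLoggingConsumer
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "broker:9092",   // placeholder
            GroupId = "example-group"           // placeholder
        };

        using var consumer = new ConsumerBuilder<Ignore, string>(config)
            // Non-fatal errors such as Local_TimedOut are reported here and the
            // client normally retries/recovers on its own; only e.IsFatal errors
            // require tearing the consumer down.
            .SetErrorHandler((_, e) => Console.WriteLine(
                $"Kafka consumer error: {e.Code} because {e.Reason} " +
                $"(IsFatal: {e.IsFatal}, broker error: {e.IsBrokerError}, local error: {e.IsLocalError})"))
            .Build();

        consumer.Subscribe("example-topic");    // placeholder
        while (true)
            consumer.Consume(TimeSpan.FromSeconds(1)); // handlers fire from Consume calls
    }
}
```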
OK, I will try this fix, but the issue also occurs without any partition reshuffling, since in some of our environments we only have one pod handling all the partitions.
In any case, I will give this a try and come back with feedback.
Thanks,
________________________________
From: Emanuele Sabellico
@Phil1972 it seems like you're hitting this bug: #4433.
It's present in 2.1.x and 2.2.0. Could you upgrade to 2.3.0 and check if it solves your issue?
|
@Phil1972 that is an internal partition state meaning that offset validation is in progress. It was implemented as part of KIP-320 to prevent offset resets in case of unclean leader elections. The bug could happen with two subsequent partition leader changes (broker side), when the first validation got an error and is retrying. It can happen with a single consumer too. |
Closing it: upgrading to v2.3.0 should solve this issue. Reopen otherwise |
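As a side note on verifying the upgrade, Confluent.Kafka exposes the underlying librdkafka version, so a quick check like the small sketch below can confirm the client actually loads 2.3.0 or later. The class name is just illustrative.

```csharp
using System;
using Confluent.Kafka;

class LibrdkafkaVersionCheck
{
    static void Main()
    {
        // Library.VersionString exposes the librdkafka version used by
        // Confluent.Kafka; it should report 2.3.0 or later once the fix
        // referenced above is in place.
        Console.WriteLine($"librdkafka version: {Library.VersionString} (0x{Library.Version:X})");
    }
}
```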
Discussed in #4484
Originally posted by Phil1972 October 25, 2023
Hi,
We do not use transactions for message production in our implementation. We have many producers and, in this simple case, only one consumer process (in a Kubernetes pod) which processes all 60 partitions. Messages are spread more or less evenly between producers/partitions. I do not see any evident issues on the producer side.
At some point, one (or more) partitions stop receiving messages, and I can observe consumer lag on Confluent's cluster.
We implemented a partition monitor that flags when we get no messages for a certain period of time. We recently got into this situation, where we stopped getting messages from partition 35 (out of 60); this is what we have in the stats for that partition:
"35": { "partition": 35, "broker": 21, "leader": 21, "desired": true, "unknown": false, "msgq_cnt": 0, "msgq_bytes": 0, "xmit_msgq_cnt": 0, "xmit_msgq_bytes": 0, "fetchq_cnt": 0, "fetchq_size": 0, "fetch_state": "validate-epoch-wait", "query_offset": -1001, "next_offset": 1245176, "app_offset": 1245176, "stored_offset": -1001, "stored_leader_epoch": -1, "commited_offset": 1245176, "committed_offset": 1245176, "committed_leader_epoch": 120, "eof_offset": 1245176, "lo_offset": -1, "hi_offset": -1, "ls_offset": -1, "consumer_lag": -1, "consumer_lag_stored": -1, "leader_epoch": 121, "txmsgs": 0, "txbytes": 0, "rxmsgs": 615, "rxbytes": 710580, "msgs": 615, "rx_ver_drops": 0, "msgs_inflight": 0, "next_ack_seq": 0, "next_err_seq": 0, "acked_msgid": 0 },
What is the meaning of validate-epoch-wait?
We plan to restart the pod whenever we encounter this type of issue, since we do not know how to recover from it, nor why it happens.
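For illustration, here is a minimal sketch of the kind of monitoring described above, using Confluent.Kafka's statistics handler to flag partitions whose fetch_state is validate-epoch-wait. The broker address, group id, topic name, and interval are placeholders, not the actual setup.

```csharp
using System;
using System.Text.Json;
using Confluent.Kafka;

class StuckPartitionMonitor
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "broker:9092",   // placeholder
            GroupId = "example-group",          // placeholder
            StatisticsIntervalMs = 60000        // emit the stats JSON once per minute
        };

        using var consumer = new ConsumerBuilder<Ignore, string>(config)
            .SetStatisticsHandler((_, json) =>
            {
                // The handler receives the librdkafka statistics JSON; the
                // per-partition objects shown above appear under
                // topics -> <topic> -> partitions -> <partition>.
                using var stats = JsonDocument.Parse(json);
                if (!stats.RootElement.TryGetProperty("topics", out var topics)) return;

                foreach (var topic in topics.EnumerateObject())
                {
                    foreach (var partition in topic.Value.GetProperty("partitions").EnumerateObject())
                    {
                        if (!partition.Value.TryGetProperty("fetch_state", out var state)) continue;
                        if (state.GetString() == "validate-epoch-wait")
                        {
                            // e.g. fail a liveness probe here so the pod gets restarted
                            Console.WriteLine(
                                $"Partition {topic.Name} [{partition.Name}] stuck in fetch_state=validate-epoch-wait");
                        }
                    }
                }
            })
            .Build();

        consumer.Subscribe("example-topic");    // placeholder
        while (true)
            consumer.Consume(TimeSpan.FromSeconds(1)); // the statistics handler fires from Consume calls
    }
}
```

This mirrors the threshold-based monitor described above, but keys off the reported fetch_state instead of elapsed time, so it can distinguish a genuinely idle partition from one stuck in validation.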
Any insight or help on this issue would be appreciated,
Thanks,
Confluent.Kafka NuGet package version: 2.2.0