
Partition stuck and does not receive any message. fetch_state:validate-epoch-wait #4492

Closed
Phil1972 opened this issue Oct 31, 2023 Discussed in #4484 · 7 comments

@Phil1972

Phil1972 commented Oct 31, 2023

Discussed in #4484

Originally posted by Phil1972 October 25, 2023
Hi,

We do not use transactions for our message producers in our implementation. We have many producers and, in this simple case, only one consumer process (in a Kubernetes pod) which processes all 60 partitions. Messages are spread more or less evenly across producers/partitions. I do not see any evident issues on the producer side.

At some point, one (or more) partitions no longer receive messages and I can observe a consumer lag on Confluent's cluster.

We implemented a partition monitor that checks whether we have received no messages after a certain period of time. We recently got into this situation where we stopped getting messages from partition 35 (out of 60), and this is what we have in the stats for that partition:

"35": { "partition": 35, "broker": 21, "leader": 21, "desired": true, "unknown": false, "msgq_cnt": 0, "msgq_bytes": 0, "xmit_msgq_cnt": 0, "xmit_msgq_bytes": 0, "fetchq_cnt": 0, "fetchq_size": 0, "fetch_state": "validate-epoch-wait", "query_offset": -1001, "next_offset": 1245176, "app_offset": 1245176, "stored_offset": -1001, "stored_leader_epoch": -1, "commited_offset": 1245176, "committed_offset": 1245176, "committed_leader_epoch": 120, "eof_offset": 1245176, "lo_offset": -1, "hi_offset": -1, "ls_offset": -1, "consumer_lag": -1, "consumer_lag_stored": -1, "leader_epoch": 121, "txmsgs": 0, "txbytes": 0, "rxmsgs": 615, "rxbytes": 710580, "msgs": 615, "rx_ver_drops": 0, "msgs_inflight": 0, "next_ack_seq": 0, "next_err_seq": 0, "acked_msgid": 0 },

What is the meaning of `validate-epoch-wait`?

We plan to restart the pod whenever we encounter this type of issue, since we do not know how to recover from it, nor do we know why it happens.

Any insight or help on this issue would be appreciated,

Thanks,

Confluent.Kafka NuGet version 2.2.0

@Phil1972
Author

Phil1972 commented Oct 31, 2023

We have had many occurrences since Saturday, and today I noticed that there are some typos in the stats as well, generating too many entries:
"50": { "partition": 50, "broker": 16, "leader": 16, "desired": true, "unknown": false, "msgq_cnt": 0, "msgq_bytes": 0, "xmit_msgq_cnt": 0, "xmit_msgq_bytes": 0, "fetchq_cnt": 0, "fetchq_size": 0, "fetch_state": "validate-epoch-wait", "query_offset": -1001, "next_offset": 4722417, "app_offset": 4722417, "stored_offset": -1001, "stored_leader_epoch": -1, "commited_offset": 4722417, "committed_offset": 4722417, "committed_leader_epoch": 178, "eof_offset": 4722416, "lo_offset": -1, "hi_offset": -1, "ls_offset": -1, "consumer_lag": -1, "consumer_lag_stored": -1, "leader_epoch": 179, "txmsgs": 0, "txbytes": 0, "rxmsgs": 3128, "rxbytes": 3913726, "msgs": 3128, "rx_ver_drops": 0, "msgs_inflight": 0, "next_ack_seq": 0, "next_err_seq": 0, "acked_msgid": 0 },

I think there are typos that generate redundant info in the JSON:

```
"commited_offset": 4722417,
"committed_offset": 4722417,
```

@Phil1972 Phil1972 changed the title from "Partition stuck and do not receive any message. fetch_state:validate-epoch-wait" to "Partition stuck and does not receive any message. fetch_state:validate-epoch-wait" on Oct 31, 2023
@Phil1972
Author

Phil1972 commented Oct 31, 2023

We noticed some Kafka errors, flagged as non-fatal, prior to the issue (most of the time, but not always):

`"{"@t":"2023-10-31T05:26:33.5182985Z","@mt":"Kafka logs: [thrd:sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16]: sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16: Timed out 1337 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests","@l":"Warning"}"
"{"@t":"2023-10-31T05:26:33.5188196Z","@mt":"Kafka logs: [thrd:sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16]: sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16: 1337 request(s) timed out: disconnect (average rtt 2373.701ms) (after 82773ms in state UP)","@l":"Error"}"

"{"@t":"2023-10-31T05:26:33.5212463Z","@mt":"Kafka consumer error: {ErrorCode} because {Reason} (IsFatal: {IsFatal}, broker error: {IsBrokerError}, local error : {IsLocalError})","@l":"Fatal","ErrorCode":"Local_TimedOut","Reason":"sasl_ssl://b16-pkc-lgwgm.eastus2.azure.confluent.cloud:9092/16: 1337 request(s) timed out: disconnect (average rtt 2373.701ms) (after 82773ms in state UP)","IsFatal":false,"IsBrokerError":false,"IsLocalError":true}"

...

And then our partition monitoring mechanism kicks in:

"{""@t"":""2023-10-31T05:30:02.1151066Z"",""@mt"":""No message received for assigned partitions [Topid [[5]],Topic [[14]]] since 00:05:20.8702035(Threshold: 00:05:00). Partitions jsons: {""partition"":5,""broker"":16,""leader"":16,""desired"":true,""unknown"":false,""msgq_cnt"":0,""msgq_bytes"":0,""xmit_msgq_cnt"":0,""xmit_msgq_bytes"":0,""fetchq_cnt"":0,""fetchq_size"":0,""fetch_state"":""validate-epoch-wait"",""query_offset"":-1001,""next_offset"":4727048,""app_offset"":4727048,""stored_offset"":-1001,""stored_leader_epoch"":-1,""commited_offset"":4727048,""committed_offset"":4727048,""committed_leader_epoch"":177,""eof_offset"":4727048,""lo_offset"":-1,""hi_offset"":-1,""ls_offset"":-1,""consumer_lag"":-1,""consumer_lag_stored"":-1,""leader_epoch"":178,""txmsgs"":0,""txbytes"":0,""rxmsgs"":3269,""rxbytes"":3965587,""msgs"":3269,""rx_ver_drops"":0,""msgs_inflight"":0,""next_ack_seq"":0,""next_err_seq"":0,""acked_msgid"":0}\n{""partition"":14,""broker"":16,""leader"":16,""desired"":true,""unknown"":false,""msgq_cnt"":0,""msgq_bytes"":0,""xmit_msgq_cnt"":0,""xmit_msgq_bytes"":0,""fetchq_cnt"":0,""fetchq_size"":0,""fetch_state"":""validate-epoch-wait"",""query_offset"":-1001,""next_offset"":4738019,""app_offset"":4738019,""stored_offset"":-1001,""stored_leader_epoch"":-1,""commited_offset"":4738019,""committed_offset"":4738019,""committed_leader_epoch"":164,""eof_offset"":4738019,""lo_offset"":-1,""hi_offset"":-1,""ls_offset"":-1,""consumer_lag"":-1,""consumer_lag_stored"":-1,""leader_epoch"":165,""txmsgs"":0,""txbytes"":0,""rxmsgs"":3308,""rxbytes"":4052406,""msgs"":3308,""rx_ver_drops"":0,""msgs_inflight"":0,""next_ack_seq"":0,""next_err_seq"":0,""acked_msgid"":0}"
`
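For reference, a minimal sketch of how such error and log callbacks can be attached with Confluent.Kafka (a hypothetical sketch, not the actual logging code; `Console.WriteLine` stands in for the real log sink):

```csharp
// Illustrative sketch: attach error and log handlers similar to the ones that
// produced the entries above.
using System;
using Confluent.Kafka;

static class LoggingConsumerFactory
{
    public static IConsumer<string, string> Build(ConsumerConfig config) =>
        new ConsumerBuilder<string, string>(config)
            .SetErrorHandler((_, e) =>
                // Local_TimedOut above is reported with IsFatal == false: the client
                // keeps retrying on its own and the consumer is not recreated.
                Console.WriteLine(
                    $"Kafka consumer error: {e.Code} because {e.Reason} " +
                    $"(IsFatal: {e.IsFatal}, broker error: {e.IsBrokerError}, " +
                    $"local error: {e.IsLocalError})"))
            .SetLogHandler((_, log) =>
                Console.WriteLine($"Kafka logs: [{log.Facility}] {log.Message} ({log.Level})"))
            .Build();
}
```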

@emasab
Contributor

emasab commented Oct 31, 2023

@Phil1972 it seems like you're hitting this bug: #4433.
It's present in 2.1.x and 2.2.0. Could you upgrade to 2.3.0 and check if it solves your issue?
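One way to do that upgrade, assuming the project manages packages with the .NET CLI:

```
dotnet add package Confluent.Kafka --version 2.3.0
```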

@Phil1972
Author

Phil1972 commented Oct 31, 2023 via email

@Phil1972
Author

> @Phil1972 it seems like you're hitting this bug: #4433. It's present in 2.1.x and 2.2.0. Could you upgrade to 2.3.0 and check if it solves your issue?

Also, is there any information on this undocumented `fetch_state` value? What does it mean when a partition is in that state?

i.e. `validate-epoch-wait`

Thanks,

@emasab
Contributor

emasab commented Nov 1, 2023

@Phil1972 that is an internal partition state meaning that offset validation is in progress. It was implemented as part of KIP-320 to prevent offset resets in case of unclean leader elections. The bug could happen with two subsequent partition leader changes (broker side), when the first validation gets an error and keeps retrying. It can happen with a single consumer too.

@emasab
Contributor

emasab commented Nov 16, 2023

Closing this: upgrading to v2.3.0 should solve the issue. Reopen it otherwise.

@emasab emasab closed this as completed Nov 16, 2023