Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix loop of OffsetForLeaderEpoch requests on quick leader changes #4433

Merged
merged 6 commits into from
Sep 29, 2023

Conversation

milindl
Copy link
Contributor

@milindl milindl commented Sep 14, 2023

Fixes #4425

@milindl milindl requested a review from a team September 14, 2023 05:26
@mjd95
Copy link

mjd95 commented Sep 19, 2023

Thanks for the PR! We have a pretty reliable scenario for triggering #4425 (we do a reassignment on a small topic so that there are at least two partition state changes in short succession, the first and then all subsequent OffsetForLeaderEpoch requests fail with FencedLeaderEpoch).

We ran the consumers from this branch and confirmed that we don't see the issue with this patch.

@milindl
Copy link
Contributor Author

milindl commented Sep 19, 2023

Thanks for confirming it with some independent testing @mjd95!

Copy link
Member

@pranavrth pranavrth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a question and a nit.

Comment on lines 266 to 272
rd_kafka_mock_broker_push_request_error_rtts(
mcluster, 2, RD_KAFKAP_OffsetForLeaderEpoch, 1,
RD_KAFKA_RESP_ERR_KAFKA_STORAGE_ERROR, 900);

rd_kafka_mock_broker_push_request_error_rtts(
mcluster, 2, RD_KAFKAP_OffsetForLeaderEpoch, 1,
RD_KAFKA_RESP_ERR_NO_ERROR, 1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can merge these two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

*
* See #4425.
*/
static void do_test_two_leader_changes(void) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you confirm that this test fails with the old code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it fails with a timeout after we enter the infinite loop

@emasab emasab merged commit 788cd0c into master Sep 29, 2023
@emasab emasab deleted the dev_bug_offsetforleaderepoch branch September 29, 2023 10:23
axelandersson added a commit to axelandersson/librdkafka that referenced this pull request Oct 5, 2023
* upstream/master:
  librdkafka v2.3.0 (confluentinc#4455)
  Fix for idempotent producer fatal errors, triggered after a possibly persisted message state (confluentinc#4438)
  Move can_q_contain_fetched_msgs inside q_serve (confluentinc#4431)
  [KIP-580] Exponential Backoff with Mock Broker Changes to Automate Testing. (confluentinc#4422)
  Update only the mklove version of OpenSSL to 3.0.11 (confluentinc#4454)
  Permanent errors during offset validation should be retried (confluentinc#4447)
  Increased flexver request size for Metadata request to include topic_id size (confluentinc#4453)
  Fix loop of OffsetForLeaderEpoch requests on quick leader changes (confluentinc#4433)
  Fix for stored offsets not being committed if they lacked the leader epoch (confluentinc#4442)
  Add leader epoch to control messages (confluentinc#4434)
  Refactored tmpabuf and fixed an insufficient buffer allocation (confluentinc#4449)
  Work around KIP-700 restrictions for DescribeCluster [KIP-430]
  [admin] KIP-430: Add authorized operations to describe API
  Fix segfault if assignor state is NULL, (confluentinc#4381)
@ktrance
Copy link

ktrance commented Oct 15, 2023

@emasab Do you have an ETA on when 2.3 will be released? We believe this might the issue we are experiencing.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OffsetForLeaderEpoch loop of failed requests with multiple leader changes
5 participants