
KAFKA-18645: New consumer should align close timeout handling with classic consumer #18702

Merged
1 commit merged into apache:trunk from the KAFKA-18645 branch on Feb 5, 2025

Conversation

@frankvicky (Contributor)

JIRA: KAFKA-18645
see discussion:
#18590 (comment)

In the classic consumer, the close timeout is bounded by request.timeout.ms. In the async consumer, this logic is either missing or applies only to individual requests: request.timeout.ms bounds the classic consumer's entire coordinator-closing sequence, whereas the async implementation handles timeouts per request.

We should align the close timeout handling so that ConsumerBounceTest#testClose can be enabled.
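
For reference, a minimal sketch of the bounding the classic consumer applies on close (helper name and wiring are illustrative, not the actual consumer code; Time and Timer are the org.apache.kafka.common.utils types):

import java.time.Duration;
import org.apache.kafka.common.utils.Time;
import org.apache.kafka.common.utils.Timer;

final class CloseTimeoutSketch {
    // Sketch only: cap the user-supplied close timeout at request.timeout.ms
    // before waiting on coordinator-related requests.
    static Timer boundedCloseTimer(Time time, Duration userTimeout, long requestTimeoutMs) {
        return time.timer(Math.min(userTimeout.toMillis(), requestTimeoutMs));
    }
}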

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@github-actions bot added the labels triage (PRs from the community), core (Kafka Broker), consumer, clients, and small (Small PRs) on Jan 25, 2025
@frankvicky (Contributor, Author)

I have run the test case in a loop with this patch:
[screenshot: looped test results, 2025-01-25 19-37-58]

@frankvicky (Contributor, Author)

Doc preview:
[screenshot: documentation preview, 2025-01-25 21-26-14]

@kirktrue (Collaborator)

So given this configuration:

request.timeout.ms=30000
group.protocol=classic

When the user calls Consumer.close(Duration.ofSeconds(60)), will it complete in 30 seconds rather than 60? If so, the user's timeout parameter is effectively ignored, right?
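
Concretely (illustrative only; bootstrap servers and deserializer configs elided):

Properties props = new Properties();
props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
props.put(ConsumerConfig.GROUP_PROTOCOL_CONFIG, "classic");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
// The question: does this wait up to 60 seconds, or give up after the
// 30-second request.timeout.ms?
consumer.close(Duration.ofSeconds(60));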

The change to the timeout behavior was introduced relatively recently in 3.5 via KAFKA-7109. Looking at #12590, I'm not sure the change to ignore the user's timeout was necessarily intentional.

When closing, individual network requests should adhere to request.timeout.ms, but the overall timeout for closing should adhere to the user-provided timeout.

@lianetm (Member) commented Jan 28, 2025

My understanding is that this behaviour of the timeout on close has been in place for a long time (introduced in f72203e). The fetch improvements of KAFKA-7109 seem to just apply to the fetch requests on close the same timeout principle that was already being applied to the coordinator close. Makes sense?

I'll be taking a full pass today. Thanks!

@frankvicky (Contributor, Author)

The fetch improvements of KAFKA-7109 seem to just apply to the fetch requests on close the same timeout principle that was already being applied to the coordinator close.

According to the commit you mentioned, I think you are right.
This behavior has existed for a long time and was only recently applied to fetch requests on close, in 3.5.0.

The change to the timeout behavior was introduced relatively recently in 3.5 via KAFKA-7109. Looking at #12590, I'm not sure the change to ignore the user's timeout was necessarily intentional.
When closing, individual network requests should adhere to request.timeout.ms, but the overall timeout for closing should adhere to the user-provided timeout.

This reminds me that the patch applies this close timeout principle to all closing behavior, and I'm not sure that is what we want. IIRC, the classic consumer only used the timer for the coordinator and fetcher. 🤔

@lianetm (Member) commented Jan 28, 2025

classic consumer only used the timer for the coordinator and fetcher

True, but let's keep in mind that translating that to the new consumer means we should apply the min to the steps that commit, leave the group, close the fetcher, and close the network client, I expect (because this last step will wait for any requests generated by the previous steps). Makes sense? (Agreed that we shouldn't apply it blindly.)
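
A minimal sketch of that ordering, with one capped timer shared across the steps (method names here are hypothetical, not the actual AsyncKafkaConsumer API):

// Hypothetical sketch: one timer, capped at request.timeout.ms, bounds all
// request-generating steps of the async consumer's close.
private void closeInternal(Duration timeout) {
    Timer closeTimer = time.timer(Math.min(timeout.toMillis(), requestTimeoutMs));
    commitPendingOffsets(closeTimer);   // commit step
    leaveGroup(closeTimer);             // leave-group step
    closeFetcher(closeTimer);           // fetch-session close
    closeNetworkClient(closeTimer);     // waits for requests generated above
}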

@frankvicky (Contributor, Author)

Yes, thanks for the further explanation.

@kirktrue (Collaborator)

request.timeout.ms is designed to apply to each discrete network request whereas the close timeout applies to the entire KafkaConsumer.close() call. If the user provides a timeout of 60 seconds to close(), why should it give up after only 30 seconds? As a parallel, consider the relationship between request.timeout.ms and delivery.timeout.ms in the KafkaProducer.send() call. In that case, it's delivery.timeout.ms, not request.timeout.ms, that serves as the deadline by which all of the network calls in send() must complete.
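
For comparison, the producer-side pairing described above, with illustrative values (each network request gets up to request.timeout.ms, while delivery.timeout.ms bounds the whole send()):

request.timeout.ms=30000
delivery.timeout.ms=120000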

That said, let's fix this gap with the solution as proposed. No one has complained about the behavior of the existing consumer timeout. I have no intention of dying on this hill; I just personally find it confusing 😄

@kirktrue (Collaborator) left a comment

Thanks for the PR @frankvicky.

I'd like to see a sanity-check unit test added somewhere like KafkaConsumerTest that ensures the value of request.timeout.ms is used over the timeout passed in. If that's the intended and documented behavior, we should validate it.

Thanks!
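
A hypothetical shape for such a check (the helper's name and test visibility are assumed, not the actual KafkaConsumerTest API):

@Test
public void testCloseTimerIsCappedByRequestTimeout() {
    // Assumes a consumer configured with request.timeout.ms=30000 and a
    // test-visible equivalent of the PR's createTimer(Duration) helper.
    Timer closeTimer = consumer.createTimerForCloseRequests(Duration.ofSeconds(60));
    assertEquals(30_000L, closeTimer.remainingMs());
}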

@github-actions bot removed the triage (PRs from the community) label on Jan 30, 2025
@frankvicky (Contributor, Author)

Hi @kirktrue

I'd like to see a sanity-check unit test added somewhere like KafkaConsumerTest that ensures the value of request.timeout.ms is used over the timeout passed in. If that's the intended and documented behavior, we should validate it.

I think it's not easy to check. Could we rely on ConsumerBounceTest#testClose to validate this behavior?

@kirktrue (Collaborator)

Hi @kirktrue

I'd like to see a sanity-check unit test added somewhere like KafkaConsumerTest that ensures the value of request.timeout.ms is used over the timeout passed in. If that's the intended and documented behavior, we should validate it.

I think it's not easy to check. Could we rely on ConsumerBounceTest#testClose to validate this behavior?

If it's a lot of work, I guess we can skip it and assume testClose() validates the desired behavior.

@frankvicky force-pushed the KAFKA-18645 branch 2 times, most recently from 944a037 to e27781a, on February 2, 2025 at 02:57
@lianetm (Member) left a comment

Thanks @frankvicky !

@@ -1326,7 +1331,7 @@ private void close(Duration timeout, boolean swallowException) {
         // We are already closing with a timeout, don't allow wake-ups from here on.
         wakeupTrigger.disableWakeups();

-        final Timer closeTimer = time.timer(timeout);
+        final Timer closeTimer = createTimer(timeout);
@lianetm (Member) commented Feb 3, 2025

Not introduced here, but affected by this change: I notice that runRebalanceCallbacksOnClose consumes time from the close timeout, right? (It receives the timer just to update it.) But that behaviour is not the same in the classic consumer.

In the classic consumer, the close timeout really only applies to requests. The callbacks run when closing the abstract coordinator, without time boundaries and, most importantly, without consuming time from the close timeout. We run runRebalanceCallbacksOnClose without time boundaries too, but it does consume time from the timeout param, right? Wouldn't that potentially leave less time for the following requests? I'm concerned about existing apps that run callbacks and call close with a timeout that used to be "enough" but now may not be. Should we simply remove the timer from runRebalanceCallbacksOnClose?

@frankvicky (Contributor, Author)

Yes, I just walked through the corresponding logic of the classic consumer.
The callback on close doesn't consume time from the timer, so to align the behavior I think it's fine to remove the timer from runRebalanceCallbacksOnClose.
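
A minimal before/after sketch of the direction agreed here (exact signature assumed):

// Before: the rebalance callbacks consume time from the shared close timer.
runRebalanceCallbacksOnClose(closeTimer);

// After: run the callbacks without the timer so they no longer eat into the
// budget left for the subsequent close requests, matching the classic consumer.
runRebalanceCallbacksOnClose();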

@@ -153,7 +153,8 @@ public class CommonClientConfigs {
     public static final String REQUEST_TIMEOUT_MS_DOC = "The configuration controls the maximum amount of time the client will wait "
         + "for the response of a request. If the response is not received before the timeout "
         + "elapses the client will resend the request if necessary or fail the request if "
-        + "retries are exhausted.";
+        + "retries are exhausted. This timeout also applies to the consumer close operation - "
+        + "even if a larger timeout is specified for close, it will be limited by this value.";
(Member)

Well, just for our understanding and related to my other comment: the classic behaviour we're trying to keep is that the request timeout applies to operations performed with the coordinator and the leader (coordinator-related requests and fetch sessions), not to other close operations that do not perform any request. I'm not suggesting we add any of that here; it would pollute this config doc IMO. Actually, since it's really specific to consumer.close, isn't it enough to explain the behaviour in the close API javadoc?

(Member)

isn't it enough to explain the behaviour in the close API javadoc?

+1 to add docs to close API

(Member)

It's crucial to highlight that neither the OffsetCommitCallback nor the ConsumerRebalanceListener callbacks consume time from the close timeout.

@lianetm (Member) left a comment

Thanks for the updates @frankvicky !

@@ -1782,6 +1782,13 @@ public void close() {
      * timeout. If the consumer is unable to complete offset commits and gracefully leave the group
      * before the timeout expires, the consumer is force closed. Note that {@link #wakeup()} cannot be
      * used to interrupt close.
+     * <p>
+     * The actual maximum wait time is bounded by the {@link ConsumerConfig#REQUEST_TIMEOUT_MS_CONFIG} setting, which
+     * only applies to operations performed with the coordinator (coordinator-related requests and
(Member)

Should we say this instead? (could be to the coordinator or the leader)

Suggested change:
-     * only applies to operations performed with the coordinator (coordinator-related requests and
+     * only applies to operations performed with the broker (coordinator-related requests and

@@ -1368,6 +1373,10 @@ private void close(Duration timeout, boolean swallowException) {
         }
     }

+    private Timer createTimer(Duration timeout) {
+        return time.timer(Duration.ofMillis(Math.min(timeout.toMillis(), requestTimeoutMs)));
(Member)

Suggested change:
-        return time.timer(Duration.ofMillis(Math.min(timeout.toMillis(), requestTimeoutMs)));
+        return time.timer(Math.min(timeout.toMillis(), requestTimeoutMs));

@@ -1368,6 +1373,10 @@ private void close(Duration timeout, boolean swallowException) {
         }
     }

+    private Timer createTimer(Duration timeout) {
(Member)

nit: should we show this is for the close op? createTimerForCloseRequests or similar maybe?

@frankvicky (Contributor, Author)

Yes, it's nice to follow the naming style of the classic consumer.

@@ -1368,6 +1373,10 @@ private void close(Duration timeout, boolean swallowException) {
         }
     }

+    private Timer createTimer(Duration timeout) {
+        return time.timer(Duration.ofMillis(Math.min(timeout.toMillis(), requestTimeoutMs)));
(Member)

Should we add a null check on the time object here? Similar to what the classic consumer does, since close can be called from the constructor at any point (in the finally block) if something fails.
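
A sketch of such a guard (fallback to Time.SYSTEM assumed; the classic consumer's exact handling may differ):

private Timer createTimerForCloseRequests(Duration timeout) {
    // 'time' may still be null if close() runs from the constructor's
    // finally block after a construction failure, so fall back to system time.
    Time localTime = (time == null) ? Time.SYSTEM : time;
    return localTime.timer(Math.min(timeout.toMillis(), requestTimeoutMs));
}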

@lianetm (Member) commented Feb 4, 2025

@frankvicky could you please merge the latest trunk changes? The flaky/failing transactions test has been fixed by #18793. Thanks!

KAFKA-18645: New consumer should align close timeout handling with classic consumer

JIRA: KAFKA-18645
see discussion:
apache#18590 (comment)

In the classic consumer, the timeout respects request.timeout.ms.
However, in the async consumer, this logic is either missing or only
applies to individual requests. Unlike the classic consumer, where
request.timeout.ms works for the entire coordinator closing behavior,
the async implementation handles timeouts differently.

We should align the close timeout handling to enable
ConsumerBounceTest#testClose
@frankvicky (Contributor, Author)

Sure!

@lianetm (Member) left a comment

Thanks! LGTM

@lianetm merged commit 6636316 into apache:trunk on Feb 5, 2025
9 checks passed
lianetm pushed a commit that referenced this pull request Feb 5, 2025
KAFKA-18645: New consumer should align close timeout handling with classic consumer (#18702)

Reviewers: Lianet Magrans <lmagrans@confluent.io>, Kirk True <ktrue@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
@lianetm (Member) commented Feb 5, 2025

Merged to trunk and cherry-picked to 4.0

pdruley pushed a commit to pdruley/kafka that referenced this pull request Feb 12, 2025
KAFKA-18645: New consumer should align close timeout handling with classic consumer (apache#18702)

Reviewers: Lianet Magrans <lmagrans@confluent.io>, Kirk True <ktrue@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>