propagate scope in async failures #3950

Open: 1 commit to be merged into base: main

igormq commented on Jun 7, 2025

Fix trace context loss in async Kafka error handling

This PR addresses an issue where the trace context is lost when handling Kafka message failures asynchronously.

Problem

When async returns are enabled and a consumer failure occurs, the trace context from the original message is not propagated. This leads to each step of the retry/DLT flow starting a new trace instead of continuing the original one.

Example (current behavior):
• Producer → trace 1
• Consumer → trace 1, fails → message goes to retry topic
• Retry listener → trace 2, fails → message goes to DLT topic
• DLT listener → trace 3

This breaks end-to-end traceability, as each listener receives a new trace ID.

Root cause

The issue stems from the handleAsyncFailure method, which runs in a different thread but does not propagate the original Observation (trace) context associated with the failed record.
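
As a rough illustration of the root cause (not the actual spring-kafka code): the current observation is bound to the thread that opened its scope, so a handler running on another thread finds nothing. The sketch below uses a plain `ThreadLocal<String>` as a stand-in for that thread-bound state. `LostContextDemo`, `CURRENT_TRACE`, and `traceSeenOnHandlerThread` are hypothetical names used only for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LostContextDemo {

    // Stand-in for the thread-bound "current observation"; this is NOT Micrometer's API.
    static final ThreadLocal<String> CURRENT_TRACE = new ThreadLocal<>();

    // Returns the trace ID visible on the handler thread, or null if none is bound there.
    static String traceSeenOnHandlerThread() {
        ExecutorService handlerThread = Executors.newSingleThreadExecutor();
        try {
            // The listener thread has an open scope for the original trace.
            CURRENT_TRACE.set("trace-1");
            // The failure handler runs on another thread and looks up the "current" trace.
            return CompletableFuture.supplyAsync(CURRENT_TRACE::get, handlerThread).join();
        } finally {
            CURRENT_TRACE.remove();
            handlerThread.shutdown();
        }
    }

    public static void main(String[] args) {
        // Prints "handler sees: null"
        System.out.println("handler sees: " + traceSeenOnHandlerThread());
    }
}
```

With no trace visible on the handler thread, anything published from there (a retry or DLT record) would start a fresh trace, which matches the trace 2 / trace 3 behavior described above.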

Fix

Ensure that the observation context is correctly propagated when handling async failures. This preserves the trace ID across retry and DLT flows.
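
A minimal sketch of the idea, under the same assumptions as any stand-alone example: the context is captured together with the failed record on the thread where the scope is still open, and re-opened around the error handler on the thread that processes the failure. The names `CapturedContextDemo`, `FailedRecord`, `handleInScope`, and `errorHandler` are hypothetical; in the real code the re-opening step would correspond to something like Micrometer's `Observation.scoped(Runnable)`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CapturedContextDemo {

    // Stand-in for the thread-bound "current observation"; this is NOT Micrometer's API.
    static final ThreadLocal<String> CURRENT_TRACE = new ThreadLocal<>();

    // Hypothetical tuple: the failed record plus the trace captured alongside it.
    static final class FailedRecord {
        final String payload;
        final RuntimeException error;
        final String trace;

        FailedRecord(String payload, RuntimeException error, String trace) {
            this.payload = payload;
            this.error = error;
            this.trace = trace;
        }
    }

    // Hypothetical error handler: tags the retry with whatever trace is current.
    static String errorHandler(FailedRecord failed) {
        return "retry[" + failed.payload + "] trace=" + CURRENT_TRACE.get();
    }

    // Re-opens the captured trace around the handler, then restores the previous state.
    static String handleInScope(FailedRecord failed) {
        String previous = CURRENT_TRACE.get();
        CURRENT_TRACE.set(failed.trace);
        try {
            return errorHandler(failed);
        } finally {
            if (previous == null) {
                CURRENT_TRACE.remove();
            } else {
                CURRENT_TRACE.set(previous);
            }
        }
    }

    static String run() {
        ExecutorService handlerThread = Executors.newSingleThreadExecutor();
        try {
            CURRENT_TRACE.set("trace-1");
            // Capture the trace here, while the listener's scope is still open.
            FailedRecord failed = new FailedRecord(
                    "order-42", new IllegalStateException("listener failed"), CURRENT_TRACE.get());
            // The failure is handled on a different thread, inside the captured scope.
            return CompletableFuture.supplyAsync(() -> handleInScope(failed), handlerThread).join();
        } finally {
            CURRENT_TRACE.remove();
            handlerThread.shutdown();
        }
    }

    public static void main(String[] args) {
        // Prints "retry[order-42] trace=trace-1"
        System.out.println(run());
    }
}
```

The important design point is where the capture happens: it has to be done while the original scope is open, not at the moment the failure callback fires.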

🔧 Tested using version 3.3.6 so I could build and validate the JAR in a real-world project.

igormq changed the base branch from main to 3.3.x on June 7, 2025 09:38
acknowledge(acknowledgment);
if (canAsyncRetry(request, ex) && this.asyncRetryCallback != null) {
    @SuppressWarnings("unchecked")
    ConsumerRecord<K, V> record = (ConsumerRecord<K, V>) request;
    this.asyncRetryCallback.accept(record, (RuntimeException) ex);
} else {
Author

It is annoying to receive this error when we can recover from it.

Member

Not sure what your comment means, but that is indeed not our code style.
See the formatting config provided for IntelliJ IDEA: https://github.com/spring-projects/spring-kafka/blob/main/src/idea/spring-framework.xml

Author

I only meant that we receive these errors in the log even though we can still recover, so I moved the log statement here!

Member

The else statement has to start on a new line.

artembilan (Member) left a comment

I'd like to see the fix issued against main.
And please follow the DCO requirements.

igormq force-pushed the propagate-scope-in-async-failures branch from bfb8f6d to b4be8a3 on June 10, 2025 10:56
igormq changed the base branch from 3.3.x to main on June 10, 2025 10:57
igormq force-pushed the propagate-scope-in-async-failures branch from b4be8a3 to 73aeaaf on June 10, 2025 10:59
igormq (Author) commented on Jun 10, 2025

"I'd like to see the fix issued against main. And please follow the DCO requirements."

done!

igormq requested a review from artembilan on June 10, 2025 10:59
// copyFailedRecord.observation.scoped(() -> invokeErrorHandlerBySingleRecord(copyFailedRecord));
// } else {
invokeErrorHandlerBySingleRecord(copyFailedRecord);
// }
Member

What happened to the scoped() wrapper?
Why is all of this commented out?
And what is the point of the Observation property in FailedRecordTuple if it is unused?

Author

Sorry, I submitted code from when I was trying to fix the test. Sorry, @artembilan.


@@ -72,6 +86,9 @@
import org.springframework.kafka.KafkaException;
import org.springframework.kafka.annotation.EnableKafka;
import org.springframework.kafka.annotation.KafkaListener;

Member

I don't think all of these blank lines in the imports are correct.
Please run gradlew check and fix all the Checkstyle violations before pushing to the PR.

@@ -3433,7 +3438,7 @@ private Collection<ConsumerRecord<K, V>> getHighestOffsetRecords(ConsumerRecords
     }

     private void callbackForAsyncFailure(ConsumerRecord<K, V> cRecord, RuntimeException ex) {
-        this.failedRecords.addLast(new FailedRecordTuple<>(cRecord, ex));
+        this.failedRecords.addLast(new FailedRecordTuple<>(cRecord, ex, this.observationRegistry.getCurrentObservation()));
Member

This is not correct.
this.observationRegistry.getCurrentObservation() is called when this whole method is invoked from completableFutureResult.whenComplete(), which in turn might run on a totally different thread.
So there is no guarantee that an observation is present there.

I'm not sure what the real problem is, so I cannot suggest how to propagate the observation down to the exception handler for async callbacks.

igormq (Author), Jun 11, 2025

Hey @artembilan, I realised that while running the tests.

I refactored everything; I would really appreciate it if you could have another look. I will update the PR description to illustrate the problem clearly. Also, the tests I added make the expectations pretty clear!

igormq force-pushed the propagate-scope-in-async-failures branch 3 times, most recently from afc0a09 to 00190ef, on June 11, 2025 10:32
Signed-off-by: Igor Macedo Quintanilha <igor.quintanilha@teya.com>
igormq force-pushed the propagate-scope-in-async-failures branch from 00190ef to 2fea4fd on June 11, 2025 10:51