[FLINK-38180][task] Clean up task after switching to FAILED #26861

pnowojski · 2025-08-01T09:03:52Z

What is the purpose of the change

This prevents a race condition in which an exception thrown during clean up
hides the real exception when `failExternally` is called from the clean up.

Verifying this change

Added a unit test that covers the bug.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

pnowojski · 2025-08-01T09:05:08Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

@@ -827,6 +829,8 @@ else if (transitionState(current, ExecutionState.FAILED, t)) {
                    }
                    // else fall through the loop and
                }
+
+                cleanUpRegistry.close();


NOTE! Take a look at the try/catch below. There is a change in behaviour: any exception from the cleanup is no longer suppressed and is instead treated as a fatal error (I think correctly).
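To illustrate the ordering this change relies on, here is a minimal sketch, not the actual Task code. The class `FailThenCleanUpSketch` and its methods are hypothetical stand-ins, and the state machine is reduced to a single "first failure cause" slot. It assumes that the first recorded failure wins, so recording the real cause before running the clean up keeps a later `failExternally` call from the clean up from replacing it, while an exception thrown by the clean up itself propagates to the caller.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified stand-in for a task; this is not Flink's Task class. */
class FailThenCleanUpSketch {
    // Holds the first failure cause; null means the task has not failed.
    private final AtomicReference<Throwable> failureCause = new AtomicReference<>();

    /** Records the cause only if no failure was recorded before. */
    boolean failExternally(Throwable cause) {
        return failureCause.compareAndSet(null, cause);
    }

    Throwable getFailureCause() {
        return failureCause.get();
    }

    /** Runs the body; on failure, records the cause first and only then cleans up. */
    void run(Runnable body, Runnable cleanUp) {
        try {
            body.run();
        } catch (Throwable t) {
            // Record the real cause before any clean-up code gets a chance to fail.
            failExternally(t);
            // The clean up may call failExternally or throw; a throw is not suppressed here.
            cleanUp.run();
        }
    }
}
```

With this ordering, a clean up that calls `failExternally` with its own exception leaves the original cause in place, and a clean up that throws surfaces its exception to whoever called `run`.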

flinkbot · 2025-08-01T09:06:39Z

CI report:

eb8584f Azure: SUCCESS

rkhachatryan

LGTM, thanks for the fix

Savonitar

Hi, thanks for the PR!

Savonitar · 2025-08-01T10:01:41Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

@@ -612,6 +613,7 @@ private void doRun() {
        // need to be undone in the end
        Map<String, Future<Path>> distributedCacheEntries = new HashMap<>();
        TaskInvokable invokable = null;
+        AutoCloseableRegistry cleanUpRegistry = new AutoCloseableRegistry();


I notice the registry is created for every task run, but it's only actually used in the failure case.
What do you think about adding a comment like

// Registry for actions that should be run if the task fails

That would help future readers understand why it's unused on the success path.

I was actually hoping for it to be more generic. If someone needs to defer some action, this registry could be used.

I could maybe rephrase this comment to:

// Registry for actions that should be run after the task has already failed

?

Yes, sure. Sounds good.
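As a rough illustration of the "generic deferred action" idea discussed in this thread, the sketch below collects actions and runs them later in registration order. `DeferredActionsSketch`, `register` and `runAll` are hypothetical names made up for this example; this is not Flink's AutoCloseableRegistry and makes no claim about its API or behaviour.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for deferred actions; a stand-in, not Flink's AutoCloseableRegistry. */
class DeferredActionsSketch {
    private final List<Runnable> actions = new ArrayList<>();

    /** Remembers an action to be run later. */
    void register(Runnable action) {
        actions.add(action);
    }

    /**
     * Runs every registered action in registration order. All actions get a chance to run;
     * the first failure is rethrown at the end, with later failures attached as suppressed.
     */
    void runAll() {
        RuntimeException first = null;
        for (Runnable action : actions) {
            try {
                action.run();
            } catch (RuntimeException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        actions.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

On the success path nothing calls `runAll()`, so the registered actions simply never run; on the failure path the caller invokes it once the task has already switched to its failed state.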

Savonitar

Thanks for the fix.
LGTM

pnowojski · 2025-08-01T14:08:52Z

e2e test was failing due to:

java.lang.NullPointerException: Cannot invoke "org.apache.flink.table.runtime.util.collections.binary.AbstractBytesHashMap.free()" because "this.aggregateMap$8" is null
        at LocalHashAggregateWithKeys$133.close(Unknown Source) ~[?:?]
        at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:163) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:125) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1197) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:257) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:1101) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$0(Task.java:950) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:965) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$1(Task.java:950) [flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:257) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127) ~[flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:833) [flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:569) [flink-dist-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at java.base/java.lang.Thread.run(Thread.java:840) [?:?]

I think the problem was that I changed the order of the clean up calls. I'm trying to fix it with the fixup commit.
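The trace above shows a `close()` dereferencing a field that is still null, which can happen when close runs before open has completed. A minimal sketch of a close that tolerates this is below; `GuardedCloseSketch` and `Buffer` are hypothetical names for illustration and do not correspond to the generated operator code in the trace.

```java
/** Hypothetical operator-like class whose close() is safe even if open() never ran. */
class GuardedCloseSketch {
    /** Stand-in for a resource that must be freed, such as a hash map backed by memory segments. */
    static class Buffer {
        boolean freed;

        void free() {
            freed = true;
        }
    }

    // Assigned in open(), so it is null when the task is cancelled before opening.
    private Buffer buffer;

    void open() {
        buffer = new Buffer();
    }

    boolean isOpen() {
        return buffer != null;
    }

    /** Frees the buffer if it exists; calling this before open() or twice is a no-op. */
    void close() {
        if (buffer != null) {
            buffer.free();
            buffer = null;
        }
    }
}
```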

davidradl · 2025-08-13T13:46:14Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

@@ -885,6 +850,53 @@ else if (transitionState(current, ExecutionState.FAILED, t)) {
        }
    }

+    /**
+     * Transition into our final state in case of failure. We should be either in DEPLOYING,
+     * INITIALIZING, RUNNING, CANCELING, or FAILED loop for multiple retries during concurrent state


nit: The sentence does not read well. I suggest

full stop after FAILED

then Loop to asynchronously clean up via calls to cancel() or to failExternally()

I think this is not really better. I've rewritten the original sentence to:

Loop for multiple retries in case of concurrent state changes via calls to cancel() or to failExternally()

davidradl · 2025-08-13T13:47:35Z

flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java

+    private void transitionStateOnFailure(
+            Throwable t, AutoCloseableRegistry postFailureCleanUpRegistry) throws IOException {
+        while (true) {
+            ExecutionState current = this.executionState;


I suggest adding a yield() call in the loop to prevent this thread from hogging the CPU in a tight loop until done.

I would prefer to avoid touching this, as this is pre-existing logic that I'm just extracting to a separate method, and I would like to minimise changes for this already pretty fragile bug fix (I've been struggling to fix all of the failing e2e/itcases for quite some time).

This can't loop more than once or maybe a couple of times. A task can't keep changing its state for a long period of time. At worst this will just make two or three iterations.
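For readers unfamiliar with this kind of loop, here is a simplified sketch of a compare-and-set retry loop, assuming the state is held in an `AtomicReference`. `TransitionLoopSketch` and its reduced state set are hypothetical and deliberately much smaller than the real Task state machine. The loop only repeats when a concurrent change wins the compare-and-set, and each such change moves the state closer to a terminal one, which is why only a few iterations are expected.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, reduced state machine; not the actual Task implementation. */
class TransitionLoopSketch {
    enum State { RUNNING, CANCELING, CANCELED, FAILED }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

    /** Simulates a concurrent cancel() request. */
    void cancel() {
        state.compareAndSet(State.RUNNING, State.CANCELING);
    }

    /** Moves to a terminal state after a failure, retrying if the state changed concurrently. */
    State transitionOnFailure() {
        while (true) {
            State current = state.get();
            if (current == State.FAILED || current == State.CANCELED) {
                // Already terminal, nothing left to do.
                return current;
            }
            State target = (current == State.CANCELING) ? State.CANCELED : State.FAILED;
            if (state.compareAndSet(current, target)) {
                return target;
            }
            // The compare-and-set lost against a concurrent change; re-read and retry.
        }
    }
}
```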

davidradl · 2025-08-13T13:52:34Z

flink-runtime/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java

@@ -1072,7 +1072,8 @@ public final void cleanUp(Throwable throwable) throws Exception {
        LOG.debug(
                "Cleanup StreamTask (operators closed: {}, cancelled: {})",
                closedOperators,
-                canceled);
+                canceled,
+                throwable);


Does the throwable information come out in the debug output?

I see the Logger.debug interface is

    public void debug(String format, Object... arguments);

    /**
     * Log an exception (throwable) at the DEBUG level with an
     * accompanying message.
     *
     * @param msg the message accompanying the exception
     * @param t the exception (throwable) to log
     */
    public void debug(String msg, Throwable t);

It looks like we should pass a resolved string and the throwable, or have the Throwable as an insert in the message.

Thanks fixed, good catch!

pnowojski · 2025-08-18T16:09:49Z

@flinkbot run azure

rkhachatryan approved these changes Aug 1, 2025

Savonitar reviewed Aug 1, 2025

pnowojski force-pushed the f38180 branch from 6dfc986 to b2ecc39 on August 1, 2025 13:36

Savonitar approved these changes Aug 1, 2025

pnowojski force-pushed the f38180 branch from b2ecc39 to 9cb1692 on August 1, 2025 14:07

pnowojski force-pushed the f38180 branch from 9cb1692 to 1155d27 on August 1, 2025 14:15

pnowojski force-pushed the f38180 branch from 7ba00b7 to cac3465 on August 13, 2025 12:57

davidradl reviewed Aug 13, 2025


pnowojski added 9 commits August 18, 2025 11:17

[hotfix] Improve logging in StreamTask#cleanUp

fc7e5e9

[hotfix] Better error message in TaskTest

b2f4d2e

[hotfix] Extract Task#transitionStateOnFailure

5c5fff9

[FLINK-38180][test] Rewrite testWithRocksDbBackendIncremental to not …

bf474be

…depend on close not being called on failures

[FLINK-38180][test] Rewrite ValidatingSink to not throw exception dur…

d2c979b

…ing closing Exceptions thrown during Close can prevent resources from being cleaned up and can cause TaskManager to exit with fatal error.

[FLINK-38180][table] Safeguard bunch of operators and functions from …

d39f11d

…throwing NPE on close Before, NPE could have been thrown if operator was closed before properly opening it, for example during task cancelation.

[FLINK-38180][task] Allow AutoCloseableRegistry to close in the regis…

ee41f9a

…tration order

[FLINK-38180][task] Clean up task after switching to FAILED

e0009bb

This prevents a race condition of some exception from clean up hiding the real exception if `failExternally` is used in the clean up

[hotfix] Improve comment about Task's state transitions

eb8584f

pnowojski force-pushed the f38180 branch from cac3465 to eb8584f on August 18, 2025 09:17
