-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-38180][task] Clean up task after switching to FAILED #26861
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
base: master
Are you sure you want to change the base?
Conversation
@@ -827,6 +829,8 @@ else if (transitionState(current, ExecutionState.FAILED, t)) { | |||
} | |||
// else fall through the loop and | |||
} | |||
|
|||
cleanUpRegistry.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE! Take a look at the try/catch below. There is a change in behaviour. Now any exception from the cleanup will not be suppressed any more but will be treated as fatal error (I think correctly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the PR!
@@ -612,6 +613,7 @@ private void doRun() { | |||
// need to be undone in the end | |||
Map<String, Future<Path>> distributedCacheEntries = new HashMap<>(); | |||
TaskInvokable invokable = null; | |||
AutoCloseableRegistry cleanUpRegistry = new AutoCloseableRegistry(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice the registry is created for every task run, but it's only actually used in the failure case.
What do you think about adding a comment like
// Registry for actions that should be run if the task fails
That would help future readers understand why it's unused on the success path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually hopping for it to be more generic. If someone needs to defer some action, this registry could be used.
I could maybe rephrase this comment to:
// Registry for actions that should be run after the task has already failed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sure. Sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix.
LGTM
e2e test was failing due to:
I think the problem was that I changed order of clean up calls. I'm trying to fix it with the fixup commit. |
flink-core/src/main/java/org/apache/flink/core/fs/AutoCloseableRegistry.java
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java
Outdated
Show resolved
Hide resolved
@@ -885,6 +850,53 @@ else if (transitionState(current, ExecutionState.FAILED, t)) { | |||
} | |||
} | |||
|
|||
/** | |||
* Transition into our final state in case of failure. We should be either in DEPLOYING, | |||
* INITIALIZING, RUNNING, CANCELING, or FAILED loop for multiple retries during concurrent state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: The sentence does not read well. I suggest
- full stop after FAILED
- then
Loop to asynchronously clean up via calls to cancel() or to failExternally()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then Loop to asynchronously clean up via calls to cancel() or to failExternally()
I think this is not really better. I've rewritten the original sentence to:
Loop for multiple retries in case of concurrent state changes via calls to cancel() or to failExternally()
flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java
Outdated
Show resolved
Hide resolved
private void transitionStateOnFailure( | ||
Throwable t, AutoCloseableRegistry postFailureCleanUpRegistry) throws IOException { | ||
while (true) { | ||
ExecutionState current = this.executionState; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest adding a yield()
call in the loop to prevent this thread hogging the cpu in a tight loop until done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I would prefer avoid touching this, as this is pre-existing logic that I'm just extracting to a separate method, and I would like to minimise changes for this already pretty fragile bug fix (I've been struggling to fix all of the failing e2e/itcases for quite some time).
- This can't loop for more then one or maybe a couple of times. Task can't keep changing it's state for a long period of time. At worst this will just make two or three iterations.
@@ -1072,7 +1072,8 @@ public final void cleanUp(Throwable throwable) throws Exception { | |||
LOG.debug( | |||
"Cleanup StreamTask (operators closed: {}, cancelled: {})", | |||
closedOperators, | |||
canceled); | |||
canceled, | |||
throwable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the throwable information come out in the dbeug?
I see the Logger.debug interface is
public void debug(String format, Object... arguments);
/**
* Log an exception (throwable) at the DEBUG level with an
* accompanying message.
*
* @param msg the message accompanying the exception
* @param t the exception (throwable) to log
*/
public void debug(String msg, Throwable t);
It looks like we should pass a resolved string and the throwable or have the Throwable as an insert in the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks fixed, good catch!
…depend on close not being called on failures
…ing closing Exceptions thrown during Close can prevent resources from being cleaned up and can cause TaskManager to exit with fatal error.
…throwing NPE on close Before, NPE could have been thrown if operator was closed before properly opening it, for example during task cancelation.
This prevents a race condition of some exception from clean up hiding the real exception if `failExternally` is used in the clean up
@flinkbot run azure |
What is the purpose of the change
Verifying this change
Added unit test to cover for a bug.
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (yes / no)Documentation