[v2-10-test] Re-queue tasks when they are stuck in queued (#43520) #44158

jscheffl · 2024-11-18T19:55:07Z

Backport of #43520.
Note: The cherry-pick omits the K8s provider files, as these are always taken from main during testing and release.

The old "stuck in queued" logic just failed the tasks. Now we requeue them instead, by revoking the task from the executor and setting its state to scheduled. A task is requeued up to 2 times; the number of attempts is configurable through a hidden config option.
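The requeue behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `State`, `TaskInstance`, `FakeExecutor`, `handle_stuck_task`, and the `max_requeues` parameter are hypothetical names, and only `revoke_task` is taken from the description.

```python
from dataclasses import dataclass
from enum import Enum


class State(str, Enum):
    QUEUED = "queued"
    SCHEDULED = "scheduled"
    FAILED = "failed"


@dataclass
class TaskInstance:
    # Hypothetical stand-in for a task instance; not Airflow's model.
    key: str
    state: State = State.QUEUED
    requeue_attempts: int = 0


class FakeExecutor:
    """Hypothetical executor that only records which tasks were revoked."""

    def __init__(self) -> None:
        self.revoked = []

    def revoke_task(self, ti: TaskInstance) -> None:
        self.revoked.append(ti.key)


def handle_stuck_task(ti: TaskInstance, executor: FakeExecutor, max_requeues: int = 2) -> State:
    """Revoke a stuck task, then requeue it, or fail it once the limit is used up."""
    executor.revoke_task(ti)
    if ti.requeue_attempts < max_requeues:
        ti.requeue_attempts += 1
        ti.state = State.SCHEDULED  # the scheduler will pick it up and queue it again
    else:
        ti.state = State.FAILED  # requeue limit reached (assumed fallback)
    return ti.state
```

With `max_requeues=2`, the first two calls for a given task move it back to scheduled and the third call fails it.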

We added a method, revoke_task, to the base executor because it is a discrete operation that this feature requires, and it might be useful in other cases, e.g. when detecting zombies. We set the state to failed or scheduled directly from the scheduler (rather than sending it through the event buffer) because the event buffer makes more sense for handling external events: why round-trip through the executor and back to the scheduler when the scheduler is initiating the action? This also avoids having to deal with "state mismatch" issues when processing events.
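To illustrate the design choice, here is a hedged sketch of a base-class revoke hook combined with a state change made directly by the scheduler. Everything except the method name `revoke_task` is an assumption for illustration (`SketchBaseExecutor`, `InMemoryExecutor`, `requeue_from_scheduler`, and the plain-dict state store are hypothetical), and it does not reflect Airflow's actual executor interface.

```python
class SketchBaseExecutor:
    """Hypothetical base executor; only the name `revoke_task` comes from the description."""

    def __init__(self) -> None:
        self.queued = []        # task keys waiting to run
        self.event_buffer = {}  # reserved for events reported by the executor

    def queue_task(self, key: str) -> None:
        self.queued.append(key)

    def revoke_task(self, key: str) -> None:
        # Subclasses decide how to withdraw a task from their backend.
        raise NotImplementedError


class InMemoryExecutor(SketchBaseExecutor):
    def revoke_task(self, key: str) -> None:
        # Drop the task from the queue; nothing is written to the event buffer.
        if key in self.queued:
            self.queued.remove(key)


def requeue_from_scheduler(states: dict, key: str, executor: SketchBaseExecutor) -> None:
    """Scheduler-initiated action: revoke, then set the new state directly."""
    executor.revoke_task(key)
    states[key] = "scheduled"  # no round trip through executor.event_buffer
```

Because the scheduler writes the state itself, the event buffer stays empty for this action, so there is no executor event to reconcile against the task's current state.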

(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

dstandish · 2024-11-18T21:43:03Z

might need this as well @jscheffl #44093

jscheffl · 2024-11-18T21:46:34Z

> might need this as well @jscheffl #44093

Yeeah, figured out the same commit right at the same time :-D Added to the PR!

…44158)

* [v2-10-test] Re-queue tasks when they are stuck in queued (#43520)
* fix test_handle_stuck_queued_tasks_multiple_attempts (#44093)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: GPK <gopidesupavan@gmail.com>

jscheffl added this to the Airflow 2.10.4 milestone Nov 18, 2024

jscheffl requested review from kaxil, ashb, XD-DENG, o-nikolas, pierrejeambrun and hussein-awala as code owners November 18, 2024 19:55

boring-cyborg bot added area:Executors-core LocalExecutor & SequentialExecutor area:Scheduler including HA (high availability) scheduler kind:documentation labels Nov 18, 2024

jscheffl requested review from potiuk, dstandish and jedcunningham November 18, 2024 19:56

jscheffl added the type:bug-fix Changelog: Bug Fixes label Nov 18, 2024

jedcunningham approved these changes Nov 18, 2024

fix test_handle_stuck_queued_tasks_multiple_attempts (apache#44093)

dd43b10

jscheffl merged commit 341d36d into apache:v2-10-test Nov 19, 2024
48 checks passed

utkarsharma2 mentioned this pull request Dec 10, 2024

Status of testing of Apache Airflow 2.10.4rc1 #44811

Closed

33 tasks

eladkal mentioned this pull request Dec 22, 2024

"Task stuck in queued" should not count against retries #38304

Closed

2 tasks
