[v2-10-test] Re-queue tasks when they are stuck in queued (#43520) #44158

Merged: 2 commits into apache:v2-10-test on Nov 19, 2024

jscheffl
Contributor

Backport of #43520.
Note: The cherry-pick excludes the K8s provider files, as these are always taken from main during test and release.

The old "stuck in queued" logic just failed the tasks. Now we requeue them: we revoke the task from the executor and set its state to scheduled. A task is requeued up to 2 times; the number of times is configurable through a hidden config.

We added a revoke_task method to the base executor because it's a discrete operation that this feature requires, and it might be useful in other cases, e.g. when detecting zombies. We set the state to failed or scheduled directly from the scheduler (rather than sending it through the event buffer) because the event buffer makes more sense for handling external events. Why round-trip through the executor and back to the scheduler when the scheduler is initiating the action? This also avoids having to deal with "state mismatch" issues when processing events.
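
The flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `ToyExecutor`, `TaskInstance`, the `requeue_attempts` counter, `handle_stuck_in_queued`, and the `max_requeues` parameter are hypothetical names made up for the example. Only the `revoke_task` name and the default of 2 requeues come from the description, and the signature shown here is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(str, Enum):
    QUEUED = "queued"
    SCHEDULED = "scheduled"
    FAILED = "failed"


@dataclass
class TaskInstance:
    """Hypothetical stand-in for a task instance, not Airflow's model."""
    key: str
    state: State = State.QUEUED
    requeue_attempts: int = 0  # assumed counter, for illustration only


@dataclass
class ToyExecutor:
    """Hypothetical executor that only tracks which task keys it holds."""
    queued: set = field(default_factory=set)

    def revoke_task(self, *, ti: TaskInstance) -> None:
        # Discrete operation: make the executor forget about the task.
        self.queued.discard(ti.key)


def handle_stuck_in_queued(
    ti: TaskInstance, executor: ToyExecutor, max_requeues: int = 2
) -> State:
    """Requeue a stuck task up to max_requeues times, then fail it."""
    executor.revoke_task(ti=ti)
    if ti.requeue_attempts < max_requeues:
        ti.requeue_attempts += 1
        # The scheduler sets the state directly, with no event-buffer round trip.
        ti.state = State.SCHEDULED
    else:
        ti.state = State.FAILED
    return ti.state
```

With the default of 2, a task that keeps getting stuck is set back to scheduled on the first two calls and marked failed on the third.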


(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
jscheffl added this to the Airflow 2.10.4 milestone on Nov 18, 2024
boring-cyborg bot added the area:Executors-core (LocalExecutor & SequentialExecutor), area:Scheduler (including HA (high availability) scheduler), and kind:documentation labels on Nov 18, 2024
jscheffl added the type:bug-fix (Changelog: Bug Fixes) label on Nov 18, 2024
@dstandish
Contributor

might need this as well @jscheffl #44093

@jscheffl
Contributor Author

might need this as well @jscheffl #44093

Yeeah, figured out the same commit right at the same time :-D Added to the PR!

jscheffl merged commit 341d36d into apache:v2-10-test on Nov 19, 2024
48 checks passed
utkarsharma2 pushed a commit that referenced this pull request Dec 4, 2024
…44158)

* [v2-10-test] Re-queue tasks when they are stuck in queued (#43520)

* fix test_handle_stuck_queued_tasks_multiple_attempts (#44093)

---------

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: GPK <gopidesupavan@gmail.com>
utkarsharma2 pushed a commit that referenced this pull request Dec 9, 2024
…44158)
