This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

WAIT Task inside DO_WHILE causing infinite task creation which are already completed #3876

Open
appunni-old opened this issue Dec 2, 2023 · 4 comments
Labels
type: bug bugs/ bug fixes

Comments

@appunni-old

Describe the bug
Running the workflow below sends it into an infinite loop: the WAIT task inside the DO_WHILE keeps being created again and again.

Details
Conductor version: 3.15.0
Persistence implementation: Postgres and MySQL
Queue implementation: MySQL and Postgres
Lock: Redis
Workflow definition:

{
  "createTime": 1701489520469,
  "createdBy": "owner@email.com",
  "updatedBy": "owner@email.com",
  "accessPolicy": {},
  "name": "test_do_while",
  "description": "Workflow details",
  "version": 1,
  "tasks": [
    {
      "name": "default__do_while",
      "taskReferenceName": "task_1__loop_databricks",
      "inputParameters": {},
      "type": "DO_WHILE",
      "startDelay": 0,
      "optional": false,
      "asyncComplete": false,
      "loopCondition": "if ($.task_1__loop_databricks['iteration'] < 200) { true; } else { false; }",
      "loopOver": [
        {
          "name": "default__sleep",
          "taskReferenceName": "task_1__wait_databricks",
          "inputParameters": {
            "duration": "20 seconds",
            "tenantId": "csit"
          },
          "type": "WAIT",
          "startDelay": 0,
          "optional": false,
          "asyncComplete": false
        }
      ]
    }
  ],
  "inputParameters": [],
  "outputParameters": {},
  "schemaVersion": 2,
  "restartable": true,
  "workflowStatusListenerEnabled": false,
  "ownerEmail": "owner@email.com",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0,
  "variables": {},
  "inputTemplate": {}
}

Error in conductor server

conductor-server          | 2023-12-02 04:56:48.224 ERROR 13 --- [m-task-worker-8] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 94c1d30a-aef6-4861-be18-4fbfcd03743c could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.227 ERROR 13 --- [-task-worker-11] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 8c683b28-0c10-42dd-894b-2aebead3e3e8 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.234 ERROR 13 --- [m-task-worker-9] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 976eb165-97af-4451-800b-b506341bd938 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.250 ERROR 13 --- [-task-worker-10] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 86de75ad-195b-4d18-86f6-7b1280702751 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.293 ERROR 13 --- [-task-worker-12] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 822de4fa-2291-4516-82b9-bd6ef7f8b0ac could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.409 ERROR 13 --- [-task-worker-13] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 7ed85099-e58f-45e1-845a-c44e141113e5 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.431 ERROR 13 --- [-task-worker-14] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 784e801d-fcb6-488a-b575-6d476c86a6aa could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.466 ERROR 13 --- [-task-worker-15] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 75e2b4ac-225e-447b-b8ff-b9ab5164c642 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.744 ERROR 13 --- [-task-worker-16] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: a2a4b709-7c51-4b24-a26f-f49cffbcf877 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.880 ERROR 13 --- [-task-worker-17] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 73526e0e-0ede-4bda-8e4a-9d77d250e947 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.103 ERROR 13 --- [-task-worker-18] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: a47dded2-3d7c-4d25-82eb-021cdd19f288 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.120 ERROR 13 --- [-task-worker-20] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: a725f7d5-da1c-4d42-a453-0550592f8b06 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.121 ERROR 13 --- [-task-worker-21] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: ab0319c0-cb0f-49f5-8a7e-99c04fef1809 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.123 ERROR 13 --- [-task-worker-22] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: abdb849e-6fd9-43cd-b924-5415a933e6bb could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.189 ERROR 13 --- [-task-worker-23] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: b2a69eb2-bea1-4764-b986-ad75bb82e9dc could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.191 ERROR 13 --- [-task-worker-24] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: c9410bb4-d96e-4ca9-a6be-bd45e6f0ea53 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.192 ERROR 13 --- [m-task-worker-1] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: b3bfbe2f-9c32-41aa-89aa-08bed04c47ce could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.194 ERROR 13 --- [m-task-worker-2] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: b4b10f44-1530-4a93-971d-6025762d837b could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.196 ERROR 13 --- [m-task-worker-3] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: c5496ec6-2730-43eb-a265-86b06ae35807 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.197 ERROR 13 --- [-task-worker-23] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: c3369393-abb7-4c1f-907d-4d02eda5e9a4 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.207 ERROR 13 --- [m-task-worker-5] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: c26ffe62-4c7f-4d64-a1d5-ef203afc4272 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.210 ERROR 13 --- [m-task-worker-6] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: b6b9aef8-e941-48e2-a6

To Reproduce
Go to the UI at http://localhost:5000
Create the workflow definition above
Go to the workbench
Trigger this workflow
WARNING: This creates an infinite loop. Only try it on a local Conductor setup that can be deleted afterwards.

Expected behavior
The loop runs and waits 20 seconds between iterations.

Screenshots
The workflow is stuck and does not move forward.


@appunni-old appunni-old added the type: bug bugs/ bug fixes label Dec 2, 2023
@appunni-old

I was not able to reproduce this on the Orkes platform.

@appunni-old

I debugged it by stepping through line by line. The first log lines are attached as well:

595060 [sweeper-thread-24] INFO  com.netflix.conductor.core.reconciliation.WorkflowRepairService [] - Task 46abe269-5daf-403a-9b15-cbd7878b8bed in workflow 7d137e5b-304e-449c-9607-6413bfee8fd0 re-queued for repairs
667288 [HikariPool-1 housekeeper] WARN  com.zaxxer.hikari.pool.HikariPool [] - HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m16s793ms).
686827 [system-task-worker-2] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 1445ba4c-0bd5-4826-a359-984fd4da86a5 could not be found while executing WAIT
692015 [system-task-worker-3] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 05cbf978-86e3-48ef-b5cf-52b481edd5f5 could not be found while executing WAIT
699409 [system-task-worker-4] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 95ffee82-0cc9-468a-8ce8-af7b1d8438c1 could not be found while executing WAIT
700895 [system-task-worker-5] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: d7f9d0a7-3525-4eff-a07a-179bc57ab349 could not be found while executing WAIT
701862 [system-task-worker-7] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 75811fa6-ec79-40c3-9136-88b33a3a53f3 could not be found while executing WAIT
702397 [system-task-worker-6] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 80b2e11a-4f28-4f28-8737-26d1d7abd010 could not be found while executing WAIT
702762 [system-task-worker-9] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 58055094-5e0d-4613-beb6-078f940994fa could not be found while executing WAIT

Sorry, correction: this is broken there too. I ran it on the Orkes platform and it went into the same loop. I regret it now; I should have been more careful. Can someone help?

@appunni-old

appunni-old commented Dec 3, 2023

I definitely think it has something to do with the config, because I created the same workflow via the UI and it worked completely fine. On the Orkes default cluster the task limit was 1000, but this workflow created 7552 tasks. I terminated the workflow; otherwise it would have kept running.

@appunni-old

appunni-old commented Dec 3, 2023

Issue identified: this happens when a task reference name contains a double underscore. In that case the condition in the loop below evaluates to false. We should validate, either on the workflow definition or in the Start Workflow API, that task reference names do not contain a double underscore.

        for (TaskModel t : workflow.getTasks()) {
            if (doWhileTaskModel
                            .getWorkflowTask()
                            .has(TaskUtils.removeIterationFromTaskRefName(t.getReferenceTaskName()))
                    && !doWhileTaskModel.getReferenceTaskName().equals(t.getReferenceTaskName())
                    && doWhileTaskModel.getIteration() == t.getIteration()) {
                relevantTask = relevantTasks.get(t.getReferenceTaskName());
                if (relevantTask == null || t.getRetryCount() > relevantTask.getRetryCount()) {
                    relevantTasks.put(t.getReferenceTaskName(), t);
                }
            }
        }

TaskUtils.removeIterationFromTaskRefName(t.getReferenceTaskName())

is the culprit, because it tries to recover the original task reference name by splitting on the delimiter, i.e. "__", and keeping only the first token.

    public static String removeIterationFromTaskRefName(String referenceTaskName) {
        String[] tokens = referenceTaskName.split(TaskUtils.LOOP_TASK_DELIMITER);
        return tokens.length > 0 ? tokens[0] : referenceTaskName;
    }

This leads to an infinite loop that keeps creating new tasks.
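To make the failure concrete, here is a small standalone sketch that copies the splitting logic quoted above and runs it on the reference name from this workflow. The class name and the `__1` iteration suffix format are assumptions for illustration; only the delimiter value "__" and the method body come from this issue.

```java
final class RefNameSplitDemo {
    // Assumed value, as stated in this issue: the loop delimiter is "__".
    static final String LOOP_TASK_DELIMITER = "__";

    // Same logic as the method quoted above: keep only the first token.
    static String removeIterationFromTaskRefName(String referenceTaskName) {
        String[] tokens = referenceTaskName.split(LOOP_TASK_DELIMITER);
        return tokens.length > 0 ? tokens[0] : referenceTaskName;
    }

    public static void main(String[] args) {
        String defined = "task_1__wait_databricks";
        // Assumption: the iteration is appended as "<refName>__<iteration>".
        String withIteration = defined + LOOP_TASK_DELIMITER + 1;

        String stripped = removeIterationFromTaskRefName(withIteration);
        System.out.println(stripped);                 // prints "task_1"
        System.out.println(defined.equals(stripped)); // prints "false"

        // A name without "__" survives the round trip.
        System.out.println(removeIterationFromTaskRefName("wait_databricks__1")); // prints "wait_databricks"
    }
}
```

Because the stripped name is "task_1" rather than "task_1__wait_databricks", a lookup of the loop's child tasks by that name cannot match, which is consistent with the behaviour described here.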

appunni-old added a commit to appunni-old/conductor that referenced this issue Dec 3, 2023
Parsing name is not considering task ref name with double underscores

- This is not fully fixing the use of this function. There needs to be some kind of validation against the user from setting up the taskRefName with double underscore
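The validation mentioned in this commit message could look roughly like the sketch below. This is not Conductor code: `TaskRefNameValidator`, `isSafeForLoops`, and `requireSafeForLoops` are hypothetical names, and where such a check would be wired in (workflow definition registration or the Start Workflow API) is left open.

```java
final class TaskRefNameValidator {
    // Assumed value, as stated in this issue: the loop delimiter is "__".
    static final String LOOP_TASK_DELIMITER = "__";

    // Hypothetical check: a reference name is safe if it does not contain the delimiter.
    static boolean isSafeForLoops(String taskReferenceName) {
        return taskReferenceName != null
                && !taskReferenceName.contains(LOOP_TASK_DELIMITER);
    }

    // Hypothetical helper that rejects unsafe names with a descriptive message.
    static void requireSafeForLoops(String taskReferenceName) {
        if (!isSafeForLoops(taskReferenceName)) {
            throw new IllegalArgumentException(
                    "taskReferenceName must not contain \"" + LOOP_TASK_DELIMITER
                            + "\": " + taskReferenceName);
        }
    }

    public static void main(String[] args) {
        System.out.println(isSafeForLoops("task_1_wait_databricks"));  // prints "true"
        System.out.println(isSafeForLoops("task_1__wait_databricks")); // prints "false"
    }
}
```

Rejecting such names up front would surface the problem when the definition is submitted instead of at runtime, at the cost of breaking existing definitions that already use double underscores.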
appunni-old added a commit to appunni-old/conductor that referenced this issue Dec 3, 2023
Fix: Netflix#3876 Update task utils: removeIterationFromTaskRefName
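One possible direction for such a change, shown only as a sketch and not necessarily what the referenced commit does, is to strip just a trailing `__<digits>` suffix instead of cutting at the first delimiter. The class and method names are hypothetical, and the suffix format is an assumption.

```java
import java.util.regex.Pattern;

final class IterationSuffixSketch {
    // Assumption: the iteration is appended as "__<digits>" at the end of the name.
    private static final Pattern ITERATION_SUFFIX = Pattern.compile("__\\d+$");

    // Removes only a trailing "__<digits>" and leaves earlier "__" occurrences alone.
    static String removeIterationSuffix(String referenceTaskName) {
        return ITERATION_SUFFIX.matcher(referenceTaskName).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(removeIterationSuffix("task_1__wait_databricks__1")); // prints "task_1__wait_databricks"
        System.out.println(removeIterationSuffix("task_1__wait_databricks"));    // prints "task_1__wait_databricks"
        // Still ambiguous: a user-defined name that itself ends in "__<digits>".
        System.out.println(removeIterationSuffix("step__2"));                    // prints "step"
    }
}
```

The last case shows why parsing alone does not fully solve the problem: a user-chosen name ending in `__2` cannot be told apart from an iteration suffix, so some form of validation on reference names is still needed.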