
Workflows stuck in delayed queue since hours/days and never gets executed #6230

Open
zsmanjot opened this issue Aug 31, 2024 · 14 comments

@zsmanjot

Hi There,
I could use some help with an issue. We use ST2 extensively and run many workflows on the box. The box has a good configuration in terms of memory and CPU.

We have noticed that workflows get queued but never executed, and the delayed queue is always far too long when checked.

For example:

If I check the delayed queue with st2 execution list -l --status delayed, I can see workflows from two days ago that never made it to execution. Because of this, other workflows are also impacted: a simple workflow that generally takes 10 minutes now takes 50 minutes to finish.
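
A quick way to see how big the backlog is (a sketch; it assumes the -j JSON flag in the st2 CLI and that jq is installed):

# count executions currently sitting in the delayed state
# (-n caps how many records the API returns in one call)
st2 execution list --status delayed -n 1000 -j | jq length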

Can anybody help me here?

Example: [screenshot of the delayed execution list]
@zsmanjot
Author

zsmanjot commented Sep 6, 2024

The problem is getting worse as the delays grow. I have checked and found that MongoDB is running at high CPU here.

Any ideas what can be done here? I know the volume of triggers is very high these days, so could that be the reason? If so, how can we address it?
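
To see what MongoDB is actually busy with, something like this in the mongo shell can help (a sketch; the 5-second threshold is arbitrary):

# list operations that have been running for more than 5 seconds
db.currentOp({ "secs_running": { "$gte": 5 } })
# many long-running queries against the st2 collections would point
# at the database as the bottleneck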

[screenshot of MongoDB CPU usage]

@zsmanjot
Author

zsmanjot commented Sep 6, 2024

Also, I can see in the DB that I have 4767 workflows in the delayed state.

[screenshot of the database query result]
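
A query along these lines returns that count straight from MongoDB (a sketch; the st2 database and the action_execution_d_b collection name are assumptions and may differ between st2 versions):

# count delayed executions directly in the database
mongo st2 --eval 'db.action_execution_d_b.countDocuments({status: "delayed"})'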

@zsmanjot
Author

zsmanjot commented Sep 6, 2024

@arm4b Any solution here?

@zsmanjot changed the title from "Workflows stuck in delayed queue since hours and never gets executed" to "Workflows stuck in delayed queue since hours/days and never gets executed" on Sep 6, 2024
@chain312
Contributor

chain312 commented Sep 6, 2024

Can you show what state your executions are in? I asked the same question today on Slack. I have been troubled by this for a long time and am still trying to solve it.

@zsmanjot
Author

zsmanjot commented Sep 6, 2024

My older workflows never make it to execution. If one somehow does, it just keeps holding at one task or another for hours and only completes after 2 or 3 hours.

@zsmanjot
Author

zsmanjot commented Sep 6, 2024

Also, I am getting this error as well:

root@stackstorm:~# st2 execution list -l --status delayed -n 2000 2>/dev/null
ERROR: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
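
Fetching fewer attributes per execution keeps the response smaller and may avoid the disconnect (a sketch; it assumes the --attr/-a flag in this CLI version):

# request only a few fields per execution instead of the full objects
st2 execution list --status delayed -n 500 -a id status start_timestamp
# or page through in smaller chunks rather than asking for 2000 at once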

@guzzijones
Contributor

This is a known issue. If the RabbitMQ retry connections are exhausted, an action gets stuck running forever. Your box is likely experiencing some internal network issues. Do your workflows create a very large context or have very large inputs or outputs?
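
One rough way to gauge how large a single execution's inputs and outputs are (a sketch; assumes the -j flag and wc, with <execution-id> as a placeholder):

# size in bytes of the full execution document, result included
st2 execution get <execution-id> -j | wc -c
# anything in the multi-megabyte range is worth trimming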

@zsmanjot
Author

Thanks @guzzijones for replying.
There are no underlying network issues. Regarding large inputs and outputs, no, these are not very large. But what has been noticed is that the number of triggers it receives these days is huge.

But the main concern is that ST2 keeps them queued for days and never even executes them.

@chain312
Contributor

Thanks for the reply. There are no underlying network issues. Regarding large inputs and outputs, no, these are not very large. But what has been noticed is that the number of triggers it receives every day is huge.

But the main concern is that ST2 keeps them queued for days and never even executes them.

Can you see what state most workflow instances are in?

@zsmanjot
Author

They are all stuck in the delayed state: more than 5000 workflows.

@guzzijones
Contributor

What is in your st2-workflow-engine and st2-action-runner logs? I bet you will see disconnects from RabbitMQ.
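
For example, something along these lines (paths assume a standard package install; adjust for your layout) should surface any AMQP/RabbitMQ connection errors:

# search the workflow engine and action runner logs for broker errors
grep -iE 'amqp|rabbitmq|connection reset|errno 104' /var/log/st2/st2workflowengine*.log
grep -iE 'amqp|rabbitmq|connection reset|errno 104' /var/log/st2/st2actionrunner*.log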

@zsmanjot
Author

@guzzijones
No, I could not see any RabbitMQ disconnects. Even if I try to purge older workflows it does not do anything, and I have to grep IDs and cancel these older workflows manually.

This is a big performance issue.
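
Until the purge behaves, a loop like this can cancel the stuck executions in bulk instead of grepping IDs by hand (a sketch; it assumes jq plus the -a/-j flags in this CLI version):

# cancel every execution currently reported as delayed
st2 execution list --status delayed -n 5000 -a id -j | jq -r '.[].id' | while read -r id; do
  st2 execution cancel "$id"
done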

@zsmanjot
Author

This is one example:

[screenshot of an execution's requested and scheduled times]

See the requested and scheduled times: a 3-hour delay. How could we reduce this delay? What factors might we be missing here? Any ideas?
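
To put a number on the delay from the CLI, something like this pulls the timestamps for a single execution (a sketch; <execution-id> is a placeholder and the attribute names assume the standard execution schema):

# show when the execution actually started and finished
st2 execution get <execution-id> -j | jq '{status, start_timestamp, end_timestamp}'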

@chain312
Contributor

This is one example:

[screenshot of an execution's requested and scheduled times] See the requested and scheduled times: a 3-hour delay. How could we reduce this delay? What factors might we be missing here? Any ideas?

It looks like it is blocked. If the performance problem cannot be solved, you can add filtering criteria to the rule so that only the data that actually needs automatic processing creates executions.
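
A minimal sketch of such a filter, assuming a placeholder trigger, payload field, and action (none of the names below come from this thread):

# write an example rule whose criteria drop uninteresting trigger payloads
# before an execution is ever requested, then register it
cat > /tmp/filtered_rule_example.yaml <<'EOF'
name: filtered_rule_example
pack: examples
enabled: true
trigger:
  type: examples.event_trigger      # placeholder trigger reference
criteria:
  trigger.severity:                 # placeholder payload field
    type: equals
    pattern: critical
action:
  ref: examples.remediate           # placeholder action reference
  parameters:
    payload: "{{ trigger }}"
EOF
st2 rule create /tmp/filtered_rule_example.yaml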

Labels: None yet
Projects: None yet
Development: No branches or pull requests

3 participants