Model management regression, hordelib not unloading models correctly #268
Yeah, I did notice that user mentioning it. Hopefully it's an easy fix.
Part of the resolution of this issue should be to reduce the degree to which the worker reaches into hordelib's internals. I am only doing so to triage the issue. Hordelib should have its own facilities to do what I am synthesizing on the worker side with the diff of #269.
Just FYI, it's currently not possible to kill hanging inference threads because the executor always creates them as non-daemon threads. How did you get around that with the soft restart?
I didn't. This solution will only work reliably if the inference threads exit normally or terminate with an exception.
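To illustrate the distinction being discussed: Python's `ThreadPoolExecutor` creates non-daemon worker threads, which block interpreter shutdown if they hang. A minimal sketch of the alternative, using a plain daemon thread with a join timeout (all names here are hypothetical, not hordelib's actual API):

```python
import threading
import time

def run_inference(payload, result):
    # Stand-in for a long-running inference call (hypothetical).
    time.sleep(0.1)
    result["output"] = f"done:{payload}"

def submit_daemon(payload, timeout=5.0):
    """Run inference on a daemon thread. A hung daemon thread still
    cannot be killed, but unlike ThreadPoolExecutor's non-daemon
    workers it will not prevent the process from exiting."""
    result = {}
    t = threading.Thread(target=run_inference, args=(payload, result), daemon=True)
    t.start()
    t.join(timeout)  # stop waiting after `timeout` seconds
    if t.is_alive():
        # Thread is stuck: abandon it and let process exit reap it.
        return None
    return result.get("output")
```

This only changes what happens at shutdown; as noted above, the stuck thread itself still cannot be terminated from Python.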
The tiled VAE just takes a long time. It may appear hung, but if the GPU is at 100% load it is in fact doing inference, and the thread will exit normally, albeit well after the job's time limit. The soft restart should prevent that happening again for a long time. If it doesn't, then the configured amount of VRAM to keep free is probably too low to begin with.
A regression to be sure, certainly caused by me, possibly to do with thread safety in hordelib.
The issue seems to be associated with stale jobs. I suspect the tiled VAE feature is kicking in when VRAM is running very low, because an unusual number of models is being reported as resident in VRAM, which is causing jobs to spill to RAM during inference and take inordinately long.
For example, on a P5000 (with 6 GB VRAM):
Which is patently absurd.
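The fix implied above is bookkeeping hordelib would need to do itself: when free VRAM drops below the configured headroom, evict resident models back to RAM until the target is met. A rough sketch of that accounting (all names hypothetical; real free-VRAM numbers would come from e.g. `torch.cuda.mem_get_info`):

```python
def models_to_unload(models_vram_mb, vram_free_mb, vram_to_keep_free_mb):
    """Pick models to move from VRAM to RAM until the configured
    free-VRAM headroom is restored.

    models_vram_mb: list of (name, size_mb) pairs, least recently
    used first, representing models currently resident in VRAM.
    """
    deficit = vram_to_keep_free_mb - vram_free_mb
    evict = []
    for name, size_mb in models_vram_mb:
        if deficit <= 0:
            break  # enough headroom reclaimed
        evict.append(name)
        deficit -= size_mb
    return evict
```

If this accounting goes stale (models counted as resident that were never actually loaded, or vice versa), the worker would see exactly the symptom reported here: an absurd number of models "in VRAM" and tiled VAE engaging when it shouldn't.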
What appears to be the same issue was also reported by a user with a 4090.
I have not had more than these two reports so far, so it may have been a change in the past couple of days, after v23. I need more time to investigate.