Model management regression, hordelib not unloading models correctly #268

Open
tazlin opened this issue Jun 26, 2023 · 5 comments
Assignees: tazlin
Labels: bug (Something isn't working)

Comments

tazlin (Member) commented Jun 26, 2023

This is a regression to be sure, certainly caused by me, and possibly related to thread safety in hordelib.

The issue seems to be associated with stale jobs. I suspect the tiled VAE feature is kicking in when VRAM runs very low, because an unusual number of models is being reported as loaded in VRAM; this causes jobs to spill over to RAM during inference and take inordinately long.

For example, on a P5000 (with 6 GB VRAM):

DEBUG      | 2023-06-26 08:10:15.399087 | comfy.model_management:have_free_vram:227 - Free VRAM is: 705MB (13 models loaded on GPU)
DEBUG      | 2023-06-26 08:10:15.775112 | comfy.model_management:load_model_gpu:243 - Unloaded a model, free VRAM is now: 2359MB (12 models loaded on GPU)
INFO       | 2023-06-26 08:10:43.523285 | worker.workers.framework:check_running_job_status:206 - Estimated average kudos per hour: 11401
INFO       | 2023-06-26 08:11:13.745666 | worker.workers.framework:check_running_job_status:206 - Estimated average kudos per hour: 11401
WARNING    | 2023-06-26 08:11:15.491664 | worker.workers.framework:check_running_job_status:187 - Restarting all jobs, as a job is stale : 160.000s
DEBUG      | 2023-06-26 08:14:41.983356 | hordelib.comfy_horde:send_sync:556 - executing, {'node': 'vae_decode', 'prompt_id': '364c0468-8ad3-4cdf-9f22-464b7f05ac94'}, 364c0468-8ad3-4cdf-9f22-464b7f05ac94

Thirteen models reported as loaded on a 6 GB GPU is patently absurd.
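For context, here is a minimal sketch of the kind of free-VRAM bookkeeping those `have_free_vram` / `load_model_gpu` log lines suggest. This is not ComfyUI's or hordelib's actual code; the threshold and the model registry are assumptions, purely to illustrate how a registry that over-reports what is on the GPU leads to bad unload decisions and slow fallback paths:

```python
# Illustration only -- not ComfyUI's or hordelib's actual implementation.
# MIN_FREE_MB and loaded_models are hypothetical stand-ins.
import torch

MIN_FREE_MB = 2048        # hypothetical "VRAM to keep free" threshold, in MB
loaded_models: list = []  # hypothetical registry of models currently on the GPU


def free_vram_mb() -> float:
    """Return free VRAM on the current CUDA device, in megabytes."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / (1024 * 1024)


def ensure_free_vram() -> None:
    """Unload tracked models, oldest first, until enough VRAM is free.

    If the registry keeps entries for models that were never actually unloaded
    (or drops entries too late), this loop makes the wrong decisions and
    inference falls back to slow paths such as tiled VAE decoding.
    """
    while loaded_models and free_vram_mb() < MIN_FREE_MB:
        model = loaded_models.pop(0)
        model.to("cpu")              # move weights off the GPU
        torch.cuda.empty_cache()     # hand the freed memory back to the driver
```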

What appears to be the same issue was also reported by a user with a 4090.

I have only had these two reports so far, so it is possible the cause was a change in the past couple of days, after v23. I need more time to investigate.

db0 (Member) commented Jun 26, 2023

Yeah, I did notice that user mentioning it. Hopefully it's an easy fix.

tazlin (Member, Author) commented Jun 26, 2023

Part of resolving this issue should be to roll back the increased degree to which the worker now reaches into hordelib's internals; I am only doing that to triage the problem. Hordelib should have its own facilities for what I am synthesizing on the worker side with the diff in #269.
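As an illustration only, a hypothetical sketch of the kind of hordelib-side facility meant here. None of these names are hordelib's real API; the point is that the worker should be able to call one explicit hook after a stale job instead of poking at hordelib/comfy internals:

```python
# Hypothetical sketch only: these names are NOT hordelib's actual API.
import torch


class ModelMemoryManager:
    """Illustrative facade that owns the 'what is on the GPU' bookkeeping."""

    def __init__(self) -> None:
        self._gpu_models: list = []

    def track(self, model) -> None:
        """Register a model that has just been moved onto the GPU."""
        self._gpu_models.append(model)

    def unload_all_from_vram(self) -> int:
        """Move every tracked model off the GPU; return how many were unloaded."""
        count = 0
        while self._gpu_models:
            self._gpu_models.pop().to("cpu")
            count += 1
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        return count
```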

tazlin added the bug (Something isn't working) label on Jun 26, 2023
tazlin self-assigned this on Jun 26, 2023
db0 (Member) commented Jun 27, 2023

Just FYI, it's currently not possible to kill hanging inference threads because the executor always opens them as non-daemon threads. How did you get around that with the soft restart?
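A small self-contained demonstration of the problem with Python's standard `concurrent.futures` executor: once a submitted task has started running, `cancel()` is a no-op and the process simply waits for the thread to finish. (The `fake_inference` function is just a stand-in for a long tiled-VAE decode.)

```python
# Once a task submitted to a ThreadPoolExecutor has started, Future.cancel()
# returns False and there is no supported way to kill the thread.
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference(seconds: float) -> str:
    time.sleep(seconds)          # stand-in for a long tiled-VAE decode
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fake_inference, 5.0)
    time.sleep(0.1)              # give the task time to start
    print(future.cancel())       # False: a running future cannot be cancelled
    print(future.result())       # blocks until the "inference" actually finishes
```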

tazlin (Member, Author) commented Jun 27, 2023

I didn't. This solution will only work with any reliability if the inference threads exit normally or terminate with an exception.

tazlin (Member, Author) commented Jun 27, 2023

The tiled VAE just takes a long time. It may appear hung, but if the GPU is at 100% load it is in fact doing inference, and the thread will exit normally, albeit well past the job's time limit. The soft restart should prevent that from happening again for a long time. If it doesn't, then the amount of VRAM configured to be kept free is probably too low to begin with.
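For illustration, a hypothetical sketch (not the worker's actual implementation) of the soft-restart approach being described: don't try to kill the thread, wait for it to exit on its own, then reset state so the slow path isn't hit again. The threshold mirrors the 160 s stale-job warning in the log above; `soft_restart` is a hypothetical callback.

```python
# Hypothetical watchdog sketch, not the worker's real code.
import threading
import time

STALE_JOB_SECONDS = 160.0   # hypothetical stale threshold, mirroring the log above


def watchdog(job_thread: threading.Thread, started_at: float, soft_restart) -> None:
    """Wait out a stale job, then trigger a soft restart once it finally exits."""
    while job_thread.is_alive():
        if time.monotonic() - started_at > STALE_JOB_SECONDS:
            # The thread cannot be killed; let it finish, then reset everything.
            job_thread.join()
            soft_restart()      # hypothetical callback: unload models, reinit pipeline
            return
        time.sleep(1.0)
```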
