Model management regression, hordelib not unloading models correctly #268

Open
tazlin opened this issue Jun 26, 2023 · 5 comments
Assignees: tazlin
Labels: bug (Something isn't working)

Comments

tazlin (Member) commented Jun 26, 2023

This is a regression to be sure, certainly caused by me, and possibly related to thread safety in hordelib.

The issue seems to be associated with stale jobs. I suspect the tiled VAE feature is kicking in when VRAM runs very low, because an unusual number of models is being reported as loaded in VRAM; this causes jobs to spill over to RAM during inference and take inordinately long.

For example, on a P5000 (with 6 GB VRAM):

DEBUG      | 2023-06-26 08:10:15.399087 | comfy.model_management:have_free_vram:227 - Free VRAM is: 705MB (13 models loaded on GPU)
DEBUG      | 2023-06-26 08:10:15.775112 | comfy.model_management:load_model_gpu:243 - Unloaded a model, free VRAM is now: 2359MB (12 models loaded on GPU)
INFO       | 2023-06-26 08:10:43.523285 | worker.workers.framework:check_running_job_status:206 - Estimated average kudos per hour: 11401
INFO       | 2023-06-26 08:11:13.745666 | worker.workers.framework:check_running_job_status:206 - Estimated average kudos per hour: 11401
WARNING    | 2023-06-26 08:11:15.491664 | worker.workers.framework:check_running_job_status:187 - Restarting all jobs, as a job is stale : 160.000s
DEBUG      | 2023-06-26 08:14:41.983356 | hordelib.comfy_horde:send_sync:556 - executing, {'node': 'vae_decode', 'prompt_id': '364c0468-8ad3-4cdf-9f22-464b7f05ac94'}, 364c0468-8ad3-4cdf-9f22-464b7f05ac94

Thirteen models reported as loaded on a 6 GB GPU is patently absurd.
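For context, here is a minimal sketch of the kind of free-VRAM bookkeeping those `have_free_vram` / `load_model_gpu` log lines suggest. This is not ComfyUI's or hordelib's actual code; the threshold and the model registry are assumptions, purely to illustrate how a registry that over-reports what is on the GPU leads to bad unload decisions and slow fallback paths:

```python
# Illustration only -- not ComfyUI's or hordelib's actual implementation.
# MIN_FREE_MB and loaded_models are hypothetical stand-ins.
import torch

MIN_FREE_MB = 2048        # hypothetical "VRAM to keep free" threshold, in MB
loaded_models: list = []  # hypothetical registry of models currently on the GPU


def free_vram_mb() -> float:
    """Return free VRAM on the current CUDA device, in megabytes."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / (1024 * 1024)


def ensure_free_vram() -> None:
    """Unload tracked models, oldest first, until enough VRAM is free.

    If the registry keeps entries for models that were never actually unloaded
    (or drops entries too late), this loop makes the wrong decisions and
    inference falls back to slow paths such as tiled VAE decoding.
    """
    while loaded_models and free_vram_mb() < MIN_FREE_MB:
        model = loaded_models.pop(0)
        model.to("cpu")              # move weights off the GPU
        torch.cuda.empty_cache()     # hand the freed memory back to the driver
```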

What appears to be the same issue was also reported by a user with a 4090.

I have only had these two reports so far, so it is possible the cause was a change in the past couple of days, after v23. I need more time to investigate.

db0 (Member) commented Jun 26, 2023

Yeah, I did notice that user mentioning it. Hopefully it's an easy fix.

tazlin (Member, Author) commented Jun 26, 2023

Part of resolving this issue should be to roll back the increased degree to which the worker now reaches into hordelib's internals; I am only doing that to triage the problem. Hordelib should have its own facilities for what I am synthesizing on the worker side with the diff in #269.
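As an illustration only, a hypothetical sketch of the kind of hordelib-side facility meant here. None of these names are hordelib's real API; the point is that the worker should be able to call one explicit hook after a stale job instead of poking at hordelib/comfy internals:

```python
# Hypothetical sketch only: these names are NOT hordelib's actual API.
import torch


class ModelMemoryManager:
    """Illustrative facade that owns the 'what is on the GPU' bookkeeping."""

    def __init__(self) -> None:
        self._gpu_models: list = []

    def track(self, model) -> None:
        """Register a model that has just been moved onto the GPU."""
        self._gpu_models.append(model)

    def unload_all_from_vram(self) -> int:
        """Move every tracked model off the GPU; return how many were unloaded."""
        count = 0
        while self._gpu_models:
            self._gpu_models.pop().to("cpu")
            count += 1
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        return count
```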

tazlin added the bug (Something isn't working) label on Jun 26, 2023
tazlin self-assigned this on Jun 26, 2023
db0 (Member) commented Jun 27, 2023

Just FYI, it's currently not possible to kill hanging inference threads because the executor always opens them as non-daemon threads. How did you get around that with the soft restart?
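A small self-contained demonstration of the problem with Python's standard `concurrent.futures` executor: once a submitted task has started running, `cancel()` is a no-op and the process simply waits for the thread to finish. (The `fake_inference` function is just a stand-in for a long tiled-VAE decode.)

```python
# Once a task submitted to a ThreadPoolExecutor has started, Future.cancel()
# returns False and there is no supported way to kill the thread.
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference(seconds: float) -> str:
    time.sleep(seconds)          # stand-in for a long tiled-VAE decode
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fake_inference, 5.0)
    time.sleep(0.1)              # give the task time to start
    print(future.cancel())       # False: a running future cannot be cancelled
    print(future.result())       # blocks until the "inference" actually finishes
```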

tazlin (Member, Author) commented Jun 27, 2023

I didn't. This solution will only work with any reliability if the inference threads exit normally or terminate with an exception.

tazlin (Member, Author) commented Jun 27, 2023

The tiled VAE just takes a long time. It may appear hung, but if the GPU is at 100% load it is in fact doing inference, and the thread will exit normally, albeit well past the job's time limit. The soft restart should prevent that from happening again for a long time. If it doesn't, then the amount of VRAM configured to be kept free is probably too low to begin with.
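For illustration, a hypothetical sketch (not the worker's actual implementation) of the soft-restart approach being described: don't try to kill the thread, wait for it to exit on its own, then reset state so the slow path isn't hit again. The threshold mirrors the 160 s stale-job warning in the log above; `soft_restart` is a hypothetical callback.

```python
# Hypothetical watchdog sketch, not the worker's real code.
import threading
import time

STALE_JOB_SECONDS = 160.0   # hypothetical stale threshold, mirroring the log above


def watchdog(job_thread: threading.Thread, started_at: float, soft_restart) -> None:
    """Wait out a stale job, then trigger a soft restart once it finally exits."""
    while job_thread.is_alive():
        if time.monotonic() - started_at > STALE_JOB_SECONDS:
            # The thread cannot be killed; let it finish, then reset everything.
            job_thread.join()
            soft_restart()      # hypothetical callback: unload models, reinit pipeline
            return
        time.sleep(1.0)
```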
