fix: VRAM mismanagement band-aid #269

Merged: 2 commits, Jun 26, 2023

tazlin (Member) commented Jun 26, 2023

See #268 for some background.

This change will:

  • Trigger a worker 'soft restart', forcibly unloading all compvis models, in either of these situations:
    • A job is stale
    • An impossible number of models is reported as being in VRAM
  • Halt the worker if more than 15 soft restarts occur.
  • Track consecutive_failed_jobs across soft restarts.
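
The behaviour listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code merged in this PR: `WorkerState`, `needs_soft_restart`, `soft_restart`, `check`, and `max_models_in_vram` are hypothetical names, and "an impossible number of models" is assumed here to mean "more than a configured upper bound". Only the threshold of 15 and the `consecutive_failed_jobs` counter come from the description.

```python
from dataclasses import dataclass, field

MAX_SOFT_RESTARTS = 15  # threshold taken from the PR description


@dataclass
class WorkerState:
    """Hypothetical stand-in for the worker's bookkeeping (not the real class)."""

    max_models_in_vram: int  # assumed: configured upper bound on models in VRAM
    loaded_models: list = field(default_factory=list)
    soft_restarts: int = 0
    consecutive_failed_jobs: int = 0
    halted: bool = False

    def needs_soft_restart(self, job_is_stale: bool, models_reported_in_vram: int) -> bool:
        # Either trigger from the description is enough on its own.
        return job_is_stale or models_reported_in_vram > self.max_models_in_vram

    def soft_restart(self) -> None:
        # Forcibly drop every loaded model. A real worker would also have to
        # release the underlying RAM/VRAM; that part is out of scope here.
        self.loaded_models.clear()
        self.soft_restarts += 1
        # consecutive_failed_jobs is deliberately left untouched so that it
        # keeps counting across soft restarts.
        if self.soft_restarts > MAX_SOFT_RESTARTS:
            self.halted = True

    def check(self, job_is_stale: bool, models_reported_in_vram: int) -> bool:
        """Run one health check; return True if a soft restart happened."""
        if self.halted:
            return False
        if self.needs_soft_restart(job_is_stale, models_reported_in_vram):
            self.soft_restart()
            return True
        return False
```

In this sketch, a caller would invoke `check` once per job loop iteration and stop pulling jobs once `halted` is set. Keeping `consecutive_failed_jobs` outside the reset path is what lets the failure count carry over from one soft restart to the next.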

tazlin added 2 commits June 26, 2023 16:46
This is a band-aid, similar to the LoRa 'rescue', intended to more forcibly get any SD models out of RAM/VRAM on a stale job. A more in-depth fix is in the works on the hordelib side, as I suspect the real issue is a regression I introduced in hordelib 1.6.x rather than a problem with the worker per se.
In an ideal world, hordelib would be behaving. However, this bit of defensive programming on the worker side will mitigate the problems now and perhaps minimize future worker downtime if hordelib model management goes wonky for any reason.
tazlin added the release:patch (Release with a patch version bump) label Jun 26, 2023
tazlin merged commit d0faa11 into main Jun 26, 2023
tazlin deleted the stale-job-bandaid branch July 31, 2023 12:37