Following the examples/dreambooth/README_flux.md guide for setup and training, got CUDA OOM with a 3090 Ti 24GB #9732
Comments
I don't think the Flux DreamBooth training scripts are memory-optimized out of the box. You could try running them with DeepSpeed and enabling gradient checkpointing, which should lower the memory requirements considerably (see the sketch below). For serious training experiments, we recommend something like SimpleTuner, which uses diffusers as a backend, supports many important training-related components, and is memory-efficient.
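For context, a minimal sketch of what gradient checkpointing looks like at the model level, assuming the standard diffusers FluxTransformer2DModel (the training script exposes an equivalent gradient-checkpointing option); this is illustrative, not the script's exact code:

```python
# Sketch only: enable gradient checkpointing on the Flux transformer so that
# activations are recomputed during the backward pass instead of being stored.
import torch
from diffusers import FluxTransformer2DModel

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # model id assumed for illustration
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
transformer.enable_gradient_checkpointing()  # trades extra compute for lower activation memory
```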
You could also give our quantization example a try and let us know how it goes.
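Not the exact example linked above, but a rough sketch of loading the Flux transformer in 4-bit NF4, assuming a diffusers build with bitsandbytes quantization support:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# Assumption: bitsandbytes is installed and this diffusers version exposes
# BitsAndBytesConfig; quantizing the 12B transformer to NF4 roughly quarters
# its weight memory.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
```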
Can you try the changes in #9829? I saved memory by implementing them :)
@a-r-r-o-w SimpleTuner got stuck when running `$ poetry install ...`, so I'm giving up on it.
Just checked the changes, they look great and should be helpful. But I had already trained on a single GPU with ai-toolkit weeks ago, and I've since moved on to struggling with diffusers inference for Flux + LoRA + ControlNet openpose, which is also hitting OOM; I probably need to use fp8 or schnell for that, and I'll create another ticket for that issue soon. I'll come back to test and verify this in a few days.
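On the inference-side OOM mentioned above, a minimal sketch of model CPU offloading with the Flux pipeline (LoRA and ControlNet loading omitted; model id assumed):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
# Moves each sub-model to the GPU only while it is in use,
# keeping the rest in system RAM.
pipe.enable_model_cpu_offload()

image = pipe("a photo of sks dog", num_inference_steps=28).images[0]
image.save("flux_offload_test.png")
```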
@riflemanl I used bf16 with DeepSpeed and accelerate, and it should work. Another factor is that FLUX is a 12B-parameter model, so it needs a lot of memory.
@leisuzz: Oh!? I just tried the patch and it still OOMs. I see that you set images = None and del pipeline at the end, but I get OOM at the beginning of training, while VRAM is being allocated. Watching nvidia-smi, memory usage goes from 1GB to 24GB in about 5 seconds and then it crashes. My accelerate launch parameters are as follows:
24GB is not enough. What is your hardware setup? Try reducing the batch size and resolution.
Batch size 1 at resolution 256 takes around 40GB on my 8-GPU setup. I think you should try offloading to CPU.
Where should I add the CPU offloading code?
Try it with DeepSpeed, but I don't think one GPU is enough. A rough sketch of the offload setup follows.
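For reference, a hedged sketch of pointing accelerate at DeepSpeed with optimizer-state offload to CPU; the same settings can also be written into an accelerate config file instead of Python:

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Assumption: deepspeed is installed. ZeRO stage 2 offloads optimizer states
# to CPU RAM; stage 3 would additionally shard and offload parameters.
ds_plugin = DeepSpeedPlugin(
    zero_stage=2,
    offload_optimizer_device="cpu",
    gradient_accumulation_steps=1,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)
```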
But ai-toolkit can train a 1024-resolution Flux.1-dev LoRA without problems; it just cannot use my 2 GPUs to speed things up, which is why I'm trying diffusers + accelerate here. If diffusers cannot handle it, or can only manage 256, I'll have to give up.
It also depends on the size of the dataset.
This should easily fit within a 24GB GPU. I currently don't have the bandwidth to debug this further, but if you don't want to use quantization, you can consider other trainers like https://github.com/ostris/ai-toolkit/.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing this due to inactivity.
You can set "--num_validation_images" to 1.
Describe the bug
Followed the examples/dreambooth/README_flux.md guide for setup and training, and got CUDA OOM with a 3090 Ti 24GB.
Reproduction
PC with 256GB RAM
3090Ti VRAM 24GB
torch 2.4.1 + cuda 12.1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
accelerate==1.0.1
transformers==4.45.2
Logs
System Info
Diffusers version is the latest main branch as of today (2024-10-21), because the previous release tag does not yet support DreamBooth Flux LoRA training.
Who can help?
No response