train_dreambooth_lora_flux.py distributed bugs #9161

Open
neuron-party opened this issue Aug 12, 2024 · 7 comments
Labels
bug (Something isn't working) · stale (Issues that haven't received updates)

Comments

@neuron-party
Contributor

Describe the bug

AttributeError when running multi-GPU distributed training with accelerate.

Reproduction

accelerate launch --config_file train_dreambooth_lora_flux.py \
  --resolution=1024 \
  --mixed_precision=bf16 \
  --pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev \
  --num_validation_images=8 \
  --validation_epochs=100 \
  --rank=16 \
  --train_batch_size=1 \
  --learning_rate=1e-4 \
  --guidance_scale=3.5 \
  --checkpointing_steps=200 \
  --instance_prompt=xyz \
  --instance_data_dir=xyz \
  --output_dir=xyz \
  --logging_dir=xyz \
  --validation_prompt=xyz

accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
use_cpu: false
gpu_ids: '0, 1'
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false

Logs

if transformer.config.guidance_embeds:

AttributeError: 'DistributedDataParallel' object has no attribute 'config'
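For context, this is a general property of DDP rather than anything Flux-specific: torch.nn.parallel.DistributedDataParallel keeps the wrapped model under .module and does not forward arbitrary attributes such as config. A minimal sketch, where the Toy module and its config attribute are made-up stand-ins for the Flux transformer:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup so DDP can be constructed on CPU with the gloo backend.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class Toy(torch.nn.Module):
    """Stand-in for the Flux transformer; `config` is a plain attribute here."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)
        self.config = type("Config", (), {"guidance_embeds": True})()

model = DDP(Toy())
print(hasattr(model, "config"))             # False: DDP does not proxy the attribute
print(model.module.config.guidance_embeds)  # True: the wrapped model still has it

dist.destroy_process_group()
```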

System Info

diffusers from source
accelerate==0.33.0
transformers==4.44.1

training on A100s

Who can help?

No response

neuron-party added the bug (Something isn't working) label on Aug 12, 2024
@tolgacangoz
Contributor

What happens if you unwrap it:

if accelerator.unwrap_model(transformer).config.guidance_embeds:
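
For reference, a sketch of how the guarded read could look inside the training loop. transformer, accelerator, args, and model_input are the names the script uses; the surrounding guidance logic is an assumption, not a verbatim excerpt:

```python
# Sketch: read the config from the underlying model, not the DDP wrapper.
# accelerator.unwrap_model() returns the original module that DDP wraps.
if accelerator.unwrap_model(transformer).config.guidance_embeds:
    guidance = torch.full((model_input.shape[0],), args.guidance_scale,
                          device=accelerator.device, dtype=model_input.dtype)
else:
    guidance = None
```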

@maziyarpanahi

Any progress on this? The full fine-tune runs OOM on 4x A100 80GB, and the LoRA run hits this error.

> What happens if you unwrap it:
>
> if accelerator.unwrap_model(transformer).config.guidance_embeds:

The same issue exists in the full fine-tune, and there is no config.guidance_embeds inside the train_dreambooth_lora_flux.py file.

@Adenialzz
Contributor

> What happens if you unwrap it:
>
> if accelerator.unwrap_model(transformer).config.guidance_embeds:

This works in my case.

But I got another PyTorch OOM error in log_validation, even on an A800 80GB. How can I fix this?

@tolgacangoz
Contributor

tolgacangoz commented Aug 20, 2024

For OOM, see https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_flux.md. Did you add --gradient_checkpointing, --use_8bit_adam, and --gradient_accumulation_steps=4 or 8? Is it possible for you to try without --validation_prompt? Also, could you try one of the latest versions of PyTorch?
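
As a sketch, the launch command from the reproduction with those README flags added; the config path is a placeholder and the step count is one of the suggested values, not a verified setting:

```
accelerate launch --config_file <accelerate_config.yaml> train_dreambooth_lora_flux.py \
  --gradient_checkpointing \
  --use_8bit_adam \
  --gradient_accumulation_steps=4 \
  ...  # remaining flags as in the reproduction above
```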

@maziyarpanahi Could you elaborate? Isn't config.guidance_embeds in train_dreambooth_lora_flux.py:

if transformer.config.guidance_embeds:

@maziyarpanahi

> @maziyarpanahi Could you elaborate? Isn't config.guidance_embeds in train_dreambooth_lora_flux.py:

Sorry about that, I was looking in the wrong file. It does indeed exist in the LoRA file as well.

@Adenialzz
Contributor

> For OOM, see https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_flux.md. Did you add --gradient_checkpointing, --use_8bit_adam, and --gradient_accumulation_steps=4 or 8? Is it possible for you to try without --validation_prompt? Also, could you try one of the latest versions of PyTorch?
>
> @maziyarpanahi Could you elaborate? Isn't config.guidance_embeds in train_dreambooth_lora_flux.py:
>
> if transformer.config.guidance_embeds:

Thanks. I missed this OOM guidance before. It helps a lot.

github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale (Issues that haven't received updates) label on Sep 15, 2024