
[Training] fix training resuming problem when using FP16 (SDXL LoRA DreamBooth) #6514

Merged
11 commits merged on Jan 12, 2024

Conversation

sayakpaul (Member) commented on Jan 10, 2024

What does this PR do?

This PR tries to solve issues like #6442 in a clean way, limiting the changes to the DreamBooth SDXL LoRA script for now.

To test

First run

CUDA_VISIBLE_DEVICES=0 accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
  --pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
  --instance_data_dir="dog" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --output_dir="lora-trained-sdxl" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=4 --checkpointing_steps=2 --checkpoints_total_limit=2 \
  --use_8bit_adam \
  --seed="0"

And then resume:

CUDA_VISIBLE_DEVICES=0 accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
  --pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
  --instance_data_dir="dog" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --output_dir="lora-trained-sdxl" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=8 --checkpointing_steps=2 --checkpoints_total_limit=2 \
  --resume_from_checkpoint="latest" \
  --use_8bit_adam \
  --seed="0"

Tested with the --train_text_encoder flag enabled as well.

I would appreciate it if one of the reviewers could cross-check this.

Comment on lines +1103 to +1112
# Make sure the trainable params are in float32.
if args.mixed_precision == "fp16":
    models = [unet]
    if args.train_text_encoder:
        models.extend([text_encoder_one, text_encoder_two])
    for model in models:
        for param in model.parameters():
            # only upcast trainable parameters (LoRA) into fp32
            if param.requires_grad:
                param.data = param.to(torch.float32)
sayakpaul (Member, Author):
We do this just before handing the parameters to the optimizer to avoid any unintended consequences.
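To make the ordering explicit: the adapters are injected first, the trainable parameters are upcast, and only after that are they collected and passed to the optimizer, so the optimizer state is created against fp32 tensors. A minimal sketch of that last step (build_optimizer is a hypothetical helper, and plain AdamW stands in for the 8-bit optimizer used in the commands above):

import itertools

import torch


def build_optimizer(models, learning_rate):
    """Collect the (already upcast) trainable LoRA params, then create the optimizer."""
    params_to_optimize = list(
        itertools.chain.from_iterable(
            (p for p in m.parameters() if p.requires_grad) for m in models
        )
    )
    # Because the fp32 upcast above ran first, the optimizer only ever sees fp32 tensors.
    return torch.optim.AdamW(params_to_optimize, lr=learning_rate)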

sayakpaul (Member, Author):

In a follow-up PR, I can wrap this utility into a function and move it to training_utils.py, since it's shared by a number of scripts.
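As a rough sketch of what such a shared utility could look like (the name cast_training_params and its signature are assumptions here, not the final API):

import torch


def cast_training_params(models, dtype=torch.float32):
    """Upcast only the trainable (e.g. LoRA) parameters of each model to `dtype`."""
    if not isinstance(models, (list, tuple)):
        models = [models]
    for model in models:
        for param in model.parameters():
            if param.requires_grad:
                param.data = param.to(dtype)

The inline snippet above could then become a single call like cast_training_params([unet, text_encoder_one, text_encoder_two], torch.float32) in each script.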


sayakpaul changed the title from "[Training] fix training resuming problem when using FP16" to "[Training] fix training resuming problem when using FP16 (SDXL LoRA DreamBooth)" on Jan 10, 2024
Comment on lines -1061 to -1072
LoraLoaderMixin.load_lora_into_unet(lora_state_dict, network_alphas=network_alphas, unet=unet_)

text_encoder_state_dict = {k: v for k, v in lora_state_dict.items() if "text_encoder." in k}
LoraLoaderMixin.load_lora_into_text_encoder(
    text_encoder_state_dict, network_alphas=network_alphas, text_encoder=text_encoder_one_
)
sayakpaul (Member, Author):

We cannot use load_lora_into_unet() and load_lora_into_text_encoder() here, for the following reason (described only for the UNet, but it applies to the text encoders, too):

  • We call add_adapter() once on unet at the beginning of training. This creates an adapter config inside the UNet.
  • Then, when loading an intermediate checkpoint in the accelerate hook, load_lora_into_unet() would internally call inject_adapter_in_model() again, with the config inferred from the provided state dict. So it would create a second adapter. This is undesirable, right? A sketch of the alternative follows.
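For illustration, one way around the double injection is to write the checkpoint's LoRA weights directly into the adapter modules that add_adapter() already created, for example with peft's set_peft_model_state_dict. This is a sketch only: the "unet." prefix handling is a simplification, and a diffusers-to-peft key conversion may also be needed in practice.

from peft import set_peft_model_state_dict


def load_unet_lora_from_checkpoint(unet, lora_state_dict):
    """Load UNet LoRA weights into the adapter injected by add_adapter(),
    without creating a second adapter the way load_lora_into_unet() would."""
    unet_keys = {
        k[len("unet."):]: v for k, v in lora_state_dict.items() if k.startswith("unet.")
    }
    # Writes tensors into the existing adapter modules; no new adapter config is created.
    incompatible_keys = set_peft_model_state_dict(unet, unet_keys, adapter_name="default")
    return incompatible_keys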

@@ -996,17 +997,6 @@ def main(args):
        text_encoder_one.add_adapter(text_lora_config)
        text_encoder_two.add_adapter(text_lora_config)

    # Make sure the trainable params are in float32.
BenjaminBossan (Member) left a comment:

Thanks Sayak, LGTM.
younesbelkada (Contributor) left a comment:

Looking great! Thanks for all your work on this! I left one comment, wdyt?

sayakpaul (Member, Author):

@BenjaminBossan pinging for #6514 (comment).

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
[Training] fix training resuming problem when using FP16 (SDXL LoRA DreamBooth) (huggingface#6514)

* fix: training resume from fp16.

* add: comment

* remove residue from another branch.

* remove more residues.

* thanks to Younes; no hacks.

* style.

* clean things a bit and modularize _set_state_dict_into_text_encoder

* add comment about the fix detailed.