
Fix: training resume from fp16 for SDXL Consistency Distillation #6840

Merged

Conversation

asrimanth
Contributor

@asrimanth commented Feb 4, 2024

What does this PR do?

Part of #6552 for SDXL Consistency Distillation
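
For context, the "shift mixed precision cast before optimizer" commit listed at the end of this thread points at the usual fp16-resume pattern in the diffusers LoRA training scripts: the frozen base weights stay in fp16, but the trainable LoRA parameters are upcast to fp32 before the optimizer is created, so that optimizer state and resumed checkpoints operate on fp32 parameters. A minimal sketch of that pattern (not the exact diff in this PR; args, unet, and accelerator are the usual script-level placeholders, and cast_training_params is the helper from diffusers.training_utils):

import torch
from diffusers.training_utils import cast_training_params

# keep the frozen base weights in fp16 when --mixed_precision="fp16"
weight_dtype = torch.float16 if args.mixed_precision == "fp16" else torch.float32
unet.to(accelerator.device, dtype=weight_dtype)

if args.mixed_precision == "fp16":
    # upcast only the trainable (LoRA) parameters back to fp32 *before*
    # building the optimizer, otherwise resuming from a checkpoint breaks
    cast_training_params(unet, dtype=torch.float32)

params_to_optimize = [p for p in unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params_to_optimize, lr=args.learning_rate)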


What to test?

First run this:

accelerate launch train_lcm_distill_lora_sdxl.py \
  --pretrained_teacher_model="stabilityai/stable-diffusion-xl-base-1.0"  \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --output_dir="pokemons-lora-lcm-sdxl" \
  --mixed_precision="fp16" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --use_8bit_adam \
  --lora_rank=16 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=4 --checkpointing_steps=2 --checkpoints_total_limit=2 \
  --seed="0"

To resume training, run the following command:

accelerate launch train_lcm_distill_lora_sdxl.py \
  --pretrained_teacher_model="stabilityai/stable-diffusion-xl-base-1.0"  \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --output_dir="pokemons-lora-lcm-sdxl" \
  --mixed_precision="fp16" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --use_8bit_adam \
  --lora_rank=16 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=8 --checkpointing_steps=2 --checkpoints_total_limit=2 \
  --seed="0" \
  --resume_from_checkpoint="latest"
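
For reference, --resume_from_checkpoint="latest" is typically resolved in the diffusers example scripts roughly as sketched below (args and accelerator are the usual script-level objects, and checkpoint-<step> is the directory naming convention produced by --checkpointing_steps; this is illustrative, not the exact code in this script):

import os

if args.resume_from_checkpoint == "latest":
    # pick the most recent checkpoint-<step> directory in the output dir
    dirs = [d for d in os.listdir(args.output_dir) if d.startswith("checkpoint")]
    dirs = sorted(dirs, key=lambda d: int(d.split("-")[1]))
    path = dirs[-1] if dirs else None
else:
    path = os.path.basename(args.resume_from_checkpoint)

if path is None:
    initial_global_step = 0  # nothing to resume from, start fresh
else:
    accelerator.load_state(os.path.join(args.output_dir, path))
    initial_global_step = int(path.split("-")[1])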

Who can review?

@sayakpaul

@asrimanth changed the title from "Fix: training resume from fp16 for lcm distill lora sdxl" to "Fix: training resume from fp16 for SDXL Consistency Distillation" on Feb 4, 2024
@sayakpaul
Member

Thanks for your contribution. Could we please reduce the number of steps to quickly check this?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@asrimanth
Contributor Author

asrimanth commented Feb 4, 2024

Do you mean reducing the max_train_steps or the checkpointing_steps? FYI, the above-mentioned tests work for me on my local machine. Also, how do I fix the code quality check?

@sayakpaul
Member

Do you mean reducing the max_train_steps or the checkpointing_steps? FYI, the above-mentioned tests work for me on my local machine. Also, how do I fix the code quality check?

Yeah. I am sure the above command will work, but it will take longer to validate. The effectiveness can essentially be tested with far fewer steps, as I did here. Also, please run the code quality linter with make style && make quality.

@asrimanth
Contributor Author

I've pushed a commit for the code style, and I've edited the test command to save a checkpoint after a smaller number of steps, as per your example.

@sayakpaul
Member

I don't think I made myself super clear. Sorry.

Your example commands still have max_train_steps set to 3000. Why though? Here, I have it set to 4 and 8 for the two commands respectively, along with checkpointing_steps and checkpoints_total_limit set accordingly. I don't have any unnecessary arguments here, such as report_to.

Can we please keep the example commands super lean?

I hope I made myself more clear this time.

@asrimanth
Contributor Author

Yeah, makes sense! I've made the changes so that the training command is leaner and more efficient.

@sayakpaul
Member

I just tried it and I am facing shape mismatch problems while running the second command.

@asrimanth
Contributor Author

Update: I just got a Runpod machine with more VRAM for testing and was able to test the updated script with lower settings. I also made some changes to the code, and it works on my local machine. Please test it out and let me know.

@sayakpaul
Member

Thank you! But now I am seeing some state dict key mismatches when resuming training. Could you look into that?

@asrimanth
Contributor Author

asrimanth commented Feb 8, 2024

Pushed a new commit to fix the missing keys issue. Please have a look and let me know.
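
For context on this class of missing/unexpected key errors: the checkpointing hook saves the LoRA weights in the diffusers serialization format, so on resume they generally have to be remapped to peft-style keys before being loaded into the wrapped UNet. A rough sketch of that load path, assuming a peft-backed UNet (input_dir and unet_ are placeholder names, and this shows the general pattern from the diffusers training examples rather than the exact code in this PR):

from diffusers import StableDiffusionXLPipeline
from diffusers.utils import convert_unet_state_dict_to_peft
from peft import set_peft_model_state_dict

# load the LoRA weights saved by the checkpointing hook (diffusers format)
lora_state_dict, _ = StableDiffusionXLPipeline.lora_state_dict(input_dir)

# keep only the UNet entries and strip the "unet." prefix
unet_state_dict = {
    k.replace("unet.", ""): v for k, v in lora_state_dict.items() if k.startswith("unet.")
}

# remap diffusers-style keys to peft-style keys and load them into the adapter
unet_state_dict = convert_unet_state_dict_to_peft(unet_state_dict)
incompatible = set_peft_model_state_dict(unet_, unet_state_dict, adapter_name="default")
if incompatible is not None and getattr(incompatible, "unexpected_keys", None):
    raise ValueError(f"Unexpected keys when loading the LoRA adapter: {incompatible.unexpected_keys}")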

@@ -305,7 +310,7 @@ def parse_args():
     parser.add_argument(
         "--cache_dir",
         type=str,
-        default=None,
+        default="/workspace/cache",
Member

We don't need to change this.

Contributor Author

@asrimanth Feb 8, 2024

Forgot to remove it after local testing. Reverted it to None and pushed a commit. Sorry for the inconvenience.

@sayakpaul
Member

Thanks so much for iterating. I have checked and it works: https://colab.research.google.com/gist/sayakpaul/fd2e863c9911031ad01fa9cf6863a5da/scratchpad.ipynb.

Will merge the PR once the test suite passes.

@asrimanth
Contributor Author

Thank you for the consistent feedback. Happy to contribute to HuggingFace.

@sayakpaul merged commit a11b0f8 into huggingface:main on Feb 8, 2024
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request on Apr 26, 2024:
…gingface#6840)

* Fix: training resume from fp16 for lcm distill lora sdxl

* Fix coding quality - run linter

* Fix 1 - shift mixed precision cast before optimizer

* Fix 2 - State dict errors by removing load_lora_into_unet

* Update train_lcm_distill_lora_sdxl.py - Revert default cache dir to None

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>