ValueError: Attempting to unscale FP16 gradients. #6442

Closed
loboere opened this issue Jan 3, 2024 · 10 comments · Fixed by #6751
Comments

@loboere

loboere commented Jan 3, 2024

I am trying to resume training a LoRA for SDXL, but resuming fails with ValueError: Attempting to unscale FP16 gradients.
The first training run works fine; the error only appears when I resume from a checkpoint.

!accelerate launch --mixed_precision="fp16" /content/train_text_to_image_lora_sdxl.py   \
--pretrained_model_name_or_path ${MODEL_NAME} \
--train_data_dir images/ \
--resolution ${RESOLUTION} \
--train_batch_size ${BATCH_SIZE} \
--num_train_epochs ${NUM_STEPS} \
--gradient_accumulation ${GRADIENT_ACCUMULATION} \
--checkpointing_steps 5 \
--resume_from_checkpoint "latest" \
--mixed_precision "fp16" \
--caption_column 'text'

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2024-01-03 21:54:11.134347: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-03 21:54:11.134394: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-03 21:54:11.135989: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-03 21:54:12.546627: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
01/03/2024 21:54:13 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'dynamic_thresholding_ratio', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
{'reverse_transformer_layers_per_block', 'attention_type', 'dropout'} was not found in config. Values will be initialized to default values.
01/03/2024 21:55:36 - INFO - __main__ - ***** Running training *****
01/03/2024 21:55:36 - INFO - __main__ -   Num examples = 1
01/03/2024 21:55:36 - INFO - __main__ -   Num Epochs = 50
01/03/2024 21:55:36 - INFO - __main__ -   Instantaneous batch size per device = 1
01/03/2024 21:55:36 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
01/03/2024 21:55:36 - INFO - __main__ -   Gradient Accumulation steps = 4
01/03/2024 21:55:36 - INFO - __main__ -   Total optimization steps = 50
Resuming from checkpoint checkpoint-35
01/03/2024 21:55:36 - INFO - accelerate.accelerator - Loading states from sd-model-finetuned-lora/checkpoint-35
Loading unet.
01/03/2024 21:55:36 - INFO - peft.tuners.tuners_utils - Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All model weights loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All dataloader sampler states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - GradScaler state loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.checkpointing - All random states loaded successfully
01/03/2024 21:55:38 - INFO - accelerate.accelerator - Loading in 0 custom states
Steps:  70% 35/50 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/train_text_to_image_lora_sdxl.py", line 1261, in <module>
    main(args)
  File "/content/train_text_to_image_lora_sdxl.py", line 1077, in main
    accelerator.clip_grad_norm_(params_to_optimize, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:  70% 35/50 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/train_text_to_image_lora_sdxl.py', '--pretrained_model_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0', '--train_data_dir', 'images/', '--resolution', '1024', '--train_batch_size', '1', '--num_train_epochs', '50', '--gradient_accumulation', '4', '--checkpointing_steps', '5', '--resume_from_checkpoint', 'latest', '--mixed_precision', 'fp16', '--caption_column', 'text']' returned non-zero exit status 1.
@rfan-debug

You'll need to cast the trainable parameters from torch.float16 to torch.float32.

Refer to this code block: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora_sdxl.py#L632-L641
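For readers hitting the same error, here is a minimal sketch of what that block boils down to (the helper name upcast_trainable_params is made up for illustration; the script inlines the loop): the frozen base weights can stay in fp16, but any parameter that will receive gradients has to be upcast to fp32 so that GradScaler can unscale it.

```python
import torch

def upcast_trainable_params(model: torch.nn.Module) -> None:
    """Upcast only the parameters that will receive gradients to fp32.

    Frozen (requires_grad=False) weights are left untouched, so the base
    model can stay in fp16 to save memory while the trainable LoRA params
    live in fp32, which is what GradScaler.unscale_() expects.
    """
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)
```

In the SDXL LoRA script this upcast has to happen after the base unet/text encoders are moved to the fp16 weight dtype and the LoRA adapters are attached, and, as this issue shows, it also needs to cover the weights restored when resuming from a checkpoint.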

@loboere
Author

loboere commented Jan 3, 2024

Excuse me, what should I do then?

@sayakpaul
Member

Should have been fixed by: #6231. Could you please pull in the latest changes?

@AeroDEmi

AeroDEmi commented Jan 9, 2024

I'm on the latest pull, and using the --resume_from_checkpoint flag in train_dreambooth_lora_sdxl.py still raises this error.

@sayakpaul
Member

Cc: @SunMarc

@sayakpaul
Member

This is top of mind for us. We need to solve #6510 first, and then it should be a breeze.

@loboere
Author

loboere commented Jan 14, 2024

Any progress or temporary solution?

@sayakpaul
Member

Refer to #6514 and #6552.

@levi
Contributor

levi commented Jan 16, 2024

@sayakpaul why do the trainable params need to be in float32? Is this a new requirement with the integration of PEFT or has this always been the case? I can't remember needing to do this in the past.

@sayakpaul
Member

That's the case. If you remember, we never manually cast the unet and the other models to which the LoRA params were attached; accelerator.prepare() used to handle that for us.
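A self-contained way to see why (pure PyTorch, assuming a CUDA device; this is an illustrative repro, not code from the training script): GradScaler.unscale_() refuses fp16 gradients, and fp16 gradients are exactly what you get when the trainable parameters themselves are fp16. Keeping the trainable parameters in fp32 and letting autocast handle the fp16 compute avoids the error.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

# Case 1: trainable parameter kept in fp16 -> fp16 gradients -> unscale_() raises,
# which is the failure seen when resuming the LoRA run.
p16 = torch.nn.Parameter(torch.randn(4, 4, device="cuda", dtype=torch.float16))
opt16 = torch.optim.SGD([p16], lr=0.1)
scaler.scale((p16 ** 2).sum()).backward()
try:
    scaler.unscale_(opt16)
except ValueError as e:
    print(e)  # Attempting to unscale FP16 gradients.

# Case 2: trainable parameter in fp32, fp16 compute handled by autocast ->
# the leaf gradient is fp32 and unscale_() succeeds.
w = torch.nn.Parameter(torch.randn(4, 4, device="cuda"))
x = torch.randn(4, 4, device="cuda")
opt32 = torch.optim.SGD([w], lr=0.1)
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = (x @ w).sum()  # matmul runs in fp16 under autocast
scaler.scale(loss).backward()
scaler.unscale_(opt32)  # w.grad is fp32; no error
```

So when resuming restores the LoRA weights in fp16, the optimizer ends up holding fp16 parameters, and the first clip_grad_norm_/unscale_ call fails exactly as in the traceback above.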
