
[BUG] KeyError in stage_1_and_2.py when training dreambooth with deepspeed (in kohya_ss) #3718

Open
me-fraud opened this issue Jun 8, 2023 · 7 comments

@me-fraud

me-fraud commented Jun 8, 2023

Hello!

I've encountered an issue trying to run dreambooth training with deepspeed in kohya_ss.

I am running into an error which seems to occur inside DeepSpeed's stage_1_and_2.py, lines 508-509:
lp_name = self.param_names[lp]
param_mapping_per_group[lp_name] = lp._hp_mapping.get_hp_fragment_address()

Additionally, I tried wrapping these lines in a try/except to see what would happen, but then ran into issues further along in engine.py (though I'm not sure whether that is related).
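My (unverified) reading of the DeepSpeed source is that DeepSpeedEngine builds its param_names dict only from the named parameters of the model it wraps, while ZeRO stage 1/2 later looks up every optimizer parameter in that dict. If the optimizer also holds parameters from a module that never got passed to deepspeed.initialize (here both unet and text_encoder go into accelerator.prepare, but only a single module can end up wrapped), that lookup fails with exactly this kind of KeyError on the raw tensor. A minimal, self-contained sketch of that mismatch (the modules below are placeholders, not the actual kohya_ss objects):

import torch
import torch.nn as nn

# Placeholders standing in for the wrapped model and a second, unwrapped model.
unet = nn.Linear(4, 4)
text_encoder = nn.Linear(4, 4)

# Simplified version of what DeepSpeedEngine does internally: map every
# parameter of the wrapped model to its name.
param_names = {p: n for n, p in unet.named_parameters()}

# The optimizer is built over both modules (as kohya_ss does when
# train_text_encoder is enabled).
optimizer = torch.optim.AdamW(
    list(unet.parameters()) + list(text_encoder.parameters()), lr=1e-4
)

# ZeRO stage 1/2 effectively does `self.param_names[lp]` for each parameter;
# parameters of the unwrapped module are missing from the dict, so that lookup
# raises a KeyError that prints the raw tensor, just like the traceback below.
for group in optimizer.param_groups:
    for p in group["params"]:
        if p not in param_names:
            print("parameter missing from param_names -> KeyError in _create_param_mapping")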

My configuration is:
1x RTX 3060 GPU (12 GB VRAM)
WSL2 Ubuntu 22.04 on Windows 11
CUDA 11.7
Python 3.10.6
Torch 2.0.1+cu117
Accelerate 0.19.0
DeepSpeed 0.8.3 (the problem is the same with 0.9.3)
Mixed precision in the training settings is set to fp16

DeepSpeed configuration JSON:
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
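Side note: since this JSON is maintained by hand, I run a quick parse check on it before launching; the sketch below assumes the file is saved as ds_config.json (a placeholder path).

import json

# Hypothetical sanity check, not part of kohya_ss: confirm the DeepSpeed config
# file parses as JSON before launching training.
with open("ds_config.json") as f:
    ds_config = json.load(f)  # raises json.JSONDecodeError on e.g. a missing comma

print(ds_config["zero_optimization"]["stage"])  # expect: 2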

console output:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/me/kohya_ss/train_db.py:482 in <module> │
│ │
│ 479 │ args = parser.parse_args() │
│ 480 │ args = train_util.read_config_from_file(args, parser) │
│ 481 │ │
│ ❱ 482 │ train(args) │
│ 483 │
│ │
│ /home/me/kohya_ss/train_db.py:202 in train │
│ │
│ 199 │ │
│ 200 │ # acceleratorがなんかよろしくやってくれるらしい │
│ 201 │ if train_text_encoder: │
│ ❱ 202 │ │ unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prep │
│ 203 │ │ │ unet, text_encoder, optimizer, train_dataloader, lr_scheduler │
│ 204 │ │ ) │
│ 205 │ else: │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py:1139 in prepare │
│ │
│ 1136 │ │ │ if self.device.type == "cpu" and self.state.ipex_plugin is not None: │
│ 1137 │ │ │ │ args = self._prepare_ipex(*args) │
│ 1138 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 1139 │ │ │ result = self._prepare_deepspeed(*args) │
│ 1140 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 1141 │ │ │ result = self._prepare_megatron_lm(*args) │
│ 1142 │ │ else: │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py:1446 in │
│ _prepare_deepspeed │
│ │
│ 1443 │ │ │ │ │ │ if type(scheduler).__name__ in deepspeed.runtime.lr_schedules.VA │
│ 1444 │ │ │ │ │ │ │ kwargs["lr_scheduler"] = scheduler │
│ 1445 │ │ │ │
│ ❱ 1446 │ │ │ engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs) │
│ 1447 │ │ │ if optimizer is not None: │
│ 1448 │ │ │ │ optimizer = DeepSpeedOptimizerWrapper(optimizer) │
│ 1449 │ │ │ if scheduler is not None: │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/__init__.py:125 in initialize │
│ │
│ 122 │ assert model is not None, "deepspeed.initialize requires a model" │
│ 123 │ │
│ 124 │ if not isinstance(model, PipelineModule): │
│ ❱ 125 │ │ engine = DeepSpeedEngine(args=args, │
│ 126 │ │ │ │ │ │ │ │ model=model, │
│ 127 │ │ │ │ │ │ │ │ optimizer=optimizer, │
│ 128 │ │ │ │ │ │ │ │ model_parameters=model_parameters, │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py:340 in __init__ │
│ │
│ 337 │ │ │ model_parameters = list(model_parameters) │
│ 338 │ │ │
│ 339 │ │ if has_optimizer: │
│ ❱ 340 │ │ │ self._configure_optimizer(optimizer, model_parameters) │
│ 341 │ │ │ self._configure_lr_scheduler(lr_scheduler) │
│ 342 │ │ │ self._report_progress(0) │
│ 343 │ │ elif self.zero_optimization(): │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1298 in │
│ _configure_optimizer │
│ │
│ 1295 │ │ optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer) │
│ 1296 │ │ │
│ 1297 │ │ if optimizer_wrapper == ZERO_OPTIMIZATION: │
│ ❱ 1298 │ │ │ self.optimizer = self._configure_zero_optimizer(basic_optimizer) │
│ 1299 │ │ elif optimizer_wrapper == AMP: │
│ 1300 │ │ │ amp_params = self.amp_params() │
│ 1301 │ │ │ log_dist(f"Initializing AMP with these params: {amp_params}", ranks=[0]) │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1547 in │
│ _configure_zero_optimizer │
│ │
│ 1544 │ │ │ │ │ │ "Pipeline parallelism does not support overlapped communication, │
│ 1545 │ │ │ │ │ ) │
│ 1546 │ │ │ │ │ overlap_comm = False │
│ ❱ 1547 │ │ │ optimizer = DeepSpeedZeroOptimizer( │
│ 1548 │ │ │ │ optimizer, │
│ 1549 │ │ │ │ self.param_names, │
│ 1550 │ │ │ │ timers=timers, │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:527 │
│ in __init__ │
│ │
│ 524 │ │ │
│ 525 │ │ self._link_all_hp_params() │
│ 526 │ │ self._enable_universal_checkpoint() │
│ ❱ 527 │ │ self._param_slice_mappings = self._create_param_mapping() │
│ 528 │ │
│ 529 │ def _enable_universal_checkpoint(self): │
│ 530 │ │ for lp_param_group in self.bit16_groups: │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:539 │
│ in _create_param_mapping │
│ │
│ 536 │ │ │ param_mapping_per_group = OrderedDict() │
│ 537 │ │ │ for lp in self.bit16_groups[i]: │
│ 538 │ │ │ │ if lp._hp_mapping is not None: │
│ ❱ 539 │ │ │ │ │ lp_name = self.param_names[lp] │
│ 540 │ │ │ │ │ param_mapping_per_group[ │
│ 541 │ │ │ │ │ │ lp_name] = lp._hp_mapping.get_hp_fragment_address() │
│ 542 │ │ │ param_mapping.append(param_mapping_per_group) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: Parameter containing:
tensor([[[[-2.5410e-02, 2.5043e-02, 7.1978e-02],
[-1.3399e-02, -1.3034e-01, 1.1476e-01],
[-9.7030e-03, -1.3150e-02, 2.8044e-02]],

     [[ 3.8610e-02, -6.0800e-02,  3.4550e-03],
      [ 1.3344e-01, -1.0869e-01, -3.8528e-02],
      [ 7.1333e-03, -4.9282e-03,  1.3061e-02]],

     [[ 1.6261e-02, -1.8879e-02,  3.4788e-02],
      [-1.9644e-02,  2.3328e-02,  4.0197e-02],
      [-2.4416e-03, -5.7235e-03, -2.1267e-02]],

     [[ 2.1210e-02, -3.1675e-02,  1.7455e-02],
      [ 2.9178e-02, -8.6820e-02,  6.4746e-02],
      [ 3.3720e-03, -2.1977e-02,  2.1647e-02]]],


    [[[ 8.9986e-03, -1.0205e-02, -3.0476e-02],
      [ 7.5455e-03,  1.9113e-02,  8.7913e-02],
      [-9.9675e-05,  3.3088e-03,  1.4712e-02]],

     [[ 1.0064e-03,  8.5808e-03, -7.5712e-03],
      [-5.1672e-03,  5.0153e-02,  9.8676e-03],
      [-9.5510e-03,  1.8238e-02,  2.5396e-02]],

     [[-2.0479e-03,  2.9961e-02,  3.1176e-04],
      [ 1.8082e-02, -1.2043e-01,  6.9264e-03],
      [ 1.6751e-02, -3.0182e-02,  5.0824e-04]],

     [[-2.3328e-03, -2.6728e-02,  1.2321e-02],
      [-2.6235e-02,  4.4914e-02, -5.8993e-03],
      [-1.9181e-02,  1.2548e-02, -2.2108e-02]]],


    [[[-4.5197e-02, -4.5439e-02, -1.7462e-02],
      [ 3.6725e-02,  5.2502e-02, -8.2642e-03],
      [ 1.5603e-02,  2.8736e-02,  3.5283e-02]],

     [[ 2.8639e-03,  4.7068e-02,  2.3455e-02],
      [ 3.3651e-02, -8.0247e-02, -3.7098e-02],
      [-2.3571e-02,  9.5956e-03, -2.3156e-03]],

     [[ 3.7085e-03,  5.3470e-02, -3.7420e-03],
      [-4.5891e-02,  1.0218e-01, -3.4633e-02],
      [ 3.6263e-04, -4.9104e-02, -6.7825e-03]],

     [[-8.8059e-03, -1.0560e-02,  1.6182e-02],
      [-2.8848e-02,  3.9407e-02,  4.2363e-02],
      [-3.3508e-02,  2.1224e-02, -1.5888e-02]]],


    ...,


    [[[-6.4073e-03,  4.9710e-02, -1.2341e-02],
      [ 2.8699e-02,  9.1004e-02, -1.6671e-02],
      [ 1.0349e-02, -2.1209e-02,  1.2168e-02]],

     [[-9.1062e-04,  1.8167e-02,  3.1846e-02],
      [ 1.6944e-02,  8.0092e-02, -2.9738e-02],
      [-9.8830e-03,  4.9885e-02, -1.2700e-02]],

     [[ 1.2020e-02, -6.9198e-03,  1.8985e-02],
      [ 3.4355e-02, -3.0471e-02,  6.8234e-02],
      [ 8.9503e-03, -2.0973e-02,  7.2100e-02]],

     [[ 4.8034e-02,  5.1167e-02,  5.4771e-02],
      [ 4.2249e-02, -6.3657e-02,  1.5984e-02],
      [ 2.4852e-02, -2.7481e-02,  3.8589e-02]]],


    [[[-7.6125e-03,  1.9348e-02,  9.0864e-03],
      [ 5.3428e-02, -5.3440e-02,  3.8909e-02],
      [ 9.1220e-03,  4.8850e-02, -6.4069e-02]],

     [[-3.5119e-02, -2.3940e-02,  2.8393e-03],
      [ 1.6208e-02,  5.4991e-02,  4.0262e-02],
      [ 1.8849e-03,  3.1963e-02, -8.6199e-03]],

     [[ 4.8231e-02, -1.8867e-02,  3.4218e-02],
      [-2.5558e-03,  4.6577e-02, -3.2624e-03],
      [-2.2854e-02, -6.9764e-02, -6.9344e-02]],

     [[-9.7222e-03,  1.2386e-03, -2.0934e-02],
      [ 1.5920e-02, -2.0209e-02,  4.4601e-02],
      [ 2.6301e-02,  3.7617e-02,  1.3492e-03]]],


    [[[-1.0117e-01, -9.8085e-02, -1.1981e-02],
      [ 7.8942e-02, -3.0194e-02,  4.1531e-02],
      [ 4.0931e-02,  2.9909e-02,  4.6317e-02]],

     [[ 1.1388e-01,  4.9862e-02,  1.2630e-02],
      [ 8.5883e-02,  3.8244e-03, -1.8867e-02],
      [-7.1834e-02, -9.0345e-03, -5.3052e-02]],

     [[-5.2167e-03, -4.4715e-02, -2.2235e-02],
      [-8.5760e-03,  2.1861e-02, -1.8662e-02],
      [ 4.5497e-03,  1.9903e-02, -1.7304e-03]],

     [[-2.3223e-02, -4.5166e-02, -8.9723e-03],
      [ 2.2507e-02,  1.0017e-03,  2.8759e-02],
      [ 3.7623e-02,  6.9246e-03,  2.1055e-02]]]], device='cuda:0',
   requires_grad=True)

[02:21:28] ERROR failed (exitcode: 1) local_rank: 0 (pid: 129535) of binary: /home/me/kohya_ss/venv/bin/python3

@me-fraud me-fraud added the bug (Something isn't working) and training labels on Jun 8, 2023
@me-fraud me-fraud changed the title from "[BUG]" to "[BUG] KeyError in stage_1_and_2.py when training dreambooth with deepspeed (in kohya_ss)" on Jun 8, 2023
@congchan

congchan commented Jul 5, 2023

Same issue here with a CodeGen model.

@zedong-mt

Has anybody solved this problem?

@memray

memray commented Sep 6, 2023

Ran into the same issue.

@Ting011

Ting011 commented Oct 16, 2023

Same issue. I'd appreciate any hint.

@mumianyuxin

Same issue. Has anybody solved this problem?

@whcjb

whcjb commented Nov 7, 2024

Same issue.

@tjruwase tjruwase self-assigned this Nov 9, 2024
@tjruwase
Contributor

tjruwase commented Nov 9, 2024

@whcjb, can you please share full repro details? Thanks!
