
[BUG] KeyError in stage_1_and_2.py when training dreambooth with deepspeed (in kohya_ss) #3718

Open
me-fraud opened this issue Jun 8, 2023 · 7 comments

@me-fraud

me-fraud commented Jun 8, 2023

Hello!

I've encountered an issue trying to run dreambooth training with deepspeed in kohya_ss.

I am running into an error which seems to occur inside DeepSpeed's stage_1_and_2.py, lines 508-509:
lp_name = self.param_names[lp]
param_mapping_per_group[lp_name] = lp._hp_mapping.get_hp_fragment_address()

Additionally, I tried wrapping these lines in a try/except to see what would happen, but then ran into issues further along in engine.py (though I'm not sure whether that is related).
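My (unverified) reading of the DeepSpeed source is that DeepSpeedEngine builds its param_names dict only from the named parameters of the model it wraps, while ZeRO stage 1/2 later looks up every optimizer parameter in that dict. If the optimizer also holds parameters from a module that never got passed to deepspeed.initialize (here both unet and text_encoder go into accelerator.prepare, but only a single module can end up wrapped), that lookup fails with exactly this kind of KeyError on the raw tensor. A minimal, self-contained sketch of that mismatch (the modules below are placeholders, not the actual kohya_ss objects):

import torch
import torch.nn as nn

# Placeholders standing in for the wrapped model and a second, unwrapped model.
unet = nn.Linear(4, 4)
text_encoder = nn.Linear(4, 4)

# Simplified version of what DeepSpeedEngine does internally: map every
# parameter of the wrapped model to its name.
param_names = {p: n for n, p in unet.named_parameters()}

# The optimizer is built over both modules (as kohya_ss does when
# train_text_encoder is enabled).
optimizer = torch.optim.AdamW(
    list(unet.parameters()) + list(text_encoder.parameters()), lr=1e-4
)

# ZeRO stage 1/2 effectively does `self.param_names[lp]` for each parameter;
# parameters of the unwrapped module are missing from the dict, so that lookup
# raises a KeyError that prints the raw tensor, just like the traceback below.
for group in optimizer.param_groups:
    for p in group["params"]:
        if p not in param_names:
            print("parameter missing from param_names -> KeyError in _create_param_mapping")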

My configuration is:
1x RTX 3060 GPU (12 GB VRAM)
WSL2 Ubuntu 22.04 on Windows 11
CUDA 11.7
Python 3.10.6
Torch 2.0.1+cu117
Accelerate 0.19.0
DeepSpeed 0.8.3 (the problem is the same with 0.9.3)
Mixed precision in the training settings is set to fp16

DeepSpeed configuration JSON:
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
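Side note: since this JSON is maintained by hand, I run a quick parse check on it before launching; the sketch below assumes the file is saved as ds_config.json (a placeholder path).

import json

# Hypothetical sanity check, not part of kohya_ss: confirm the DeepSpeed config
# file parses as JSON before launching training.
with open("ds_config.json") as f:
    ds_config = json.load(f)  # raises json.JSONDecodeError on e.g. a missing comma

print(ds_config["zero_optimization"]["stage"])  # expect: 2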

console output:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/me/kohya_ss/train_db.py:482 in <module> │
│ │
│ 479 │ args = parser.parse_args() │
│ 480 │ args = train_util.read_config_from_file(args, parser) │
│ 481 │ │
│ ❱ 482 │ train(args) │
│ 483 │
│ │
│ /home/me/kohya_ss/train_db.py:202 in train │
│ │
│ 199 │ │
│ 200 │ # acceleratorがなんかよろしくやってくれるらしい │
│ 201 │ if train_text_encoder: │
│ ❱ 202 │ │ unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prep │
│ 203 │ │ │ unet, text_encoder, optimizer, train_dataloader, lr_scheduler │
│ 204 │ │ ) │
│ 205 │ else: │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py:1139 in prepare │
│ │
│ 1136 │ │ │ if self.device.type == "cpu" and self.state.ipex_plugin is not None: │
│ 1137 │ │ │ │ args = self._prepare_ipex(*args) │
│ 1138 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 1139 │ │ │ result = self._prepare_deepspeed(*args) │
│ 1140 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 1141 │ │ │ result = self._prepare_megatron_lm(*args) │
│ 1142 │ │ else: │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py:1446 in │
│ _prepare_deepspeed │
│ │
│ 1443 │ │ │ │ │ │ if type(scheduler).__name__ in deepspeed.runtime.lr_schedules.VA │
│ 1444 │ │ │ │ │ │ │ kwargs["lr_scheduler"] = scheduler │
│ 1445 │ │ │ │
│ ❱ 1446 │ │ │ engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs) │
│ 1447 │ │ │ if optimizer is not None: │
│ 1448 │ │ │ │ optimizer = DeepSpeedOptimizerWrapper(optimizer) │
│ 1449 │ │ │ if scheduler is not None: │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/__init__.py:125 in initialize │
│ │
│ 122 │ assert model is not None, "deepspeed.initialize requires a model" │
│ 123 │ │
│ 124 │ if not isinstance(model, PipelineModule): │
│ ❱ 125 │ │ engine = DeepSpeedEngine(args=args, │
│ 126 │ │ │ │ │ │ │ │ model=model, │
│ 127 │ │ │ │ │ │ │ │ optimizer=optimizer, │
│ 128 │ │ │ │ │ │ │ │ model_parameters=model_parameters, │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py:340 in __init__ │
│ │
│ 337 │ │ │ model_parameters = list(model_parameters) │
│ 338 │ │ │
│ 339 │ │ if has_optimizer: │
│ ❱ 340 │ │ │ self._configure_optimizer(optimizer, model_parameters) │
│ 341 │ │ │ self._configure_lr_scheduler(lr_scheduler) │
│ 342 │ │ │ self._report_progress(0) │
│ 343 │ │ elif self.zero_optimization(): │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1298 in │
│ _configure_optimizer │
│ │
│ 1295 │ │ optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer) │
│ 1296 │ │ │
│ 1297 │ │ if optimizer_wrapper == ZERO_OPTIMIZATION: │
│ ❱ 1298 │ │ │ self.optimizer = self._configure_zero_optimizer(basic_optimizer) │
│ 1299 │ │ elif optimizer_wrapper == AMP: │
│ 1300 │ │ │ amp_params = self.amp_params() │
│ 1301 │ │ │ log_dist(f"Initializing AMP with these params: {amp_params}", ranks=[0]) │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1547 in │
│ _configure_zero_optimizer │
│ │
│ 1544 │ │ │ │ │ │ "Pipeline parallelism does not support overlapped communication, │
│ 1545 │ │ │ │ │ ) │
│ 1546 │ │ │ │ │ overlap_comm = False │
│ ❱ 1547 │ │ │ optimizer = DeepSpeedZeroOptimizer( │
│ 1548 │ │ │ │ optimizer, │
│ 1549 │ │ │ │ self.param_names, │
│ 1550 │ │ │ │ timers=timers, │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:527 │
│ in __init__ │
│ │
│ 524 │ │ │
│ 525 │ │ self._link_all_hp_params() │
│ 526 │ │ self._enable_universal_checkpoint() │
│ ❱ 527 │ │ self._param_slice_mappings = self._create_param_mapping() │
│ 528 │ │
│ 529 │ def _enable_universal_checkpoint(self): │
│ 530 │ │ for lp_param_group in self.bit16_groups: │
│ │
│ /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:539 │
│ in _create_param_mapping │
│ │
│ 536 │ │ │ param_mapping_per_group = OrderedDict() │
│ 537 │ │ │ for lp in self.bit16_groups[i]: │
│ 538 │ │ │ │ if lp._hp_mapping is not None: │
│ ❱ 539 │ │ │ │ │ lp_name = self.param_names[lp] │
│ 540 │ │ │ │ │ param_mapping_per_group[ │
│ 541 │ │ │ │ │ │ lp_name] = lp._hp_mapping.get_hp_fragment_address() │
│ 542 │ │ │ param_mapping.append(param_mapping_per_group) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: Parameter containing:
tensor([[[[-2.5410e-02, 2.5043e-02, 7.1978e-02],
[-1.3399e-02, -1.3034e-01, 1.1476e-01],
[-9.7030e-03, -1.3150e-02, 2.8044e-02]],

     [[ 3.8610e-02, -6.0800e-02,  3.4550e-03],
      [ 1.3344e-01, -1.0869e-01, -3.8528e-02],
      [ 7.1333e-03, -4.9282e-03,  1.3061e-02]],

     [[ 1.6261e-02, -1.8879e-02,  3.4788e-02],
      [-1.9644e-02,  2.3328e-02,  4.0197e-02],
      [-2.4416e-03, -5.7235e-03, -2.1267e-02]],

     [[ 2.1210e-02, -3.1675e-02,  1.7455e-02],
      [ 2.9178e-02, -8.6820e-02,  6.4746e-02],
      [ 3.3720e-03, -2.1977e-02,  2.1647e-02]]],


    [[[ 8.9986e-03, -1.0205e-02, -3.0476e-02],
      [ 7.5455e-03,  1.9113e-02,  8.7913e-02],
      [-9.9675e-05,  3.3088e-03,  1.4712e-02]],

     [[ 1.0064e-03,  8.5808e-03, -7.5712e-03],
      [-5.1672e-03,  5.0153e-02,  9.8676e-03],
      [-9.5510e-03,  1.8238e-02,  2.5396e-02]],

     [[-2.0479e-03,  2.9961e-02,  3.1176e-04],
      [ 1.8082e-02, -1.2043e-01,  6.9264e-03],
      [ 1.6751e-02, -3.0182e-02,  5.0824e-04]],

     [[-2.3328e-03, -2.6728e-02,  1.2321e-02],
      [-2.6235e-02,  4.4914e-02, -5.8993e-03],
      [-1.9181e-02,  1.2548e-02, -2.2108e-02]]],


    [[[-4.5197e-02, -4.5439e-02, -1.7462e-02],
      [ 3.6725e-02,  5.2502e-02, -8.2642e-03],
      [ 1.5603e-02,  2.8736e-02,  3.5283e-02]],

     [[ 2.8639e-03,  4.7068e-02,  2.3455e-02],
      [ 3.3651e-02, -8.0247e-02, -3.7098e-02],
      [-2.3571e-02,  9.5956e-03, -2.3156e-03]],

     [[ 3.7085e-03,  5.3470e-02, -3.7420e-03],
      [-4.5891e-02,  1.0218e-01, -3.4633e-02],
      [ 3.6263e-04, -4.9104e-02, -6.7825e-03]],

     [[-8.8059e-03, -1.0560e-02,  1.6182e-02],
      [-2.8848e-02,  3.9407e-02,  4.2363e-02],
      [-3.3508e-02,  2.1224e-02, -1.5888e-02]]],


    ...,


    [[[-6.4073e-03,  4.9710e-02, -1.2341e-02],
      [ 2.8699e-02,  9.1004e-02, -1.6671e-02],
      [ 1.0349e-02, -2.1209e-02,  1.2168e-02]],

     [[-9.1062e-04,  1.8167e-02,  3.1846e-02],
      [ 1.6944e-02,  8.0092e-02, -2.9738e-02],
      [-9.8830e-03,  4.9885e-02, -1.2700e-02]],

     [[ 1.2020e-02, -6.9198e-03,  1.8985e-02],
      [ 3.4355e-02, -3.0471e-02,  6.8234e-02],
      [ 8.9503e-03, -2.0973e-02,  7.2100e-02]],

     [[ 4.8034e-02,  5.1167e-02,  5.4771e-02],
      [ 4.2249e-02, -6.3657e-02,  1.5984e-02],
      [ 2.4852e-02, -2.7481e-02,  3.8589e-02]]],


    [[[-7.6125e-03,  1.9348e-02,  9.0864e-03],
      [ 5.3428e-02, -5.3440e-02,  3.8909e-02],
      [ 9.1220e-03,  4.8850e-02, -6.4069e-02]],

     [[-3.5119e-02, -2.3940e-02,  2.8393e-03],
      [ 1.6208e-02,  5.4991e-02,  4.0262e-02],
      [ 1.8849e-03,  3.1963e-02, -8.6199e-03]],

     [[ 4.8231e-02, -1.8867e-02,  3.4218e-02],
      [-2.5558e-03,  4.6577e-02, -3.2624e-03],
      [-2.2854e-02, -6.9764e-02, -6.9344e-02]],

     [[-9.7222e-03,  1.2386e-03, -2.0934e-02],
      [ 1.5920e-02, -2.0209e-02,  4.4601e-02],
      [ 2.6301e-02,  3.7617e-02,  1.3492e-03]]],


    [[[-1.0117e-01, -9.8085e-02, -1.1981e-02],
      [ 7.8942e-02, -3.0194e-02,  4.1531e-02],
      [ 4.0931e-02,  2.9909e-02,  4.6317e-02]],

     [[ 1.1388e-01,  4.9862e-02,  1.2630e-02],
      [ 8.5883e-02,  3.8244e-03, -1.8867e-02],
      [-7.1834e-02, -9.0345e-03, -5.3052e-02]],

     [[-5.2167e-03, -4.4715e-02, -2.2235e-02],
      [-8.5760e-03,  2.1861e-02, -1.8662e-02],
      [ 4.5497e-03,  1.9903e-02, -1.7304e-03]],

     [[-2.3223e-02, -4.5166e-02, -8.9723e-03],
      [ 2.2507e-02,  1.0017e-03,  2.8759e-02],
      [ 3.7623e-02,  6.9246e-03,  2.1055e-02]]]], device='cuda:0',
   requires_grad=True)

[02:21:28] ERROR failed (exitcode: 1) local_rank: 0 (pid: 129535) of binary: /home/me/kohya_ss/venv/bin/python3

@me-fraud me-fraud added the bug (Something isn't working) and training labels on Jun 8, 2023
@me-fraud me-fraud changed the title from "[BUG]" to "[BUG] KeyError in stage_1_and_2.py when training dreambooth with deepspeed (in kohya_ss)" on Jun 8, 2023
@congchan

congchan commented Jul 5, 2023

Same issue here with a CodeGen model.

@zedong-mt

Has anybody solved this problem?

@memray

memray commented Sep 6, 2023

Ran into the same issue.

@Ting011

Ting011 commented Oct 16, 2023

Same issue. I'd appreciate any hint.

@mumianyuxin

Same issue. Has anybody solved this problem?

@whcjb

whcjb commented Nov 7, 2024

Same issue.

@tjruwase tjruwase self-assigned this Nov 9, 2024
@tjruwase
Contributor

tjruwase commented Nov 9, 2024

@whcjb, can you please share full repro details? Thanks!
