
[BUG] DS Inference Bloom OOM / get_sd_loader_json() missing 1 argument #2222

Closed
oborchers opened this issue Aug 16, 2022 · 7 comments
Labels: bug (Something isn't working), inference


oborchers commented Aug 16, 2022

Describe the bug

When loading the BLOOM model, the loader tries to allocate too much memory on a single GPU, so the script OOMs.

I think this is also related to: #2169

Hardware

5x A100 80GB + 512 GB RAM (should be enough, as the same setup works with plain accelerate).

To Reproduce

Copied and mildly adapted from

Run with:

deepspeed --num_gpus 5 script.py

(DeepSpeed is built from the main branch.)

Case 1:

# imports as in the adapted script; this snippet runs inside a class method,
# so self.config and ds_config are defined elsewhere
import torch
import deepspeed
import torch.distributed as dist
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig

dtype = torch.float16
dschf = HfDeepSpeedConfig(ds_config)

with deepspeed.OnDevice(dtype=dtype, device='meta'):
    model = AutoModelForCausalLM.from_config(self.config, torch_dtype=torch.bfloat16)
model = model.eval()

ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()
model = ds_engine.module

dist.barrier()

model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=dtype,
                                 checkpoint='/workspace/bloom-176B.json',
                                 replace_with_kernel_inject=True
                                 )
self.model = model.module

This results in:

RuntimeError: CUDA out of memory. Tried to allocate 1.53 GiB (GPU 2; 79.17 GiB total capacity; 76.57 GiB already allocated; 932.00 MiB free; 77.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Note that this is CUDA device 2 of 5!
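
For context, the checkpoint descriptor passed above as checkpoint='/workspace/bloom-176B.json' follows the {"type", "checkpoints", "version"} format used by the BigScience inference script; a minimal sketch of how it can be generated (the shard directory /workspace/bloom is a placeholder):

import json
from pathlib import Path

def write_checkpoints_json(model_path, out_file):
    # DS-Inference expects a JSON descriptor listing every checkpoint shard
    shards = sorted(str(p) for p in Path(model_path).glob("*.bin"))
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump({"type": "BLOOM", "checkpoints": shards, "version": 1.0}, f)

write_checkpoints_json("/workspace/bloom", "/workspace/bloom-176B.json")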

Case 2:

Based on #2169

injection_policy={transformers.models.bloom.modeling_bloom.BloomBlock: ('self_attention.dense', 'mlp.dense_4h_to_h')}
replace_with_kernel_inject=False

results in:

    model.load()
  File "/workspace/bloom.py", line 106, in load
    injection_policy={transformers.models.bloom.modeling_bloom.BloomBlock: ('self_attention.dense', 'mlp.dense_4h_to_h')}
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/__init__.py", line 312, in init_inference
    save_mp_checkpoint_path)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/inference/engine.py", line 104, in __init__
    self._load_checkpoint(self.checkpoint)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/inference/engine.py", line 394, in _load_checkpoint
    sd_loader = SDLoaderFactory.get_sd_loader_json(load_dir)
TypeError: get_sd_loader_json() missing 1 required positional argument: 'checkpoint_engine'
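
(For completeness, bloom.py line 106 is the init_inference call; reconstructed roughly, with everything else as in case 1:)

model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=dtype,
                                 checkpoint='/workspace/bloom-176B.json',
                                 injection_policy={transformers.models.bloom.modeling_bloom.BloomBlock: ('self_attention.dense', 'mlp.dense_4h_to_h')},
                                 replace_with_kernel_inject=False
                                 )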

Tagging @RezaYazdaniAminabadi because I know of his active involvement in this one.

oborchers added the bug label on Aug 16, 2022
@felix-schneider

For case 2: commit 556f005 changed get_sd_loader_json so that it now requires a checkpoint_engine argument, but the call site in inference/engine.py was not updated to pass one. I didn't dive deep into the code, but the argument does seem to be required. Maybe one of the authors of that commit @RezaYazdaniAminabadi @jeffra @tjruwase can help with this.
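
For reference, a sketch of what the failing call presumably needs now (untested; assuming the default torch checkpoint engine is the right choice here):

# deepspeed/inference/engine.py, in _load_checkpoint (sketch, untested)
from deepspeed.runtime.checkpoint_engine.torch_checkpoint_engine import TorchCheckpointEngine

sd_loader = SDLoaderFactory.get_sd_loader_json(load_dir, TorchCheckpointEngine())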

@oborchers (Author)

I was able to retest on 8x A100. Same issue, unfortunately. It works just fine with accelerate, though. I will try to rebuild from scratch and test again. The failure occurs when launching with deepspeed x.py, not with python x.py.

@oborchers (Author)

The error seems to have been on my side for case 1. I was able to run https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/scripts/inference/bloom-ds-inference.py

@RezaYazdaniAminabadi (Contributor)

Hi @oborchers,
Thanks for the update; happy to see that case 1 is resolved. Unfortunately, case 2 does not work for this model right now. We will try to bring this support back after fixing some issues with checkpoint loading.
Thanks,
Reza


asafkar commented Feb 1, 2023

Hi @RezaYazdaniAminabadi,
do you have an ETA for case 2?
As far as I understand, without such a fix there is currently no way to use DS inference without kernel injection, right?

@RezaYazdaniAminabadi (Contributor)

Hi @asafkar,
I am afraid it is not easy to do that without kernels, since it requires some changes on the BLOOM attention side to deal with the alibi tensor. May I ask why this feature is important to you and why you cannot use the kernel-injection method?
Thanks,
Reza


asafkar commented Feb 2, 2023

Hi @RezaYazdaniAminabadi ,
I need this feature in order to run on non-GPU accelerators (mainly to use the TP/PP capabilities).
Currently, this does not work for CPUs.

Currently, it seems that the engine's _load_checkpoint function does not work in this case; perhaps it could be skipped by instead calling module_inject's load_model_with_checkpoint later on (i.e., by passing the checkpoint down to the _apply_injection_policy function), roughly as sketched below.
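
Roughly what I have in mind (pseudocode only; the checkpoint keyword on _apply_injection_policy is hypothetical):

# inside InferenceEngine.__init__ (pseudocode, untested)
if checkpoint is not None and not replace_with_kernel_inject:
    # skip self._load_checkpoint(checkpoint), which currently fails here, and
    # instead thread the checkpoint through the injection path, so that
    # module_inject's load_model_with_checkpoint loads weights per replaced module
    self._apply_injection_policy(injection_policy, checkpoint=checkpoint)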

Thanks
