[BUG] DS Inference Bloom OOM / get_sd_loader_json() missing 1 argument #2222
Comments
For case 2: for some reason, 556f005 removed this argument in
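The "missing 1 argument" failure from the issue title is a plain Python `TypeError`: a call site passes fewer positional arguments than the function signature requires, which is what happens when a commit changes a signature without updating (or while removing) an argument at the call site. A minimal sketch of that failure mode — the function name mirrors the title, but the parameters here are illustrative, not the real DeepSpeed signature:

```python
# Hypothetical stand-in for the helper named in the issue title; the
# parameters are illustrative only, not DeepSpeed's actual signature.
def get_sd_loader_json(json_file, checkpoint_engine):
    return json_file, checkpoint_engine

# A call site that still passes only one argument fails at call time:
try:
    get_sd_loader_json("checkpoints.json")
except TypeError as err:
    # e.g. "get_sd_loader_json() missing 1 required positional argument: ..."
    print(type(err).__name__)
```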
I was able to retest on 8 A100s. Same issue, unfortunately. It works just fine with accelerate, though. Will try to rebuild from scratch and test again. Doesn't work with
The error seems to be on my side for case #1. I was able to run https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/scripts/inference/bloom-ds-inference.py
Hi @oborchers,
Hi @RezaYazdaniAminabadi,
Hi @asafkar,
Hi @RezaYazdaniAminabadi, currently it seems that the engine/_load_checkpoint function does not work in this case, and perhaps it can be skipped by adding a call to the module_inject/load_model_with_checkpoint function later on (by passing a checkpoint to the _apply_injection_policy function). Thanks
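The workaround described above — skipping the engine's own checkpoint path and instead loading weights as part of the injection step — can be sketched generically. Everything below is illustrative: the function names are modeled on the ones the comment mentions, and plain dicts stand in for real modules and state dicts; this is not DeepSpeed's actual API.

```python
# Hypothetical sketch: route checkpoint loading through the injection step
# instead of the engine's _load_checkpoint path. Dicts stand in for modules.
def load_model_with_checkpoint(module_weights, checkpoint):
    # Overwrite the module's weights with the checkpoint's tensors.
    module_weights.update(checkpoint)
    return module_weights

def apply_injection_policy(module_weights, checkpoint=None):
    # Kernel injection would happen here; when a checkpoint is supplied,
    # load it during injection rather than via the engine's loader.
    if checkpoint is not None:
        module_weights = load_model_with_checkpoint(module_weights, checkpoint)
    return module_weights

model = {"w": 0.0}
model = apply_injection_policy(model, checkpoint={"w": 1.5})
print(model["w"])  # 1.5
```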
Describe the bug
When loading the bloom model the loader tries to allocate too much to a single GPU. Therefore, the script OOMs.
I think this is also related to: #2169
Hardware
5x A100 80G + 512G RAM (should be enough, as it is working with plain accelerate).
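As a rough sanity check on the hardware (assuming BLOOM-176B in fp16, i.e. 2 bytes per parameter, and ignoring activation and KV-cache overhead), the weights alone need about 352 GB — roughly 70 GB per GPU when sharded evenly across five 80 GB cards. So a correctly sharded load should fit, while a loader that places too much of the model on a single device will OOM:

```python
# Back-of-the-envelope memory check (fp16 weights only; overhead ignored).
params = 176e9            # BLOOM has ~176B parameters
bytes_per_param = 2       # fp16
num_gpus = 5
gpu_mem_gb = 80

total_gb = params * bytes_per_param / 1e9   # ~352 GB of weights
per_gpu_gb = total_gb / num_gpus            # ~70.4 GB per GPU if sharded

print(f"total: {total_gb:.1f} GB, per GPU: {per_gpu_gb:.1f} GB")
print("fits when sharded:", per_gpu_gb < gpu_mem_gb)   # True
print("fits on one GPU:", total_gb < gpu_mem_gb)       # False -> OOM
```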
To Reproduce
Copied and mildly adapted from
Run with:
(DeepSpeed is built from the main branch)
Case 1:
Results in
Note that this is CUDA device 2 out of 5!
Case 2:
Based on #2169
results in:
Tagging @RezaYazdaniAminabadi because I know of his active involvement in this one.