[BUG] Zero-Inference usage error with .init_inference() #2372
Comments
@joaopcm1996, thanks for reporting this issue and apologies for the confusion. ZeRO-Inference requires the `deepspeed.initialize()` API with a ZeRO stage-3 offload config, rather than `deepspeed.init_inference()`, which is a separate inference path.
You might find the ZeRO-inference script for BLOOM-176B useful: bloom-ds-zero-inference.py
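A minimal sketch of that pattern follows, assuming a single GPU and a Hugging Face GPT-J checkpoint; the model name and config values here are illustrative, not taken from the linked script:

```python
# ZeRO-Inference sketch: deepspeed.initialize() with a ZeRO stage-3
# CPU-offload config. Run via the DeepSpeed launcher so the distributed
# environment is set up, e.g.: deepspeed --num_gpus 1 this_script.py
# (filename hypothetical).
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required by DeepSpeed, unused for inference
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# Note: deepspeed.initialize(), not deepspeed.init_inference().
engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
engine.module.eval()

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(
    torch.cuda.current_device()
)
with torch.no_grad():
    output = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```

With this setup, parameters live in CPU memory and the ZeRO-3 hooks stream each layer to the GPU as the forward pass reaches it.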
@joaopcm1996 I confirmed that by following the steps indicated by @tjruwase and implementing an inference script similar to the Hugging Face BLOOM example, I can use ZeRO-Inference. Please let us know if the issue is resolved and whether we can close it.
@joaopcm1996, please re-open if this is still an issue.
@lokoppakmsft thanks a lot for testing it, and sorry for the delayed response. I will test it myself; if there is any issue I will refer back here, as @jeffra suggested.
The command to run this code is
And I get the log. The e2e generation time is about 2000 seconds. This is too slow.
Describe the bug
Passing ZeRO stage-3 CPU offload parameters to the `args` parameter of `.init_inference()` does not have any effect: the full model still consumes all GPU memory and throws an error. Following the blog post on ZeRO-Inference, I tried to load a GPT-J model with DeepSpeed Inference and got a CUDA OOM error.
How should I pass the DeepSpeed config parameters to the `init_inference()` method? Or should I just use `.initialize()` even when using the model for inference?
To Reproduce
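The reproduction script itself is not preserved in this thread; a hypothetical reconstruction of the failing pattern described above, with the model name and config values assumed, might look like:

```python
# Hypothetical reconstruction of the failing call; the original repro
# script is not shown in this thread. The ZeRO offload settings passed
# via `args` are ignored by init_inference(), so the whole fp16 model
# is placed on GPU 0, triggering the CUDA OOM below.
import deepspeed
import torch
from transformers import AutoModelForCausalLM

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
)

# init_inference() does not implement ZeRO-Inference offloading, so the
# config above has no effect here.
engine = deepspeed.init_inference(model, args=ds_config, dtype=torch.float16)
```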
Error:

```
RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 14.56 GiB total capacity; 13.52 GiB already allocated; 52.44 MiB free; 13.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
Expected behavior
Only the first layer of the model is loaded onto the GPU, with subsequent layers loaded dynamically as they are needed for inference.
ds_report output
System info (please complete the following information):