Code: text-generation demo
Command:
deepspeed --num_gpus 2 inference-test.py --dtype float16 --batch_size 4 --max_new_tokens 200 --model ../Llama3-70B-Chinese-Chat
Hardware: two A100 80GB GPUs, 250 GB of CPU RAM
Problem: When DeepSpeed loads the float16 model, it consumes far too much CPU memory; even 250 GB of RAM is not enough to load the 70B model. When I instead load the model with plain Transformers, `model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")`, inference runs without exhausting CPU memory. How can I reduce DeepSpeed's CPU memory usage?
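For comparison, here is the Transformers-only loading path described above, cleaned up into a minimal runnable sketch. The prompt string is a placeholder, and the checkpoint path is assumed to be the local HF-format directory from the command line above:

```python
# Minimal sketch of the plain-Transformers path that stays within CPU memory.
# device_map="auto" implies low_cpu_mem_usage=True, so weights are streamed
# shard-by-shard onto the GPUs instead of being fully materialized in host RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "../Llama3-70B-Chinese-Chat"  # local checkpoint path from the command above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # spreads the 70B model across both A100s
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)  # placeholder prompt
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```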