CPU OOM when running inference on Llama3-70B-Chinese-Chat #904

Open
GORGEOUSLCX opened this issue May 20, 2024 · 0 comments

@GORGEOUSLCX

Code: text-generation demo
Command:
deepspeed --num_gpus 2 inference-test.py --dtype float16 --batch_size 4 --max_new_tokens 200 --model ../Llama3-70B-Chinese-Chat
Hardware: two A100 80GB GPUs, 250GB of CPU RAM
Problem: When DeepSpeed loads the float16 model it consumes far too much CPU memory, and 250GB is not enough to load the 70B model. When I instead load the model with Transformers alone, `model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")`, inference runs without exhausting CPU memory.
How can I reduce the CPU memory usage?
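
For reference, below is a minimal sketch of the meta-tensor loading path that DeepSpeed documents for large checkpoints, not the demo's exact code. It assumes the installed DeepSpeed version supports kernel injection for this Llama variant and that a `checkpoints.json` listing the local shard files has been written by hand; if the text-generation demo exposes an equivalent meta-tensor/checkpoint option in its arguments, that would be the simpler route.

```python
# Hedged sketch: build the model on the "meta" device so no rank materializes
# the full fp16 checkpoint in host RAM, then let deepspeed.init_inference()
# load the shards into the tensor-parallel engine. Paths and the
# checkpoints.json layout are assumptions that must match the local
# Llama3-70B-Chinese-Chat files.
import os
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "../Llama3-70B-Chinese-Chat"      # assumed local HF checkpoint dir
world_size = int(os.getenv("WORLD_SIZE", "1"))

config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# No weights are allocated here: the parameters live on the meta device.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# DeepSpeed streams the checkpoint shards straight into the sharded GPU
# engine instead of loading a full CPU copy of the model on every rank.
engine = deepspeed.init_inference(
    model,
    mp_size=world_size,                # tensor parallel across the two A100s
    dtype=torch.float16,
    checkpoint="checkpoints.json",     # assumed hand-written list of shard files
    replace_with_kernel_inject=True,   # assumes kernel injection covers this model
)
model = engine.module
```

If meta-tensor loading is not available for this model, passing `low_cpu_mem_usage=True` to `from_pretrained` before wrapping the model with `init_inference` at least avoids the extra full copy created during loading, although each rank still holds one full fp16 copy in host RAM.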
