convert llama-7B Failed to allocate memory for requested buffer of size 90177536 #1060
Comments
How much RAM do you have? May be related to #1012
77G in total
Hi @nkjulia,
Still fails, any suggestions? The error shows up around `tuple(x.clone(memory_format=torch.preserve_format) for x in args)`
Same here. As a workaround, skipping the post-processing works: `optimum-cli export onnx --task 'causal-lm-with-past' --model ./llama-7b --no-post-process llama-7b-onnx`. I ran it in CPU mode with 80 GB of memory. You can try it in your case.
We might need to work on making the post-processing step lighter. Thanks for figuring this out @thedogb
I am facing the same OOM problem on an 80 GB A100.
@Fire-Hound What error do you get? Could you copy the traceback/log? @thedogb @nkjulia If you are able to provide a log, it would help to fix this as well.
What I find is that, for llama-7b, the export itself uses at most 20 GiB of RAM and 45 GiB of GPU memory. On main (as of 0f2bd69), there is no OOM due to GPU memory on an A100-80GB when exporting on a CUDA device in fp16 for llama-7b. The OOM may rather be due to RAM. I'll try to reduce that.
I suspect pytorch/pytorch#101134 & pytorch/pytorch#101148 to help reduce GPU memory usage. Note that in the figures below it is not peak memory, and that RAM is used by other processes as well.
On PyTorch 2.0.1: (memory usage figures omitted)
On 2.1.0.dev20230614+cu118: (memory usage figures omitted)
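For reference, a minimal sketch of how GPU memory can be polled while the export runs (this is not the exact tooling used for the numbers above; the model path and polling interval are illustrative):

```python
# Illustrative only: poll nvidia-smi in a background thread while optimum-cli runs.
import subprocess
import threading
import time

def poll_gpu_memory(stop_event, interval_s=1.0):
    # Query the currently used GPU memory (in MiB) once per interval.
    while not stop_event.is_set():
        used = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
        ).decode().strip()
        print(f"GPU memory used: {used} MiB")
        time.sleep(interval_s)

stop = threading.Event()
monitor = threading.Thread(target=poll_gpu_memory, args=(stop,), daemon=True)
monitor.start()

# Mirror the export command from the issue; the local path is hypothetical.
subprocess.run([
    "optimum-cli", "export", "onnx",
    "--model", "./llama-7b",
    "--task", "causal-lm-with-past",
    "--fp16", "--device", "cuda",
    "llama-7b-onnx",
])

stop.set()
monitor.join()
```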
Sorry for the confusion, I am trying it on a 33B model. Do you think 2x A100 80 GB will have enough VRAM for the export?
I can rerun it today and provide the stack trace. I get a CUDA out-of-memory exception, the one you get when you try to allocate more memory than is available.
Oh I see! Can you run generation in PyTorch with a small context (a few tokens), batch_size = 2, and min_new_tokens=max_new_tokens=2? If so, the export may succeed on PyTorch nightlies. llama-30b is 65 GB in fp16, so it's already huge. Also, I wouldn't have too high hopes of managing to run llama-30b with ONNX Runtime on CUDAExecutionProvider, which is unfortunately most often more memory intensive than PyTorch.
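A minimal sketch of such a sanity check (assuming a local `./llama-30b` checkpoint and enough GPU memory to load it in fp16; the path is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama-30b"  # hypothetical local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

# Small context (a few tokens) and batch_size = 2, as suggested above.
inputs = tokenizer(["Hello world", "Hello world"], return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, min_new_tokens=2, max_new_tokens=2)

print(tokenizer.batch_decode(out, skip_special_tokens=True))
```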
I am a bit surprised that the OOM happens during the post-processing though. If there's one, I would expect it earlier, directly during the export itself. The post-processing basically inserts an If node in the ONNX graph to dispatch on two branches depending on whether it is the first pass in the decoder or not (in which case past key values are reused). This allows using a single ONNX model for the decoder in both cases.
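To make the If-node dispatch concrete, here is a tiny self-contained sketch (a toy graph, not Optimum's actual merging code; the boolean input name and constant branches are only illustrative) of how such a node switches between two subgraphs:

```python
import onnx
from onnx import TensorProto, helper

# Branch 1: stands in for the "first pass" decoder (no past key values).
then_out = helper.make_tensor_value_info("then_out", TensorProto.FLOAT, [1])
then_node = helper.make_node(
    "Constant", inputs=[], outputs=["then_out"],
    value=helper.make_tensor("t", TensorProto.FLOAT, [1], [1.0]),
)
then_graph = helper.make_graph([then_node], "without_past", [], [then_out])

# Branch 2: stands in for the "with past key values" decoder.
else_out = helper.make_tensor_value_info("else_out", TensorProto.FLOAT, [1])
else_node = helper.make_node(
    "Constant", inputs=[], outputs=["else_out"],
    value=helper.make_tensor("e", TensorProto.FLOAT, [1], [2.0]),
)
else_graph = helper.make_graph([else_node], "with_past", [], [else_out])

# The If node dispatches on a boolean input; the name "use_cache_branch" is illustrative.
if_node = helper.make_node(
    "If", inputs=["use_cache_branch"], outputs=["out"],
    then_branch=then_graph, else_branch=else_graph,
)

cond = helper.make_tensor_value_info("use_cache_branch", TensorProto.BOOL, [])
out = helper.make_tensor_value_info("out", TensorProto.FLOAT, [1])
graph = helper.make_graph([if_node], "merged_decoder_toy", [cond], [out])
model = helper.make_model(graph)
onnx.checker.check_model(model)
```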
Closing, as the GPU memory issue is a PyTorch one, not Optimum: pytorch/pytorch#101134 & pytorch/pytorch#101148. You can try to use PyTorch nightly until 2.1, which fixes the issue, is released. Following #1115 #1112 #1111, the GPU memory usage should not exceed the one from PyTorch. The post-processing (as described above) is run on CPU (and effectively uses more than 2x the model size in RAM), so if a GPU OOM arises, it is not at fault. Feel free to reopen an issue if this step is a concern.
System Info
Who can help?
@michaelbenayoun
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
optimum-cli export onnx --model /data/yahma-llama-7b-hf/ --task causal-lm-with-past --fp16 --for-ort --device cuda llama-onnx
Expected behavior
Convert llama to ONNX successfully.