convert llama-7B Failed to allocate memory for requested buffer of size 90177536 #1060

Closed · Labels: bug (Something isn't working)

nkjulia opened this issue May 18, 2023 · 14 comments

nkjulia commented May 18, 2023

System Info

GPU: 2x V100
transformers             4.30.0.dev0
optimum                  1.8.5
onnx                     1.13.1
onnxruntime              1.14.1
onnxruntime-gpu          1.14.1

optimum-cli export onnx --model /data/yahma-llama-7b-hf/ --task causal-lm-with-past --fp16 --for-ort --device cuda llama-onnx

============= Diagnostic Run torch.onnx.export version 2.0.0+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Saving external data to one file...
2023-05-18 09:46:47.397498088 [W:onnxruntime:, session_state.cc:1136 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-05-18 09:46:47.397535341 [W:onnxruntime:, session_state.cc:1138 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2023-05-18 09:46:49.687956512 [E:onnxruntime:, inference_session.cc:1532 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 90177536

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

optimum-cli export onnx --model /data/yahma-llama-7b-hf/ --task causal-lm-with-past --fp16 --for-ort --device cuda llama-onnx
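For reference, the same export can also be launched from Python through optimum's exporter entry point; a minimal sketch, assuming main_export on this optimum version accepts keyword arguments mirroring the CLI flags used above:

```python
# Rough Python equivalent of the optimum-cli command above (assumption: the
# fp16/device/task keyword arguments match the CLI flags on optimum 1.8.x).
from optimum.exporters.onnx import main_export

main_export(
    model_name_or_path="/data/yahma-llama-7b-hf/",  # local model path from this report
    output="llama-onnx",                            # destination directory
    task="causal-lm-with-past",                     # decoder export with KV cache
    fp16=True,                                      # same as --fp16
    device="cuda",                                  # run the export on GPU
)
```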

Expected behavior

Convert llama to ONNX successfully.

nkjulia added the bug label May 18, 2023

xenova commented May 20, 2023

How much RAM do you have? May be related to #1012


nkjulia commented May 22, 2023

> How much RAM do you have? May be related to #1012

77 GB in total.

michaelbenayoun commented:

Hi @nkjulia,
Are you able to perform the export without the --fp16 flag?


nkjulia commented May 26, 2023

> Hi @nkjulia, Are you able to perform the export without the --fp16 flag?

It still fails. Any suggestions?

tuple(x.clone(memory_format=torch.preserve_format) for x in args)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 31.75 GiB total capacity; 30.99 GiB already allocated; 48.75 MiB free; 30.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
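(Side note: the allocator hint mentioned at the end of this traceback is controlled by the PYTORCH_CUDA_ALLOC_CONF environment variable, which must be set before PyTorch makes its first CUDA allocation. A minimal sketch, where the 128 MiB split size is only an illustrative value, not a recommendation:)

```python
import os

# Must be set before the first CUDA allocation in the process;
# 128 MiB is only an illustrative value for max_split_size_mb.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Any subsequent CUDA allocation goes through the caching allocator configured above.
x = torch.zeros(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())
```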


thedogb commented Jun 13, 2023

Same here.
I fixed it by adding --no-post-process:

optimum-cli export onnx --task 'causal-lm-with-past' --model ./llama-7b --no-post-process  llama-7b-onnx 

I ran it in CPU mode with 80 GB of memory. You can try that in your case.
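As a follow-up check, the resulting export directory can be smoke-tested with optimum's ONNX Runtime wrapper. A sketch, assuming the tokenizer files were saved alongside the ONNX files (which optimum-cli export normally does) and that llama-7b-onnx is the output directory from the command above:

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

export_dir = "llama-7b-onnx"  # directory produced by the optimum-cli command above

tokenizer = AutoTokenizer.from_pretrained(export_dir)
# use_cache=True loads the decoder variant that reuses past key values.
model = ORTModelForCausalLM.from_pretrained(export_dir, use_cache=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```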

michaelbenayoun commented:

We might need to work on making the post-processing step lighter. Thanks for figuring this out, @thedogb.

Fire-Hound commented:

I am facing the same OOM problem on an 80 GB A100.
What's the relevance of post-processing? Will it affect how the model is exported? @michaelbenayoun
Sorry, I am new to ONNX and trying to figure things out.


fxmarty commented Jun 15, 2023

@Fire-Hound What error do you get? Could you copy the traceback/log? A simple Killed or an OOM? I cannot reproduce on an A100-80GB: CUDA_VISIBLE_DEVICES=0 optimum-cli export onnx --model huggingface/llama-7b --fp16 --device cuda llama_7b_onnx runs fine on optimum main + ort 1.15.0 + pytorch 2.0.1.

@thedogb @nkjulia If you are able to provide a log, that would help with the fix as well.


fxmarty commented Jun 15, 2023

What I find for llama-7b with CUDA_VISIBLE_DEVICES=0 optimum-cli export onnx --model huggingface/llama-7b --fp16 --device cuda llama_7b_onnx is:

  • The export itself uses at most 20 GiB of RAM and 45 GiB of GPU memory.
  • The post-processing uses at most 33.7 GiB of RAM, which is arguably high, and still has ~28 GiB of GPU memory allocated, which maybe should not be the case.
  • The validation uses at most ~62 GiB of GPU memory, which is arguably high, likely because of ORT.

On main (as of 0f2bd69), there is no OOM due to GPU memory on an A100-80GB when exporting llama-7b in fp16 on the cuda device. The OOM may rather be due to RAM.

I'll try to reduce that.


fxmarty commented Jun 15, 2023

I suspect pytorch/pytorch#101134 & pytorch/pytorch#101148 will help reduce GPU memory usage. Note that the numbers below are not peak memory, and that RAM is also used by other processes.

On PyTorch 2.0.1, the torch.onnx.export call doubles the GPU memory used, which is not the case on nightly.

On pytorch 2.0.1:

----------------- After loading models_and_onnx_configs, before export
RAM: 41925.78 MB
GPU mem: 14623.44 MB
----------------- Just before onnx_export call
RAM: 41925.62 MB
GPU mem: 14623.44 MB
----------------- Just after onnx_export call
RAM: 44580.44 MB
GPU mem: 28523.36 MB
----------------- Just after save external data call
RAM: 44479.48 MB
GPU mem: 28523.36 MB
----------------- Just before onnx_export call
RAM: 44878.93 MB
GPU mem: 28933.36 MB
----------------- Just after onnx_export call
RAM: 46568.39 MB
GPU mem: 28969.01 MB
----------------- Just after save external data call
RAM: 46510.75 MB
GPU mem: 28969.01 MB
----------------- After export, before post-process
RAM: 46503.83 MB
GPU mem: 28969.01 MB
----------------- After post-process
RAM: 48137.57 MB
GPU mem: 28969.01 MB
----------------- After validation
RAM: 49346.93 MB
GPU mem: 65604.16 MB

On 2.1.0.dev20230614+cu118:

----------------- After loading models_and_onnx_configs, before export
RAM: 47260.00 MB
GPU mem: 14772.34 MB
----------------- Just before onnx_export call
RAM: 47256.74 MB
GPU mem: 14772.34 MB
----------------- Just after onnx_export call
RAM: 49899.02 MB
GPU mem: 15577.65 MB
----------------- Just after save external data call
RAM: 44754.64 MB
GPU mem: 15577.65 MB
----------------- Just before onnx_export call
RAM: 45164.67 MB
GPU mem: 15985.54 MB
----------------- Just after onnx_export call
RAM: 45131.51 MB
GPU mem: 16006.51 MB
----------------- Just after save external data call
RAM: 45104.55 MB
GPU mem: 16006.51 MB
----------------- After export, before post-process
RAM: 45131.49 MB
GPU mem: 16006.51 MB
----------------- After post-process
RAM: 46792.34 MB
GPU mem: 16006.51 MB
----------------- After validation
RAM: 47888.27 MB
GPU mem: 52658.44 MB
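(Numbers like the ones above can be collected with a small helper around psutil and torch.cuda. The sketch below only illustrates the kind of instrumentation involved; it is not the exact code used for these measurements.)

```python
import psutil
import torch

def log_memory(tag: str) -> None:
    # Resident set size of the current process, in MB.
    ram_mb = psutil.Process().memory_info().rss / 1024**2
    # Memory currently held on the default CUDA device by PyTorch tensors, in MB.
    gpu_mb = torch.cuda.memory_allocated() / 1024**2
    print(f"----------------- {tag}")
    print(f"RAM: {ram_mb:.2f} MB")
    print(f"GPU mem: {gpu_mb:.2f} MB")

# Example usage around an export step:
# log_memory("Just before onnx_export call")
# ... torch.onnx.export(...) ...
# log_memory("Just after onnx_export call")
```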

Fire-Hound commented:

Sorry for the confusion, I am trying it on a 33B model. Do you think 2x A100 80 GB will have enough VRAM for the export?

Fire-Hound commented:

> @Fire-Hound What error do you get? Could you copy the traceback/log? A simple Killed or an OOM? I cannot reproduce on an A100-80GB: CUDA_VISIBLE_DEVICES=0 optimum-cli export onnx --model huggingface/llama-7b --fp16 --device cuda llama_7b_onnx runs fine on optimum main + ort 1.15.0 + pytorch 2.0.1.
>
> @thedogb @nkjulia If you are able to provide a log, that would help with the fix as well.

I can rerun it today and provide the stack trace. I get a CUDA out-of-memory exception, the one you get when you try to allocate more memory than is available.


fxmarty commented Jun 15, 2023

Oh I see! Can you run generation in PyTorch with a small context (a few tokens), batch_size = 2, and min_new_tokens=max_new_tokens=2? If so, the export may succeed on PyTorch nightlies. llama-30b is 65 GB in fp16, so it's already huge.
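For example, a minimal sanity check along these lines (the checkpoint path and prompt are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/llama-30b"  # placeholder for the local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

# batch_size = 2, a short context, and exactly 2 generated tokens.
inputs = tokenizer(["Hello world", "Hello world"], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, min_new_tokens=2, max_new_tokens=2)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```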

Also, I wouldn't have high hopes of managing to run llama-30b with ONNX Runtime on CUDAExecutionProvider, which is unfortunately most often more memory intensive than PyTorch.

> What's the relevance of post-processing? Will it affect how the model is exported? @michaelbenayoun

I am a bit surprised that the OOM happens during the post-processing though. If there's one, I would expect it earlier, directly during the torch.onnx.export call.

The post-processing basically inserts an If node into the ONNX graph to dispatch between two branches depending on whether it is the first pass in the decoder or not (if not, past key values are reused). This allows using a single decoder_model_merged.onnx, instead of two files, decoder_model.onnx and decoder_with_past_model.onnx, which used to duplicate memory.
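To make that concrete, the sketch below builds a toy If node with the onnx helper API. The branches just return different constants, whereas in the real decoder_model_merged.onnx they are the full without-past and with-past subgraphs, dispatched on a boolean input (named use_cache_branch in optimum's merged models, if I recall correctly):

```python
import onnx
from onnx import TensorProto, helper

# Each branch is a tiny subgraph producing an output named "out"; in the real
# merged decoder these would be the with-past and without-past decoder subgraphs.
then_graph = helper.make_graph(
    nodes=[helper.make_node("Constant", inputs=[], outputs=["out"],
                            value=helper.make_tensor("c1", TensorProto.FLOAT, [1], [1.0]))],
    name="with_past_branch", inputs=[],
    outputs=[helper.make_tensor_value_info("out", TensorProto.FLOAT, [1])],
)
else_graph = helper.make_graph(
    nodes=[helper.make_node("Constant", inputs=[], outputs=["out"],
                            value=helper.make_tensor("c0", TensorProto.FLOAT, [1], [0.0]))],
    name="no_past_branch", inputs=[],
    outputs=[helper.make_tensor_value_info("out", TensorProto.FLOAT, [1])],
)

# Boolean graph input playing the role of the dispatch flag.
cond = helper.make_tensor_value_info("use_cache_branch", TensorProto.BOOL, [1])
out = helper.make_tensor_value_info("out", TensorProto.FLOAT, [1])
if_node = helper.make_node("If", inputs=["use_cache_branch"], outputs=["out"],
                           then_branch=then_graph, else_branch=else_graph)

model = helper.make_model(helper.make_graph([if_node], "merged_decoder_toy", [cond], [out]))
onnx.checker.check_model(model)
```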


fxmarty commented Jun 19, 2023

Closing, as the GPU memory issue is a PyTorch one, not an Optimum one: pytorch/pytorch#101134 & pytorch/pytorch#101148

You can use the PyTorch nightly until 2.1, which fixes the issue, is released. Following #1115, #1112 and #1111, the GPU memory usage should not exceed that of torch.onnx.export, so I think there is not much more we can do on our side. We will probably do a release soon that includes those fixes.

The post-processing (as described above) runs on the CPU (and effectively uses more than 2x the model size in RAM), so if a GPU OOM arises, it is not at fault. Feel free to reopen an issue if this step is a concern.

fxmarty closed this as completed Jun 19, 2023