convert llama-7B Failed to allocate memory for requested buffer of size 90177536 #1060
Comments
How much RAM do you have? May be related to #1012
77G in total
Hi @nkjulia,
Still fails, any suggestions? The error shows up around `tuple(x.clone(memory_format=torch.preserve_format) for x in args)`
Same here. As a workaround, skipping the post-processing works: `optimum-cli export onnx --task 'causal-lm-with-past' --model ./llama-7b --no-post-process llama-7b-onnx`. I ran it in CPU mode with 80 GB of memory. You can try it in your case.
We might need to work on making the post-processing step lighter. Thanks for figuring this out @thedogb
I am facing the same OOM problem on an 80 GB A100.
@Fire-Hound What error do you get? Could you copy the traceback/log? @thedogb @nkjulia If you are able to provide a log, it would help to fix this as well.
What I find is that, for llama-7b, the export itself uses at most 20 GiB of RAM and 45 GiB of GPU memory. On main (as of 0f2bd69), there is no OOM due to GPU memory on an A100-80GB when exporting on a CUDA device in fp16 for llama-7b. The OOM may rather be due to RAM. I'll try to reduce that.
I suspect pytorch/pytorch#101134 & pytorch/pytorch#101148 to help reduce GPU memory usage. Note that in the figures below it is not peak memory, and that RAM is used by other processes as well.
On PyTorch 2.0.1: (memory usage figures omitted)
On 2.1.0.dev20230614+cu118: (memory usage figures omitted)
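For reference, a minimal sketch of how GPU memory can be polled while the export runs (this is not the exact tooling used for the numbers above; the model path and polling interval are illustrative):

```python
# Illustrative only: poll nvidia-smi in a background thread while optimum-cli runs.
import subprocess
import threading
import time

def poll_gpu_memory(stop_event, interval_s=1.0):
    # Query the currently used GPU memory (in MiB) once per interval.
    while not stop_event.is_set():
        used = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
        ).decode().strip()
        print(f"GPU memory used: {used} MiB")
        time.sleep(interval_s)

stop = threading.Event()
monitor = threading.Thread(target=poll_gpu_memory, args=(stop,), daemon=True)
monitor.start()

# Mirror the export command from the issue; the local path is hypothetical.
subprocess.run([
    "optimum-cli", "export", "onnx",
    "--model", "./llama-7b",
    "--task", "causal-lm-with-past",
    "--fp16", "--device", "cuda",
    "llama-7b-onnx",
])

stop.set()
monitor.join()
```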
Sorry for the confusion, I am trying it on a 33B model. Do you think 2x A100 80 GB will have enough VRAM for the export?
I can rerun it today and provide the stack trace. I get a CUDA out-of-memory exception, the one you get when you try to allocate more memory than is available.
Oh I see! Can you run generation in PyTorch with a small context (a few tokens), batch_size = 2, and min_new_tokens=max_new_tokens=2? If so, the export may succeed on PyTorch nightlies. llama-30b is 65 GB in fp16, so it's already huge. Also, I wouldn't have too high hopes of managing to run llama-30b with ONNX Runtime on CUDAExecutionProvider, which is unfortunately most often more memory intensive than PyTorch.
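A minimal sketch of such a sanity check (assuming a local `./llama-30b` checkpoint and enough GPU memory to load it in fp16; the path is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama-30b"  # hypothetical local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

# Small context (a few tokens) and batch_size = 2, as suggested above.
inputs = tokenizer(["Hello world", "Hello world"], return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, min_new_tokens=2, max_new_tokens=2)

print(tokenizer.batch_decode(out, skip_special_tokens=True))
```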
I am a bit surprised that the OOM happens during the post-processing though. If there's one, I would expect it earlier, directly during the export itself. The post-processing basically inserts an If node in the ONNX graph to dispatch on two branches depending on whether it is the first pass in the decoder or not (in which case past key values are reused). This allows using a single ONNX model for the decoder in both cases.
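To make the If-node dispatch concrete, here is a tiny self-contained sketch (a toy graph, not Optimum's actual merging code; the boolean input name and constant branches are only illustrative) of how such a node switches between two subgraphs:

```python
import onnx
from onnx import TensorProto, helper

# Branch 1: stands in for the "first pass" decoder (no past key values).
then_out = helper.make_tensor_value_info("then_out", TensorProto.FLOAT, [1])
then_node = helper.make_node(
    "Constant", inputs=[], outputs=["then_out"],
    value=helper.make_tensor("t", TensorProto.FLOAT, [1], [1.0]),
)
then_graph = helper.make_graph([then_node], "without_past", [], [then_out])

# Branch 2: stands in for the "with past key values" decoder.
else_out = helper.make_tensor_value_info("else_out", TensorProto.FLOAT, [1])
else_node = helper.make_node(
    "Constant", inputs=[], outputs=["else_out"],
    value=helper.make_tensor("e", TensorProto.FLOAT, [1], [2.0]),
)
else_graph = helper.make_graph([else_node], "with_past", [], [else_out])

# The If node dispatches on a boolean input; the name "use_cache_branch" is illustrative.
if_node = helper.make_node(
    "If", inputs=["use_cache_branch"], outputs=["out"],
    then_branch=then_graph, else_branch=else_graph,
)

cond = helper.make_tensor_value_info("use_cache_branch", TensorProto.BOOL, [])
out = helper.make_tensor_value_info("out", TensorProto.FLOAT, [1])
graph = helper.make_graph([if_node], "merged_decoder_toy", [cond], [out])
model = helper.make_model(graph)
onnx.checker.check_model(model)
```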
Closing, as the GPU memory issue is a PyTorch one, not Optimum: pytorch/pytorch#101134 & pytorch/pytorch#101148. You can try to use PyTorch nightly until 2.1, which fixes the issue, is released. Following #1115 #1112 #1111, the GPU memory usage should not exceed the one from PyTorch. The post-processing (as described above) is run on CPU (and effectively uses more than 2x the model size in RAM), so if a GPU OOM arises, it is not at fault. Feel free to reopen an issue if this step is a concern.
System Info
Who can help?
@michaelbenayoun
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
optimum-cli export onnx --model /data/yahma-llama-7b-hf/ --task causal-lm-with-past --fp16 --for-ort --device cuda llama-onnx
Expected behavior
Convert llama to ONNX successfully.