65B model eventually fails with "ggml_new_tensor_impl: not enough space in the scratch memory" #1152
When the context swap occurs and it has to re-evaluate the second half of the context (i.e. 1024 tokens at once with n_ctx = 2048), the 512 MB scratch buffers are too small for the 65B model. The solution is:
diff --git a/llama.cpp b/llama.cpp
index 8c1d657..e860ea1 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -54,7 +54,7 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
         { MODEL_7B,    512ull * MB },
         { MODEL_13B,   512ull * MB },
         { MODEL_30B,   512ull * MB },
-        { MODEL_65B,   512ull * MB },
+        { MODEL_65B,  2048ull * MB },
     };
     return _MEM_REQ_SCRATCH0;
 }
@@ -65,7 +65,7 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH1()
         { MODEL_7B,    512ull * MB },
         { MODEL_13B,   512ull * MB },
         { MODEL_30B,   512ull * MB },
-        { MODEL_65B,   512ull * MB },
+        { MODEL_65B,  2048ull * MB },
     };
     return _MEM_REQ_SCRATCH1;
 }
@@ -1290,7 +1290,7 @@ static bool llama_eval_internal(
         mem_per_token = ggml_used_mem(ctx0)/N;
     }

-#if 0
+#if 1
     printf("\n%s: used_mem = %.3f MB, scratch -- %.3f MB %.3f MB\n", __func__,
             ggml_used_mem(ctx0)/1024.0/1024.0,
             lctx.get_buf_max_mem(0)/1024.0/1024.0,
It's a very sloppy process for determining the necessary scratch buffer size; I will try to improve this in the future.

P.S. I just bumped the buffers to
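For reference, the context swap mentioned above happens in the interactive example's main loop. Below is a condensed sketch of that logic (adapted from the example code of this era, not a verbatim copy; variable names follow the example):

// When the context fills up, keep the first n_keep tokens, drop half of
// the rest, and re-insert the remaining half in front of the pending
// tokens so it gets re-evaluated as one large batch. With n_keep = 0 and
// n_ctx = 2048, that batch is n_left/2 = 1024 tokens, and evaluating it
// is what blows through the 512 MB scratch buffers.
if (n_past + (int) embd.size() > n_ctx) {
    const int n_left = n_past - params.n_keep;

    n_past = params.n_keep;

    // re-feed the second half of the previous context
    embd.insert(embd.begin(),
                last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(),
                last_n_tokens.end() - embd.size());
}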
Thanks! Is there some way I can generate a prompt of exactly 1024 tokens? E.g. maybe some character sequence that I could repeat 1024 times?
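One way to do this (a sketch, untested; the model path is a placeholder) is to grow a filler string and count tokens with llama_tokenize() until the count reaches 1024:

// Sketch: build a prompt of exactly 1024 tokens by repeating a short
// filler string and verifying the count with llama_tokenize().
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;

    // placeholder path -- point this at your actual model file
    llama_context * ctx = llama_init_from_file("models/65B/ggml-model-q4_1.bin", params);
    if (ctx == nullptr) {
        return 1;
    }

    const int n_target = 1024;
    std::vector<llama_token> toks(n_target + 16);
    std::string prompt;
    int n = 0;

    // " the" is normally a single token in the LLaMA vocab, so the count
    // should grow by one per repetition (plus the leading BOS token).
    while (n < n_target) {
        prompt += " the";
        n = llama_tokenize(ctx, prompt.c_str(), toks.data(), (int) toks.size(), /*add_bos=*/true);
    }
    printf("prompt is %d tokens long\n", n);

    llama_free(ctx);
    return 0;
}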
I'm running the 65B model on a machine with 256 GB of (CPU) RAM, with the context size set to 2048. The same thing happens with both llama65b and alpaca65b, every single time I run it in interactive mode: it works fine for a while, but eventually fails with:
ggml_new_tensor_impl: not enough space in the scratch memory
Segmentation fault (core dumped)
Maybe it's using up more and more RAM over time, until it runs out?
The exact params:
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 146.86 KB
llama_model_load_internal: mem required = 41477.67 MB (+ 5120.00 MB per state)
llama_init_from_file: kv self size = 5120.00 MB
system_info: n_threads = 16 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0
main: interactive mode on.
sampling: temp = 1.000000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 2048, n_batch = 8, n_predict = -1, n_keep = 0
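Note the n_keep = 0 here: plugging these settings into the swap logic above gives n_left = n_past - n_keep ≈ 2048, so n_left/2 = (2048 - 0)/2 = 1024 tokens get re-evaluated in a single batch, which matches exactly where the scratch overflow hits.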