65B model eventually fails with "ggml_new_tensor_impl: not enough space in the scratch memory" #1152
When the context swap occurs and it has to re-evaluate the second half of the context (i.e. 1024 tokens at once with n_ctx = 2048), the 512 MB scratch buffers are too small for the 65B model. The solution is:
diff --git a/llama.cpp b/llama.cpp
index 8c1d657..e860ea1 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -54,7 +54,7 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
         { MODEL_7B,    512ull * MB },
         { MODEL_13B,   512ull * MB },
         { MODEL_30B,   512ull * MB },
-        { MODEL_65B,   512ull * MB },
+        { MODEL_65B,  2048ull * MB },
     };
     return _MEM_REQ_SCRATCH0;
 }
@@ -65,7 +65,7 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH1()
         { MODEL_7B,    512ull * MB },
         { MODEL_13B,   512ull * MB },
         { MODEL_30B,   512ull * MB },
-        { MODEL_65B,   512ull * MB },
+        { MODEL_65B,  2048ull * MB },
     };
     return _MEM_REQ_SCRATCH1;
 }
@@ -1290,7 +1290,7 @@ static bool llama_eval_internal(
         mem_per_token = ggml_used_mem(ctx0)/N;
     }

-#if 0
+#if 1
     printf("\n%s: used_mem = %.3f MB, scratch -- %.3f MB %.3f MB\n", __func__,
             ggml_used_mem(ctx0)/1024.0/1024.0,
             lctx.get_buf_max_mem(0)/1024.0/1024.0,
It's a very sloppy process for determining the necessary scratch buffer size; I will try to improve this in the future.

P.S. I just bumped the buffers to
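For reference, the context swap mentioned above happens in the interactive example's main loop. Below is a condensed sketch of that logic (adapted from the example code of this era, not a verbatim copy; variable names follow the example):

// When the context fills up, keep the first n_keep tokens, drop half of
// the rest, and re-insert the remaining half in front of the pending
// tokens so it gets re-evaluated as one large batch. With n_keep = 0 and
// n_ctx = 2048, that batch is n_left/2 = 1024 tokens, and evaluating it
// is what blows through the 512 MB scratch buffers.
if (n_past + (int) embd.size() > n_ctx) {
    const int n_left = n_past - params.n_keep;

    n_past = params.n_keep;

    // re-feed the second half of the previous context
    embd.insert(embd.begin(),
                last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(),
                last_n_tokens.end() - embd.size());
}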
Thanks! Is there some way I can generate a prompt of exactly 1024 tokens? E.g. maybe some character sequence that I could repeat 1024 times?
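One way to do this (a sketch, untested; the model path is a placeholder) is to grow a filler string and count tokens with llama_tokenize() until the count reaches 1024:

// Sketch: build a prompt of exactly 1024 tokens by repeating a short
// filler string and verifying the count with llama_tokenize().
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;

    // placeholder path -- point this at your actual model file
    llama_context * ctx = llama_init_from_file("models/65B/ggml-model-q4_1.bin", params);
    if (ctx == nullptr) {
        return 1;
    }

    const int n_target = 1024;
    std::vector<llama_token> toks(n_target + 16);
    std::string prompt;
    int n = 0;

    // " the" is normally a single token in the LLaMA vocab, so the count
    // should grow by one per repetition (plus the leading BOS token).
    while (n < n_target) {
        prompt += " the";
        n = llama_tokenize(ctx, prompt.c_str(), toks.data(), (int) toks.size(), /*add_bos=*/true);
    }
    printf("prompt is %d tokens long\n", n);

    llama_free(ctx);
    return 0;
}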
I'm running the 65B model on a machine with 256 GB of (CPU) RAM, with the context size set to 2048. The same thing happens with both llama65b and alpaca65b, every single time I run it in interactive mode: it works fine for a while, but eventually fails with:
ggml_new_tensor_impl: not enough space in the scratch memory
Segmentation fault (core dumped)
Maybe it's using up more and more RAM over time, until it runs out?
The exact params:
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 146.86 KB
llama_model_load_internal: mem required = 41477.67 MB (+ 5120.00 MB per state)
llama_init_from_file: kv self size = 5120.00 MB
system_info: n_threads = 16 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0
main: interactive mode on.
sampling: temp = 1.000000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 2048, n_batch = 8, n_predict = -1, n_keep = 0
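Note the n_keep = 0 here: plugging these settings into the swap logic above gives n_left = n_past - n_keep ≈ 2048, so n_left/2 = (2048 - 0)/2 = 1024 tokens get re-evaluated in a single batch, which matches exactly where the scratch overflow hits.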