Add checks for buffer size with Metal #1706
Conversation
Thanks for making this clearer, I was getting the nil error myself. It looks like we have very similar machines: I have an M1 Max with 32 GB of RAM. My question is, if I have 32 GB of RAM, why can't it fit the 18 GB model, and why does it say the buffer maximum is 17 GB? Is there a way to run 30B models on my M1 Max with 32 GB? Will future changes allow me to use the full 32 GB?
@johnrtipton I believe MTLDevice.maxBufferLength always returns 1/2 the size of total RAM
Interesting. So we could only utilise half of the system RAM for inference?
See the discussion in #1696 (comment) for a potential fix
This is on an iMac 27" w/ 128 GB RAM and an AMD Radeon Pro 5700 XT (16 GB), built with 'MPS' (e.g. LLAMA_METAL=1 make).
If llama.cpp tries to allocate a Metal buffer larger than the device maximum, it only logs that the allocation failed. Inference then prints the error 'ggml_metal_get_buffer: error: buffer is nil' endlessly.
This PR adds a check against the maximum buffer size, checks for a false return value from ggml_metal_add_buffer in general, and propagates the failure out of llama_init_from_file by returning NULL.
Existing behavior:
main: build = 622 (f4c55d3)
main: seed = 1
llama.cpp: loading model from /Users/spencer/ai/models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 17452.67 MB
llama_model_load_internal: mem required = 2532.68 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/spencer/ai/repos/llama.cpp/build/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x141e08850
ggml_metal_init: loaded kernel_mul 0x141e08e50
ggml_metal_init: loaded kernel_mul_row 0x141e09480
ggml_metal_init: loaded kernel_scale 0x141e099a0
ggml_metal_init: loaded kernel_silu 0x141e09ec0
ggml_metal_init: loaded kernel_relu 0x141e0a3e0
ggml_metal_init: loaded kernel_soft_max 0x141e0aa90
ggml_metal_init: loaded kernel_diag_mask_inf 0x141e0b0f0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x141e0b770
ggml_metal_init: loaded kernel_rms_norm 0x141e0be20
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x141e0c680
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x141e0d050
ggml_metal_init: loaded kernel_rope 0x141e0d940
ggml_metal_init: loaded kernel_cpy_f32_f16 0x141e0e1d0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x141e0ea60
ggml_metal_add_buffer: failed to allocate 'data ' buffer, size = 17452.67 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1280.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 782.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB
system_info: n_threads = 6 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0
USER: Write a one paragraph summary of what happened in 1918. ASSISTANT:Inggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
...
New behavior:
main: build = 622 (827fd74)
main: seed = 1
llama.cpp: loading model from /Users/spencer/ai/models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 17452.67 MB
llama_model_load_internal: mem required = 2532.68 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/spencer/ai/repos/llama.cpp/build/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x11de07b40
ggml_metal_init: loaded kernel_mul 0x11de08140
ggml_metal_init: loaded kernel_mul_row 0x11de08660
ggml_metal_init: loaded kernel_scale 0x11de08b80
ggml_metal_init: loaded kernel_silu 0x11de090a0
ggml_metal_init: loaded kernel_relu 0x11de095c0
ggml_metal_init: loaded kernel_soft_max 0x11de09c70
ggml_metal_init: loaded kernel_diag_mask_inf 0x11de0a2d0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x11de0a950
ggml_metal_init: loaded kernel_rms_norm 0x11de0b000
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x11de0b860
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x11de0c230
ggml_metal_init: loaded kernel_rope 0x11de0cb20
ggml_metal_init: loaded kernel_cpy_f32_f16 0x11de0d3b0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x11de0dc40
ggml_metal_add_buffer: buffer 'data' size 18300452864 is larger than buffer maximum of 17179869184
llama_init_from_file: failed to add buffer
llama_init_from_gpt_params: error: failed to load model '/Users/spencer/ai/models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin'
main: error: unable to load model