Add checks for buffer size with Metal #1706

Merged: 1 commit into ggml-org:master on Jun 6, 2023

Conversation

@spencersutton (Contributor) commented Jun 5, 2023

If llama.cpp tries to allocate a Metal buffer bigger than the device maximum, it only prints a message that the allocation failed and then carries on, which results in the error 'ggml_metal_get_buffer: error: buffer is nil' being emitted endlessly during generation.

This PR adds a check against the maximum buffer size, checks for a false return value from ggml_metal_add_buffer generally, and propagates the error from llama_init_from_file by returning NULL.
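The essence of the change is a size check against the device limit before the Metal buffer is created, plus a caller that acts on the false return. A minimal sketch under assumptions, not the verbatim patch — the struct fields and the names data_ptr/data_size are illustrative:

```objc
// Sketch only: illustrative names, not the exact diff in this PR.
// Inside ggml_metal_add_buffer: refuse over-sized allocations up front
// instead of letting the Metal allocation fail and leave a nil buffer
// that later triggers 'ggml_metal_get_buffer: error: buffer is nil'.
if (size > ctx->device.maxBufferLength) {
    fprintf(stderr, "%s: buffer '%s' size %zu is larger than buffer maximum of %zu\n",
            __func__, name, size, (size_t) ctx->device.maxBufferLength);
    return false;
}

// Caller side in llama_init_from_file (sketch): propagate the failure
// by cleaning up and returning NULL rather than continuing.
if (!ggml_metal_add_buffer(ctx->ctx_metal, "data", data_ptr, data_size)) {
    fprintf(stderr, "%s: failed to add buffer\n", __func__);
    llama_free(ctx);
    return NULL;
}
```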

Existing behavior:

main: build = 622 (f4c55d3)
main: seed = 1
llama.cpp: loading model from /Users/spencer/ai/models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 17452.67 MB
llama_model_load_internal: mem required = 2532.68 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/spencer/ai/repos/llama.cpp/build/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x141e08850
ggml_metal_init: loaded kernel_mul 0x141e08e50
ggml_metal_init: loaded kernel_mul_row 0x141e09480
ggml_metal_init: loaded kernel_scale 0x141e099a0
ggml_metal_init: loaded kernel_silu 0x141e09ec0
ggml_metal_init: loaded kernel_relu 0x141e0a3e0
ggml_metal_init: loaded kernel_soft_max 0x141e0aa90
ggml_metal_init: loaded kernel_diag_mask_inf 0x141e0b0f0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x141e0b770
ggml_metal_init: loaded kernel_rms_norm 0x141e0be20
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x141e0c680
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x141e0d050
ggml_metal_init: loaded kernel_rope 0x141e0d940
ggml_metal_init: loaded kernel_cpy_f32_f16 0x141e0e1d0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x141e0ea60
ggml_metal_add_buffer: failed to allocate 'data ' buffer, size = 17452.67 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1280.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 782.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB

system_info: n_threads = 6 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0

USER: Write a one paragraph summary of what happened in 1918. ASSISTANT:Inggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
...

New behavior:

main: build = 622 (827fd74)
main: seed = 1
llama.cpp: loading model from /Users/spencer/ai/models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 17452.67 MB
llama_model_load_internal: mem required = 2532.68 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/spencer/ai/repos/llama.cpp/build/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x11de07b40
ggml_metal_init: loaded kernel_mul 0x11de08140
ggml_metal_init: loaded kernel_mul_row 0x11de08660
ggml_metal_init: loaded kernel_scale 0x11de08b80
ggml_metal_init: loaded kernel_silu 0x11de090a0
ggml_metal_init: loaded kernel_relu 0x11de095c0
ggml_metal_init: loaded kernel_soft_max 0x11de09c70
ggml_metal_init: loaded kernel_diag_mask_inf 0x11de0a2d0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x11de0a950
ggml_metal_init: loaded kernel_rms_norm 0x11de0b000
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x11de0b860
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x11de0c230
ggml_metal_init: loaded kernel_rope 0x11de0cb20
ggml_metal_init: loaded kernel_cpy_f32_f16 0x11de0d3b0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x11de0dc40
ggml_metal_add_buffer: buffer 'data' size 18300452864 is larger than buffer maximum of 17179869184
llama_init_from_file: failed to add buffer
llama_init_from_gpt_params: error: failed to load model '/Users/spencer/ai/models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin'
main: error: unable to load model

@ggerganov ggerganov merged commit 590250f into ggml-org:master Jun 6, 2023
@johnrtipton commented

Thanks for making this clearer; I was getting the nil error myself. It looks like we have very similar machines: I have an M1 Max with 32 GB of RAM. If I have 32 GB of RAM, why can't it fit the 18 GB model, and why is the buffer maximum 17 GB? Is there a way to run 30B models on my M1 Max with 32 GB? Will future changes allow me to use the full 32 GB?

@madhatter215 commented

@johnrtipton I believe MTLDevice.maxBufferLength always returns 1/2 the size of total RAM. It is queried in ggml-metal.m#L222. For example, on my Mac mini with a measly 8 GB of total RAM, the buffer maximum is 4294967296 bytes (4 GiB), which is exactly half of total RAM.
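If you want to see the limit on your own machine, a tiny standalone program can query it. A sketch (the filename query.m is hypothetical; compile with clang -framework Metal -framework Foundation query.m -o query):

```objc
// query.m: print the default Metal device's buffer size limit.
#import <Metal/Metal.h>
#include <stdio.h>

int main(void) {
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    if (device == nil) {
        fprintf(stderr, "no Metal device available\n");
        return 1;
    }
    printf("maxBufferLength: %llu bytes (%.1f GiB)\n",
           (unsigned long long) device.maxBufferLength,
           (double) device.maxBufferLength / (double) (1ULL << 30));
    return 0;
}
```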

@davidliudev commented

Interesting. So we can only utilise half of the system RAM for inference? Is this a system limit, or something we can optimize in the future?

@ggerganov (Member) commented

See the discussion in #1696 (comment) for a potential fix.

@dbl001 commented Jun 10, 2023

This is on a 27" iMac with 128 GB of RAM and an AMD Radeon Pro 5700 XT (16 GB), built with 'MPS' support (i.e. LLAMA_METAL=1 make).

 % ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1
main: build = 656 (303f580)
main: seed  = 1686415745
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/davidlaxer/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                         0x7fb0b1f05c50
ggml_metal_init: loaded kernel_mul                         0x7fb0b1f06b20
ggml_metal_init: loaded kernel_mul_row                     0x7fb0b1f078d0
ggml_metal_init: loaded kernel_scale                       0x7fb0b2807d30
ggml_metal_init: loaded kernel_silu                        0x7fb0b2808ae0
ggml_metal_init: loaded kernel_relu                        0x7fb0b2809890
ggml_metal_init: loaded kernel_gelu                        0x7fb082306030
ggml_metal_init: loaded kernel_soft_max                    0x7fb082306aa0
ggml_metal_init: loaded kernel_diag_mask_inf               0x7fb0b280a640
ggml_metal_init: loaded kernel_get_rows_f16                0x7fb0b1f08340
ggml_metal_init: loaded kernel_get_rows_q4_0               0x7fb0b1f090f0
ggml_metal_init: loaded kernel_get_rows_q4_1               0x7fb0b1f0a010
ggml_metal_init: loaded kernel_get_rows_q2_k               0x7fb0b1f0af40
ggml_metal_init: loaded kernel_get_rows_q4_k               0x7fb0823079d0
ggml_metal_init: loaded kernel_get_rows_q6_k               0x7fb082308900
ggml_metal_init: loaded kernel_rms_norm                    0x7fb0823096b0
ggml_metal_init: loaded kernel_mul_mat_f16_f32             0x7fb0b280b3f0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32            0x7fb08230a460
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32            0x7fb08230b210
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32            0x7fb08230c250
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32            0x7fb08230d000
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32            0x7fb08230ddb0
ggml_metal_init: loaded kernel_rope                        0x7fb08230ece0
ggml_metal_init: loaded kernel_cpy_f32_f16                 0x7fb08230fa90
ggml_metal_init: loaded kernel_cpy_f32_f32                 0x7fb0b280c1a0
ggml_metal_add_buffer: buffer 'data' size 3791728640 is larger than buffer maximum of 3758096384
llama_init_from_file: failed to add buffer
llama_init_from_gpt_params: error: failed to load model './models/7B/ggml-model-q4_0.bin'
main: error: unable to load model
