[Bug] dequantize_row_q4_0 segfaults #791

Closed
sha0coder opened this issue Apr 5, 2023 · 5 comments

@sha0coder

Environment and Context

Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
g++ (Debian 10.2.1-6) 10.2.1 20210110
GNU Make 4.3

Failure Information (for bugs)

main segfaults at dequantize_row_q4_0+48

Steps to Reproduce

./main -m models/ggml-vocab-q4_0.bin

~/s/llama.cpp ❯❯❯ gdb main
(gdb) r -m models/ggml-vocab-q4_0.bin
Starting program: /home/sha0/soft/llama.cpp/main -m models/ggml-vocab-q4_0.bin
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
main: seed = 1680724006
llama_model_load: loading model from 'models/ggml-vocab-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 0.41 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 1792.49 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from 'models/ggml-vocab-q4_0.bin'
llama_model_load: model size = 0.00 MB / num tensors = 0
llama_model_load: WARN no tensors loaded from model file - assuming empty model for testing
llama_init_from_file: kv self size = 256.00 MB

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0

[New Thread 0x7fff77560700 (LWP 142639)]
[New Thread 0x7fff76d5f700 (LWP 142640)]
[New Thread 0x7fff7655e700 (LWP 142641)]
[New Thread 0x7fff75d5d700 (LWP 142642)]
[New Thread 0x7fff7555c700 (LWP 142643)]
[New Thread 0x7fff74d5b700 (LWP 142644)]
[New Thread 0x7fff7455a700 (LWP 142645)]
[New Thread 0x7fff73d59700 (LWP 142646)]
[New Thread 0x7fff73558700 (LWP 142647)]
[New Thread 0x7fff72d57700 (LWP 142648)]
[New Thread 0x7fff72556700 (LWP 142649)]
[New Thread 0x7fff71d55700 (LWP 142650)]
[New Thread 0x7fff71554700 (LWP 142651)]
[New Thread 0x7fff70d53700 (LWP 142652)]
[New Thread 0x7fff70552700 (LWP 142653)]

Thread 1 "main" received signal SIGSEGV, Segmentation fault.
0x000055555555e430 in dequantize_row_q4_0 ()
(gdb) bt
#0 0x000055555555e430 in dequantize_row_q4_0 ()
#1 0x0000555555567585 in ggml_compute_forward_get_rows ()
#2 0x000055555556fba3 in ggml_graph_compute ()
#3 0x0000555555578eca in llama_eval_internal(llama_context&, int const*, int, int, int) ()
#4 0x000055555557919f in llama_eval ()
#5 0x000055555555c1aa in main ()
(gdb) x/i $pc
=> 0x55555555e430 <dequantize_row_q4_0+48>: vpmovzxbw 0x4(%rdi),%ymm1
(gdb) i r rdi
rdi 0xa00 2560
(gdb) i r ymm1
ymm1 {v16_bfloat16 = {0x180, 0x0, 0x0, 0x0, 0x180, 0x0 <repeats 11 times>}, v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0xc0, 0x43, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0, 0x43, 0x0 <repeats 22 times>}, v16_int16 = {0x43c0, 0x0, 0x0, 0x0, 0x43c0, 0x0 <repeats 11 times>}, v8_int32 = {0x43c0, 0x0, 0x43c0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0x43c0, 0x43c0, 0x0, 0x0}, v2_int128 = {0x43c000000000000043c0, 0x0}}
(gdb)

@sha0coder (Author)

With debug symbols, the crash is at:

Thread 1 "main" received signal SIGSEGV, Segmentation fault.
dequantize_row_q4_0 (vx=, y=, k=k@entry=4096) at ggml.c:987
987 __m256i vx8 = bytesFromNibbles(pp+l/2);

And without AVX2, the crash is here:
Thread 1 "main" received signal SIGSEGV, Segmentation fault.
dequantize_row_q4_0 (vx=, y=, k=k@entry=4096) at ggml.c:1067
1067 const float d = x[i].d;
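
For context, here is a minimal sketch of the scalar q4_0 dequantization path. The struct layout and QK value follow the q4_0 code in ggml.c from around this time, but treat the exact names and values as assumptions; the point is that with a bogus vx (here rdi = 0xa00), the very first read of x[i].d is the faulting access:

```c
/* Minimal sketch of scalar q4_0 dequantization (layout per ggml.c circa
 * April 2023; exact struct name and QK value are assumptions). */
#include <stdint.h>

#define QK 32

typedef struct {
    float   d;           /* block scale factor */
    uint8_t qs[QK / 2];  /* 32 quants, 4 bits each, packed two per byte */
} block_q4_0;

static void dequantize_row_q4_0_ref(const void *vx, float *y, int k) {
    const block_q4_0 *x = vx;   /* if vx is garbage (e.g. 0xa00)... */
    const int nb = k / QK;
    for (int i = 0; i < nb; i++) {
        const float d = x[i].d;            /* ...this read is the SIGSEGV site (ggml.c:1067) */
        const uint8_t *pp = x[i].qs;
        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = pp[l / 2];  /* the AVX2 path faults on this same region */
            y[i * QK + l + 0] = ((vi & 0x0F) - 8) * d;
            y[i * QK + l + 1] = ((vi >> 4) - 8) * d;
        }
    }
}
```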

@sha0coder (Author)

The vx pointer has a bogus value:

(screenshot showing the bad vx value)

@slaren (Member) commented Apr 5, 2023

llama_model_load: WARN no tensors loaded from model file - assuming empty model for testing

You cannot eval with a vocab-only model.
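
In other words, the vocab-only file contributes zero tensors, so the weights that the eval path dereferences were never allocated. A hypothetical guard along these lines would turn the crash into a clean error (the counter name and message are illustrative, not llama.cpp's actual code):

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical guard (illustrative only, not actual llama.cpp code):
 * refuse to eval a model whose file contributed no tensor data, instead
 * of letting dequantize_row_q4_0 chase an uninitialized pointer. */
static bool model_can_eval(int n_tensors_loaded) {
    if (n_tensors_loaded == 0) {
        fprintf(stderr, "eval: no tensors loaded (vocab-only model?)\n");
        return false;
    }
    return true;
}
```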

@sha0coder (Author)

Where can I get a proper model?

@slaren (Member) commented Apr 5, 2023

I cannot help you with that, but there are some details in the official repository: https://github.com/facebookresearch/llama/
