
cuBLAS: fall back to pageable memory if pinned alloc fails #1233


Merged
merged 2 commits on May 1, 2023

Conversation

slaren (Member) commented Apr 29, 2023

Fixes #1230

Additionally, adds an environment variable GGML_CUDA_NO_PINNED that can be set to disable all pinned memory usage, which fixes #1231
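For readers following along, the fallback amounts to something like the sketch below: try cudaMallocHost first, and if that fails (or GGML_CUDA_NO_PINNED is set) clear the CUDA error state and fall back to a plain malloc, remembering which allocator was used so the matching free can be called later. This is a minimal illustration of the approach, not the exact code in ggml-cuda.cu; the helper names host_malloc/host_free and the is_pinned flag are made up for the example.

// Minimal sketch of the fallback (illustrative only; not the exact ggml-cuda.cu code).
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Allocate host memory, preferring pinned (page-locked) memory for faster
// host<->device transfers. Fall back to ordinary pageable memory when the
// pinned allocation fails or GGML_CUDA_NO_PINNED is set in the environment.
static void * host_malloc(size_t size, int * is_pinned) {
    *is_pinned = 0;
    if (getenv("GGML_CUDA_NO_PINNED") == NULL) {
        void * ptr = NULL;
        cudaError_t err = cudaMallocHost(&ptr, size);
        if (err == cudaSuccess) {
            *is_pinned = 1;
            return ptr;
        }
        // Reset the sticky CUDA error, warn, and continue with pageable memory.
        cudaGetLastError();
        fprintf(stderr, "WARNING: failed to allocate %.2f MB of pinned memory: %s\n",
                size / 1024.0 / 1024.0, cudaGetErrorString(err));
    }
    return malloc(size);
}

static void host_free(void * ptr, int is_pinned) {
    if (is_pinned) {
        cudaFreeHost(ptr); // pinned memory must be released with cudaFreeHost
    } else {
        free(ptr);         // pageable memory came from plain malloc
    }
}

With the environment variable, a run like the one below should skip pinned memory entirely (any value should work, assuming the variable is only checked for presence as in the sketch above):

GGML_CUDA_NO_PINNED=1 ./main -m /media/ggml-vic13b-q5_0.bin -b 512 -t 12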

@Priestru

Sure, give me a minute

Priestru commented Apr 29, 2023

Yes, you are a wizard. It at least makes it fail-proof. Still, I wonder what my problem is and am trying to figure it out:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# ./main -m /media/ggml-vic13b-q5_0.bin -b 512 -t 12
main: seed = 1682785944
llama.cpp: loading model from /media/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB
WARNING: failed to allocate 1024.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory

system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

Development

Successfully merging this pull request may close these issues.

System freeze when compiled with cublast
WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found)
3 participants