Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

SYCL Hangs after ggml_backend_sycl_host_buffer_type #6943

Closed
jwhitehorn opened this issue Apr 27, 2024 · 2 comments
Closed

SYCL Hangs after ggml_backend_sycl_host_buffer_type #6943

jwhitehorn opened this issue Apr 27, 2024 · 2 comments

Comments

@jwhitehorn
Copy link

There is a chance I'm doing something incorrect here, and if so I'd love to better understand what. But, as of currently, I cannot get llama.cpp to successfully run with SYCL on my A770. It detects the GPU, and begins to load, but just "hangs" with the last log being "[SYCL] call ggml_backend_sycl_host_buffer_type".

When it is like this, a single CPU core is pegged at 100% from the process.

I've left it like this for hours, and it never progresses.

The full logs are:

Log start
main: build = 0 (unknown)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308) for x86_64-unknown-linux-gnu
main: seed  = 1714194721
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /models/llama-2-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 1
ggml_init_sycl: GGML_SYCL_F16: no
[SYCL] call ggml_backend_sycl_print_sycl_devices
found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 1|    [opencl:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       3.0|        512|    1024|     32|    16225243136|
| 2|    [opencl:cpu:0]|      Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz|       3.0|         48|    8192|     64|    67334434816|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         48|67108864|     64|    67334434816|
[SYCL] call ggml_backend_sycl_set_mul_device_mode
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
[SYCL] call ggml_backend_sycl_host_buffer_type
[SYCL] call ggml_backend_sycl_get_device_count
[SYCL] call ggml_backend_sycl_get_device_memory
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
[SYCL] call ggml_backend_sycl_buffer_type
llm_load_tensors: ggml ctx size =    0.30 MiB
[SYCL] call ggml_backend_sycl_host_buffer_type

Steps the reproduce:

Intel Arc A770
Debian 12 / 6.8.7-zabbly+ kernel

Running release b2749 in Docker, using the following Dockerfile:

ARG ONEAPI_VERSION=2024.1.0-devel-ubuntu22.04
FROM intel/oneapi-basekit:$ONEAPI_VERSION as runtime

WORKDIR /app

RUN apt-get update && \
    apt-get install -y git python3 libpython3.11 python3-pip python3-venv vim

RUN wget https://github.com/ggerganov/llama.cpp/archive/refs/tags/b2749.tar.gz && \
    tar -xvzf b2749.tar.gz && \
    cd /app/llama.cpp-b2749 && \
    mkdir build && \
    cd build && \
    if [ "${LLAMA_SYCL_F16}" = "ON" ]; then \
        echo "LLAMA_SYCL_F16 is set" && \
        export OPT_SYCL_F16="-DLLAMA_SYCL_F16=ON"; \
    fi && \
    cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx ${OPT_SYCL_F16} && \
    cmake --build . --config Release --target main

Executed with the following command:

docker run -it -v /mnt/storage/home/jason/llama/models/:/models/ \
    --device /dev/dri:/dev/dri \
    -e NEOReadDebugKeys=1 -e OverrideGpuAddressSpace=48 -e GGML_SYCL_DEVICE=0 \
    --entrypoint /bin/bash my-image:ver1

I then run the following command:

/app/llama.cpp-b2749/build/bin/main -m /models/llama-2-7b.Q4_0.gguf -i -ngl -1

@arthw
Copy link
Collaborator

arthw commented Apr 28, 2024

@jwhitehorn
Could I know some info about your case?

  1. Does Arc770 work well on with Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz for other workload? like gaming or show screen.
  2. what's your GPU driver version?
    run:
dpkg -l | grep zero

It would be the cause.

  1. Have you run llama.cpp for SYCL succuessfully in same machine?

@jwhitehorn
Copy link
Author

Thank you @arthw!

Turns out, the issue I was hitting is a Linux kernel-level regression. Your question about the driver version was helpful, as it ultimately led me to this open issue: intel/compute-runtime#726

Downgrading from 6.8.7 kernel to 6.8.4 resolved the issue I was experiencing.

Marking this one as closed.

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

No branches or pull requests

2 participants