sysext: add llamaedge recipe #103

Merged: 1 commit into flatcar:main on Dec 3, 2024

Conversation

@hydai (Contributor) commented Nov 28, 2024

Add LlamaEdge sysext

This PR adds a sysext for running LlamaEdge on Flatcar, allowing users to deploy their own LLM on a cluster.

How to use

Run create_llamaedge_sysext.sh to build the .raw file.
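
The exact invocation isn't shown here; a hypothetical call, assuming the script follows the convention of the other bakery scripts and takes the release version and sysext name as arguments (check the script's usage first), could look like:

./create_llamaedge_sysext.sh 0.14.16 llamaedge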

Then, use the following config:

variant: flatcar
version: 1.0.0
storage:
  files:
    - path: /opt/extensions/wasmedge-0.14.1-x86-64.raw
      mode: 0420
      contents:
        source: https://github.com/flatcar/sysext-bakery/releases/download/latest/wasmedge-0.14.1-x86-64.raw
    - path: /opt/extensions/llamaedge-0.14.16-x86-64.raw
      mode: 0420
      contents:
        source: https://github.com/flatcar/sysext-bakery/releases/download/latest/llamaedge-0.14.16-x86-64.raw
  links:
    - target: /opt/extensions/llamaedge-0.14.16-x86-64.raw
      path: /etc/extensions/llamaedge.raw
      hard: false
    - target: /opt/extensions/wasmedge-0.14.1-x86-64.raw
      path: /etc/extensions/wasmedge.raw
      hard: false
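
This is a Butane config; to provision a machine with it, transpile it to Ignition JSON with the Butane CLI (the file names below are placeholders) and pass the result as user data when creating the Flatcar instance:

butane --pretty --strict llamaedge.bu > llamaedge.ign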

Testing done

I've verified the behavior on my Digital Ocean instance.

Configuration

YAML

variant: flatcar
version: 1.0.0
storage:
  files:
    - path: /opt/extensions/wasmedge-0.14.1-x86-64.raw
      mode: 0420
      contents:
        source: https://github.com/second-state/flatcar-sysext-bakery/releases/download/0.0.3/wasmedge-0.14.1-x86-64.raw
    - path: /opt/extensions/llamaedge-0.14.16-x86-64.raw
      mode: 0420
      contents:
        source: https://github.com/second-state/flatcar-sysext-bakery/releases/download/0.0.3/llamaedge-0.14.16-x86-64.raw
  links:
    - target: /opt/extensions/llamaedge-0.14.16-x86-64.raw
      path: /etc/extensions/llamaedge.raw
      hard: false
    - target: /opt/extensions/wasmedge-0.14.1-x86-64.raw
      path: /etc/extensions/wasmedge.raw
      hard: false

JSON

{
   "ignition":{
      "version":"3.3.0"
   },
   "storage":{
      "files":[
         {
            "path":"/opt/extensions/wasmedge-0.14.1-x86-64.raw",
            "contents":{
               "source":"https://github.com/second-state/flatcar-sysext-bakery/releases/download/0.0.3/wasmedge-0.14.1-x86-64.raw"
            },
            "mode":272
         },
         {
            "path":"/opt/extensions/llamaedge-0.14.16-x86-64.raw",
            "contents":{
               "source":"https://github.com/second-state/flatcar-sysext-bakery/releases/download/0.0.3/llamaedge-0.14.16-x86-64.raw"
            },
            "mode":272
         }
      ],
      "links":[
         {
            "path":"/etc/extensions/llamaedge.raw",
            "hard":false,
            "target":"/opt/extensions/llamaedge-0.14.16-x86-64.raw"
         },
         {
            "path":"/etc/extensions/wasmedge.raw",
            "hard":false,
            "target":"/opt/extensions/wasmedge-0.14.1-x86-64.raw"
         }
      ]
   }
}
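
Once the instance has booted with this configuration, both extensions can be verified as merged before continuing (these are standard checks, not part of this PR):

# Lists the merged extension images; llamaedge and wasmedge should appear.
systemd-sysext status
# Confirms the WasmEdge runtime shipped by the sysext is available on the PATH.
wasmedge --version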

Prepare the model

The model choice depends on the hardware available; I chose a smaller model due to the limitations of my Digital Ocean instance.

wget https://huggingface.co/second-state/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q2_K.gguf

Start the server

The Wasm application is shipped inside the sysext image; use the following path: /usr/lib/wasmedge/wasm/llama-api-server.wasm.

You can also reduce CONTEXT_SIZE if running on an instance with limited memory.

MODEL_FILE="Llama-3.2-1B-Instruct-Q2_K.gguf"
API_SERVER_WASM="/usr/lib/wasmedge/wasm/llama-api-server.wasm"
PROMPT_TEMPLATE="llama-3-chat"
CONTEXT_SIZE=128
MODEL_NAME="llama-3.2-1B"

wasmedge \
  --dir .:. \
  --nn-preload default:GGML:AUTO:${MODEL_FILE} \
  ${API_SERVER_WASM} \
  --prompt-template ${PROMPT_TEMPLATE} \
  --ctx-size ${CONTEXT_SIZE} \
  --model-name ${MODEL_NAME}

It will load the model into memory and start the OpenAI-compatible API server.

The expected output should be:

..omitted..
[2024-11-28 09:09:08.909] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
[2024-11-28 09:09:08.920] [info] llama_core in crates/llama-core/src/lib.rs:128: running mode: chat
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:140: The core context has been initialized
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:230: Getting the plugin info
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:418: Get the running mode.
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:443: running mode: chat
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:312: Getting the plugin info by the graph named llama-3.2-1B
[2024-11-28 09:09:08.923] [info] llama_core::utils in crates/llama-core/src/utils.rs:175: Get the output buffer generated by the model named llama-3.2-1B
[2024-11-28 09:09:08.924] [info] llama_core::utils in crates/llama-core/src/utils.rs:193: Output buffer size: 95
[2024-11-28 09:09:08.924] [info] llama_core in crates/llama-core/src/lib.rs:372: Plugin info: b4067(commit 54ef9cfc)
[2024-11-28 09:09:08.924] [info] llama_api_server in llama-api-server/src/main.rs:459: plugin_ggml_version: b4067 (commit 54ef9cfc)
[2024-11-28 09:09:08.930] [info] llama_api_server in llama-api-server/src/main.rs:504: Listening on 0.0.0.0:8080
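
To keep the server running across reboots, the command above can be wrapped in a systemd unit. The sketch below is an assumption, not part of this PR: the wasmedge binary path and the working directory holding the model file need to match your own setup.

[Unit]
Description=LlamaEdge OpenAI-compatible API server
After=network-online.target
Wants=network-online.target

[Service]
# Assumes the GGUF model was downloaded into this directory.
WorkingDirectory=/var/lib/llamaedge
ExecStart=/usr/bin/wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q2_K.gguf \
  /usr/lib/wasmedge/wasm/llama-api-server.wasm \
  --prompt-template llama-3-chat --ctx-size 128 --model-name llama-3.2-1B
Restart=on-failure

[Install]
WantedBy=multi-user.target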

Interact with the API server

Please check the LlamaEdge documentation for more details on the available options: https://github.com/LlamaEdge/LlamaEdge/tree/main/llama-api-server

Get model list

curl -X GET http://localhost:8080/v1/models -H 'accept:application/json'

Expected output:

{
   "object":"list",
   "data":[
      {
         "id":"llama-3.2-1B",
         "created":1732784948,
         "object":"model",
         "owned_by":"Not specified"
      }
   ]
}

Chat completion

curl -X POST http://localhost:8080/v1/chat/completions \
    -H 'accept:application/json' \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"system", "content": "You are a helpful assistant. Reply in short sentence"}, {"role":"user", "content": "What is the capital of Japan?"}], "model":"llama-3.2-1B"}'

Expected output:

{
   "id":"chatcmpl-cdf8f57f-70ec-4cb3-b1f3-e60054f64981",
   "object":"chat.completion",
   "created":1732785197,
   "model":"llama-3.2-1B",
   "choices":[
      {
         "index":0,
         "message":{
            "content":"The capital of Japan is Tokyo.",
            "role":"assistant"
         },
         "finish_reason":"stop",
         "logprobs":null
      }
   ],
   "usage":{
      "prompt_tokens":33,
      "completion_tokens":9,
      "total_tokens":42
   }
}
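
The server follows the OpenAI request shape, so streaming should also work by adding "stream": true to the request body (streaming support and the event-stream response format are assumptions here; see the llama-api-server documentation linked above):

curl -X POST http://localhost:8080/v1/chat/completions \
    -H 'accept:text/event-stream' \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user", "content": "What is the capital of Japan?"}], "model":"llama-3.2-1B", "stream":true}'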

Signed-off-by: hydai <hydai@secondstate.io>
@tormath1 (Contributor) left a comment


Thanks for this contribution, it's exciting to see this running on Flatcar :)

@tormath1 tormath1 merged commit dd38a27 into flatcar:main Dec 3, 2024
@hydai hydai deleted the add_llamaedge branch December 3, 2024 14:56