Serve Llama 3.2 models as downloaded from Meta #10934
Unanswered · corbsmartin asked this question in Q&A · Replies: 0 comments
Hi all,
Without modifying the vLLM code, is it possible to deploy a Llama 3.2 model in the native format Meta provides when you download it via llama stack? For example, when you download Llama 3.2 1B using llama stack, you get the raw Meta checkpoint files, params.json among them, rather than a Hugging Face-format model directory.
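For reference, a native Meta download typically looks something like this; the checkpoint directory shown is an assumption (the llama CLI's default location can vary):

```bash
$ ls ~/.llama/checkpoints/Llama3.2-1B/
checklist.chk  consolidated.00.pth  params.json  tokenizer.model
```

Here consolidated.00.pth holds the raw weights in Meta's native layout, and params.json holds the architecture hyperparameters (dim, n_layers, n_heads, and so on). Neither matches the config.json plus safetensors layout vLLM expects from a Hugging Face model directory.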
vllm serve does not appear to be able to load and serve these checkpoints without first running the Hugging Face transformers conversion script.
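For comparison, the usual workaround is to run the Llama conversion script from the transformers repository and then serve the converted directory. A minimal sketch with hypothetical paths; the script lives under src/transformers/models/llama/ in the transformers repo, and its exact flags may differ between transformers versions:

```bash
# Convert Meta's native checkpoint into the Hugging Face format.
python convert_llama_weights_to_hf.py \
    --input_dir ~/.llama/checkpoints/Llama3.2-1B \
    --model_size 1B \
    --llama_version 3.2 \
    --output_dir ./Llama3.2-1B-hf

# Point vLLM at the converted directory.
vllm serve ./Llama3.2-1B-hf
```

Alternatively, `vllm serve meta-llama/Llama-3.2-1B` pulls the already-converted weights straight from the Hugging Face Hub (the repo is gated behind Meta's license), avoiding the local conversion step entirely.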
Questions
- Is there a supported way for vllm serve to load this native Meta checkpoint format directly, or is converting to the Hugging Face format first the only option?
Thanks in advance!