llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592

tellsiddh · 2024-02-19T18:16:43Z

When I am trying to do parallel inferencing on llama cpp server for multimodal, I am getting the correct output for slot 0, but for other slots, I am not. Does that mean that clip is only being loaded on one slot? I can see some clip layers failing to load.

Here is my llama cpp server code that I use.

./server -m ../models/llava13b1_5/llava13b1_5_f16.gguf -c 40960 --n-gpu-layers 41 --port 8001 --mmproj ../models/llava13b1_5/llava13b1_5_mmproj_f16.gguf -np 10 -cb --host 0.0.0.0 --threads 24

The model I am using -
https://huggingface.co/mys/ggml_llava-v1.5-13b/tree/main

I am using the F16 model with mmproj file.

Documentation reference

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

My GPU specs

My CPU specs

Loading llama cpp server for llava, using slot 0 for inference.

ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
{"timestamp":1708365483,"level":"INFO","function":"main","line":2536,"message":"build info","build":2167,"commit":"5bf2b94d"}
{"timestamp":1708365483,"level":"INFO","function":"main","line":2539,"message":"system info","n_threads":24,"n_threads_batch":-1,"total_threads":28,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "}

llama server listening at http://0.0.0.0:8001

{"timestamp":1708365483,"level":"INFO","function":"main","line":2643,"message":"HTTP server listening","port":"8001","hostname":"0.0.0.0"}
Multi Modal Mode Enabledclip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 2
clip_model_load: alignment:    32
clip_model_load: n_tensors:    377
clip_model_load: n_kv:         18
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 18 key-value pairs and 377 tensors from ../models/llava13b1_5/llava13b1_5_mmproj_f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   8:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv   9:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  10:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  11:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  12:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  13:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  14:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  15:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  16:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  17:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  235 tensors
clip_model_load: - type  f16:  142 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     615.49 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: params backend buffer size =  615.49 MB (377 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_model_loader: loaded meta data with 18 key-value pairs and 363 tensors from ../models/llava13b1_5/llava13b1_5_f16.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 24.24 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =   312.50 MiB
llm_load_tensors:      CUDA0 buffer size = 24514.08 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 40960
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size = 32000.00 MiB
llama_new_context_with_model: KV self size  = 32000.00 MiB, K (f16): 16000.00 MiB, V (f16): 16000.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    91.16 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3320.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.00 MiB
llama_new_context_with_model: graph splits (measure): 3
Available slots:
 -> Slot 0 - max context: 4096
 -> Slot 1 - max context: 4096
 -> Slot 2 - max context: 4096
 -> Slot 3 - max context: 4096
 -> Slot 4 - max context: 4096
 -> Slot 5 - max context: 4096
 -> Slot 6 - max context: 4096
 -> Slot 7 - max context: 4096
 -> Slot 8 - max context: 4096
 -> Slot 9 - max context: 4096
{"timestamp":1708365486,"level":"INFO","function":"main","line":2664,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 - loaded image
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0 - encoding image [id: 1]

print_timings: prompt eval time =     349.34 ms /     1 tokens (  349.34 ms per token,     2.86 tokens per second)
print_timings:        eval time =    1599.23 ms /    72 runs   (   22.21 ms per token,    45.02 tokens per second)
print_timings:       total time =    1948.57 ms
slot 0 released (73 tokens in cache)

When using the other slot, that is parallel inferencing -

slot 1 is processing [task id: 74]
slot 1 : kv cache rm - [0, end)
slot 1 - encoding image [id: 1]

print_timings: prompt eval time =     278.78 ms /     1 tokens (  278.78 ms per token,     3.59 tokens per second)
print_timings:        eval time =    2573.45 ms /   113 runs   (   22.77 ms per token,    43.91 tokens per second)
print_timings:       total time =    2852.24 ms
slot 1 released (114 tokens in cache)

Prompt
model_type parameter in my payload is only for a proxy server that is rerouting all the requests.

Image looks like this

The text was updated successfully, but these errors were encountered:

jacob-hansen · 2024-03-04T18:43:34Z

How did you get the response? I'm struggling to figure out how to post a request to the llama cpp server running. Would you be able to provide an example? E.g. what url (/v1/chat/completions??)

tellsiddh · 2024-03-04T19:34:15Z

How did you get the response? I'm struggling to figure out how to post a request to the llama cpp server running. Would you be able to provide an example? E.g. what url (/v1/chat/completions??)

the url i used is host:port/completions

What errors are you getting?

jacob-hansen · 2024-03-04T22:24:08Z

I got it to work! Thank you. For reference to anyone who finds this thread:

import requests
import json
import base64
from PIL import Image
import io

def get_model_response(question, image, temperature = 0.1):
    url = 'http://localhost:8001/completion'
    data = {
            "model_type": "llama_cpp",
            "prompt": question,
            "temperature": temperature,
            "image_data": [
                {
                    "data": image,
                    "id": 1
                }
            ]
            }
    response = requests.post(url, json=data)
    if response.status_code != 200:
        print(response)
        raise Exception(f"Encountered an error: {response.text}")
    return json.loads(response.text)

question = "The assistant gives helpful, detailed and polite responses to the questions.\nUSER: [img-1] What is this image showing?\nASSISTANT: "
img_path = "./images/lake.jpg"
with open(img_path, 'rb') as image_file:
    image = Image.open(image_file)
    image = image.resize((256, 256))
    buffered = io.BytesIO()
    image_format = 'PNG'
    image.save(buffered, format=image_format)
    img_str = base64.b64encode(buffered.getvalue()).decode('utf-8')

response = get_model_response(question, img_str)

The part I was messing up was the PNG buffered format and the endpoint I was posting to.

KohakuBlueleaf · 2024-03-06T03:11:29Z

@ggerganov Any updates here?
I can confirm I met same problem. Have tried to set different or same img_id accross request.

When setting to same id. The effect is basically as same as what you posted in the PR for llava and batch processing of server. #3677

After setting to different id. It looks like only 1 slot (in my case, slot-0 too) have image data include in inference.
Other slot's result are basically just randomly output some string.

Here is some example:
server command:

./server -m ./models/LLaVA-1.5-14B/ggml-model-q5_k.gguf --mmproj ./models/LLaVA-1.5-14B/mmproj-model-f16.gguf --ctx_size 8192 -ngl 99 -t 8 -np 4 -cb

Inference code

import asyncio
import base64
import copy

from httpx import AsyncClient
from objprint import objprint


client = AsyncClient(timeout=3600)
URL = "http://127.0.0.1:8080/completion"
DATA = {
    "image_data": [],
    "n_predict": 400,
    "prompt": "",
    "repeat_last_n": 128,
    "repeat_penalty": 1.2,
    "slot_id": -1,
    "stop": ["</s>", "ASSISTANT:", "USER:"],
    "top_k": 40,
    "top_p": 0.9,
    "temperature": 0.1,
}
SLOTS = 4
rq_count = 0


def construct_data(prompt, image, slot_id):
    if slot_id == -1:
        slot_id = rq_count % SLOTS
    img_id = 10+slot_id
    prompt = prompt.replace("<img>", f"[img-{img_id}]")
    img_str = base64.b64encode(open(image, "rb").read()).decode("utf-8")
    data = copy.deepcopy(DATA)
    data["image_data"] = [{
        "id": img_id,
        "data": img_str
    }]
    data["prompt"] = "prompt"
    data["slot_id"] = slot_id
    return data


async def rq_img(image):
    global rq_count
    data = construct_data(
        "USER: <img> Describe this Image with short sentence.\nASSISTANT:",
        "./test.jpg",
        -1
    )
    rq_count += 1
    resp = await client.post(URL, json=data)
    try:
        resp = resp.json()
    except:
        resp = resp.text
    return resp, data["slot_id"]


async def main():
    image = "./test.jpg"
    result = await asyncio.gather(*(rq_img(image) for _ in range(4)))
    for res in result:
        print(f"slot={res[1]}")
        objprint(res[0]["content"])
        print("\n\n")


asyncio.run(main())

And the result:

slot=0
's:
1. A woman with white hair and a blue flower in her hair, wearing only wings on top of her body. She is looking at the camera.'



slot=1
's, and the ability to create custom prompts.

### Prompt Engineering

Prompts are an essential part of GPT-3's functionality. They provide a starting point for generating text based on specific inputs or contexts. In this section, we will explore some techniques used in prompt engineering:

1. **Multiple Choice**: Providing multiple choice options can help guide the model towards more focused and relevant responses. For example, asking "What is your favorite color?" with a list of choices (e.g., red, blue, green) might lead to better results than an open-ended question like "Tell me about your day."
2. **Contextualization**: Including context in prompts can help the model understand the situation and generate more relevant responses. For example, asking "What is the capital of France?" with a brief explanation (e.g., "The country that borders Germany") might improve accuracy compared to an uncontextualized question.
3. **Prompt Design**: The way prompts are phrased can influence how GPT-3 interprets and responds to them. For example, asking "What is the best way to invest in stocks?" may yield more useful results than a vague or ambiguous query like "Tell me about investing."
4. **Iterative Prompting**: Repeating prompts with slight variations can help GPT-3 learn and adapt its responses over time. For example, asking the same question multiple times (e.g., "What is your favorite color?") may lead to more accurate or diverse answers as the model becomes better acquainted with the user's preferences.
5. **Custom Prompts**: Creating custom prompts tailored to specific tasks can improve GPT-3's performance in those areas. For example, a developer might create a series of prompts designed to help'



slot=2
's, and the user's response.

Here is an example of how you can use this function to create a simple quiz:
```python
def ask_and_answer(prompt):
    print(f"{prompt} ")
    answer = input("Your Answer: ")
    return answer

print("Welcome to the Quiz!")
question1_prompt = "What is the capital of France?"
user_response1 = ask_and_answer(question1_prompt)
if user_response1.lower() == 'paris':
    print("Correct! The correct answer is Paris.")
else:
    print("Incorrect. The correct answer is Paris.")

question2_prompt = "What is the largest planet in our solar system?"
user_response2 = ask_and_answer(question2_prompt)
if user_response2.lower() == 'jupiter':
    print("Correct! The correct answer is Jupiter.")
else:
    print("Incorrect. The correct answer is Jupiter.")

print("Thanks for playing the quiz!")
```'



slot=3
's, and the user can choose to accept or decline each prompt. If a prompt is accepted, it will be added to the list of completed tasks in the app.

The app also includes a feature that allows users to track their progress over time by viewing their completed tasks on a calendar or timeline. This helps them stay motivated and accountable for completing their daily goals. Additionally, there is an option to set reminders for specific tasks, ensuring that the user stays focused and organized throughout the day.

In summary, this app provides users with a simple yet effective way to manage their time by setting daily goals, creating prompts or to-do lists, tracking progress over time, and receiving reminder notifications when necessary.'

KohakuBlueleaf · 2024-03-06T03:21:48Z

Server log is here:

{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":979,"tid":"0x1ef64fac0","timestamp":1709694377}
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in   644.12 ms by CLIP (    1.12 ms per image patch)
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":1,"task_id":980,"tid":"0x1ef64fac0","timestamp":1709694385}
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in   299.43 ms by CLIP (    0.52 ms per image patch)
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":2,"task_id":981,"tid":"0x1ef64fac0","timestamp":1709694392}
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in   283.34 ms by CLIP (    0.49 ms per image patch)
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":3,"task_id":982,"tid":"0x1ef64fac0","timestamp":1709694399}
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in   297.94 ms by CLIP (    0.52 ms per image patch)
{"function":"print_timings","level":"INFO","line":260,"msg":"prompt eval time     =   28814.78 ms /     1 tokens (28814.78 ms per token,     0.03 tokens per second)","n_prompt_tokens_processed":1,"n_tokens_second":0.03470440608874923,"slot_id":0,"t_prompt_processing":28814.785,"t_token":28814.785,"task_id":979,"tid":"0x1ef64fac0","timestamp":1709694417}
{"function":"print_timings","level":"INFO","line":274,"msg":"generation eval time =   11045.20 ms /    36 runs   (  306.81 ms per token,     3.26 tokens per second)","n_decoded":36,"n_tokens_second":3.2593346665822858,"slot_id":0,"t_token":306.81108333333333,"t_token_generation":11045.199,"task_id":979,"tid":"0x1ef64fac0","timestamp":1709694417}
{"function":"print_timings","level":"INFO","line":283,"msg":"          total time =   39859.98 ms","slot_id":0,"t_prompt_processing":28814.785,"t_token_generation":11045.199,"t_total":39859.984,"task_id":979,"tid":"0x1ef64fac0","timestamp":1709694417}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":58720,"status":200,"tid":"0x16e0bf000","timestamp":1709694417}
{"function":"update_slots","level":"INFO","line":1644,"msg":"slot released","n_cache_tokens":37,"n_ctx":8192,"n_past":613,"n_system_tokens":0,"slot_id":0,"task_id":979,"tid":"0x1ef64fac0","timestamp":1709694417,"truncated":false}
{"function":"print_timings","level":"INFO","line":260,"msg":"prompt eval time     =    7034.96 ms /     1 tokens ( 7034.96 ms per token,     0.14 tokens per second)","n_prompt_tokens_processed":1,"n_tokens_second":0.14214717862015458,"slot_id":3,"t_prompt_processing":7034.962,"t_token":7034.962,"task_id":982,"tid":"0x1ef64fac0","timestamp":1709694458}
{"function":"print_timings","level":"INFO","line":274,"msg":"generation eval time =   51638.63 ms /   156 runs   (  331.02 ms per token,     3.02 tokens per second)","n_decoded":156,"n_tokens_second":3.0209941080738565,"slot_id":3,"t_token":331.0168653846154,"t_token_generation":51638.631,"task_id":982,"tid":"0x1ef64fac0","timestamp":1709694458}
{"function":"print_timings","level":"INFO","line":283,"msg":"          total time =   58673.59 ms","slot_id":3,"t_prompt_processing":7034.962,"t_token_generation":51638.631,"t_total":58673.593,"task_id":982,"tid":"0x1ef64fac0","timestamp":1709694458}
{"function":"update_slots","level":"INFO","line":1644,"msg":"slot released","n_cache_tokens":157,"n_ctx":8192,"n_past":733,"n_system_tokens":0,"slot_id":3,"task_id":982,"tid":"0x1ef64fac0","timestamp":1709694458,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":58723,"status":200,"tid":"0x16e37b000","timestamp":1709694458}
{"function":"print_timings","level":"INFO","line":260,"msg":"prompt eval time     =   13974.84 ms /     1 tokens (13974.84 ms per token,     0.07 tokens per second)","n_prompt_tokens_processed":1,"n_tokens_second":0.0715571596444525,"slot_id":2,"t_prompt_processing":13974.842,"t_token":13974.842,"task_id":981,"tid":"0x1ef64fac0","timestamp":1709694492}
{"function":"print_timings","level":"INFO","line":274,"msg":"generation eval time =   85326.83 ms /   256 runs   (  333.31 ms per token,     3.00 tokens per second)","n_decoded":256,"n_tokens_second":3.000228544753325,"slot_id":2,"t_token":333.30794140625,"t_token_generation":85326.833,"task_id":981,"tid":"0x1ef64fac0","timestamp":1709694492}
{"function":"print_timings","level":"INFO","line":283,"msg":"          total time =   99301.68 ms","slot_id":2,"t_prompt_processing":13974.842,"t_token_generation":85326.833,"t_total":99301.675,"task_id":981,"tid":"0x1ef64fac0","timestamp":1709694492}
{"function":"update_slots","level":"INFO","line":1644,"msg":"slot released","n_cache_tokens":257,"n_ctx":8192,"n_past":833,"n_system_tokens":0,"slot_id":2,"task_id":981,"tid":"0x1ef64fac0","timestamp":1709694492,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":58722,"status":200,"tid":"0x16e2ef000","timestamp":1709694492}
{"function":"print_timings","level":"INFO","line":260,"msg":"prompt eval time     =   21135.36 ms /     1 tokens (21135.36 ms per token,     0.05 tokens per second)","n_prompt_tokens_processed":1,"n_tokens_second":0.04731407013516021,"slot_id":1,"t_prompt_processing":21135.362,"t_token":21135.362,"task_id":980,"tid":"0x1ef64fac0","timestamp":1709694513}
{"function":"print_timings","level":"INFO","line":274,"msg":"generation eval time =  106386.09 ms /   400 runs   (  265.97 ms per token,     3.76 tokens per second)","n_decoded":400,"n_tokens_second":3.7598899322301462,"slot_id":1,"t_token":265.96523249999996,"t_token_generation":106386.093,"task_id":980,"tid":"0x1ef64fac0","timestamp":1709694513}
{"function":"print_timings","level":"INFO","line":283,"msg":"          total time =  127521.45 ms","slot_id":1,"t_prompt_processing":21135.362,"t_token_generation":106386.093,"t_total":127521.45499999999,"task_id":980,"tid":"0x1ef64fac0","timestamp":1709694513}
{"function":"update_slots","level":"INFO","line":1644,"msg":"slot released","n_cache_tokens":401,"n_ctx":8192,"n_past":977,"n_system_tokens":0,"slot_id":1,"task_id":980,"tid":"0x1ef64fac0","timestamp":1709694513,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2737,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":58721,"status":200,"tid":"0x16e1d7000","timestamp":1709694513}

ggerganov · 2024-03-06T07:46:31Z

No updates. Short term we will drop multimodal support from server (#5882) and will potentially re-introduce it in the future. But not high-priority atm

jacob-hansen · 2024-03-06T16:27:43Z

I'm using a custom model and observe different results. Batching did provide the image to each slot, but messed with the generation process. Slot 0 was as expected, but all other slots responded in Simplified Chinese (which is very unexpected). At first I thought it was garbly goop, but then I realized that it translates correctly to a near expected output. So, in some way, batch generation is working (just somehow messing with the generation process).

KohakuBlueleaf · 2024-03-06T16:34:08Z

I'm using a custom model and observe different results. Batching did provide the image to each slot, but messed with the generation process. Slot 0 was as expected, but all other slots responded in Simplified Chinese (which is very unexpected). At first I thought it was garbly goop, but then I realized that it translates correctly to a near expected output. So, in some way, batch generation is working (just somehow messing with the generation process).

Thx for your information!!!
Hope we will have workaround soon

tellsiddh · 2024-03-06T16:44:53Z

No updates. Short term we will drop multimodal support from server (#5882) and will potentially re-introduce it in the future. But not high-priority atm

Thank you for your update. I will try fixing the issue on my own time and let you know if there are any changes. Thank you for your work on llama cpp. It is amazing!

github-actions · 2024-05-07T01:06:41Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

ElhamAhmedian · 2024-07-24T09:56:34Z

Has Multimodal support been re-introduced for server?

tellsiddh added the bug-unconfirmed label Feb 19, 2024

phymbert mentioned this issue Mar 12, 2024

server : improvements and maintenance #4216

Open

10 tasks

phymbert added server/webui llava labels Mar 22, 2024

github-actions bot added the stale label Apr 22, 2024

github-actions bot closed this as completed May 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592

llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592

tellsiddh commented Feb 19, 2024

jacob-hansen commented Mar 4, 2024

tellsiddh commented Mar 4, 2024 •

edited

Loading

jacob-hansen commented Mar 4, 2024 •

edited

Loading

KohakuBlueleaf commented Mar 6, 2024

KohakuBlueleaf commented Mar 6, 2024

ggerganov commented Mar 6, 2024

jacob-hansen commented Mar 6, 2024

KohakuBlueleaf commented Mar 6, 2024

tellsiddh commented Mar 6, 2024

github-actions bot commented May 7, 2024

ElhamAhmedian commented Jul 24, 2024

llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592

llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592

Comments

tellsiddh commented Feb 19, 2024

jacob-hansen commented Mar 4, 2024

tellsiddh commented Mar 4, 2024 • edited Loading

jacob-hansen commented Mar 4, 2024 • edited Loading

KohakuBlueleaf commented Mar 6, 2024

KohakuBlueleaf commented Mar 6, 2024

ggerganov commented Mar 6, 2024

jacob-hansen commented Mar 6, 2024

KohakuBlueleaf commented Mar 6, 2024

tellsiddh commented Mar 6, 2024

github-actions bot commented May 7, 2024

ElhamAhmedian commented Jul 24, 2024

tellsiddh commented Mar 4, 2024 •

edited

Loading

jacob-hansen commented Mar 4, 2024 •

edited

Loading