llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592
How did you get the response? I'm struggling to figure out how to post a request to the running llama.cpp server. Would you be able to provide an example? E.g. what URL (/v1/chat/completions?)?
The URL I used is host:port/completions. What errors are you getting?
I got it to work! Thank you. For reference to anyone who finds this thread: the parts I was messing up were the PNG buffer format and the endpoint I was posting to.
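For anyone who lands here later, a minimal sketch of such a request (not the poster's exact code; it assumes a llama.cpp server started with --mmproj listening on 127.0.0.1:8080, the /completion endpoint, and the image_data fields shown later in this thread; test.png is a placeholder file name):

import base64
import requests

# Base64-encode the raw PNG bytes; the server expects the encoded file
# contents in image_data, not a file path.
with open("test.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    # The [img-10] tag in the prompt must match the "id" in image_data.
    "prompt": "USER: [img-10] Describe this image.\nASSISTANT:",
    "image_data": [{"id": 10, "data": img_b64}],
    "n_predict": 128,
    "temperature": 0.1,
}

resp = requests.post("http://127.0.0.1:8080/completion", json=payload)
print(resp.json()["content"])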
@ggerganov Any updates here? When setting the same image id, the effect is basically the same as what you posted in the PR for llava and batch processing of the server (#3677). After setting different ids, it looks like only one slot (in my case, slot 0 again) has image data included in the inference. Here is an example:
Inference code:

import asyncio
import base64
import copy

from httpx import AsyncClient
from objprint import objprint

client = AsyncClient(timeout=3600)
URL = "http://127.0.0.1:8080/completion"
DATA = {
    "image_data": [],
    "n_predict": 400,
    "prompt": "",
    "repeat_last_n": 128,
    "repeat_penalty": 1.2,
    "slot_id": -1,
    "stop": ["</s>", "ASSISTANT:", "USER:"],
    "top_k": 40,
    "top_p": 0.9,
    "temperature": 0.1,
}
SLOTS = 4
rq_count = 0


def construct_data(prompt, image, slot_id):
    # Assign slots round-robin and give each slot its own image id (10 + slot),
    # so the [img-N] tag in the prompt matches the entry in image_data.
    if slot_id == -1:
        slot_id = rq_count % SLOTS
    img_id = 10 + slot_id
    prompt = prompt.replace("<img>", f"[img-{img_id}]")
    img_str = base64.b64encode(open(image, "rb").read()).decode("utf-8")
    data = copy.deepcopy(DATA)
    data["image_data"] = [{
        "id": img_id,
        "data": img_str,
    }]
    data["prompt"] = prompt
    data["slot_id"] = slot_id
    return data


async def rq_img(image):
    global rq_count
    data = construct_data(
        "USER: <img> Describe this Image with short sentence.\nASSISTANT:",
        image,
        -1,
    )
    rq_count += 1
    resp = await client.post(URL, json=data)
    try:
        resp = resp.json()
    except Exception:
        resp = resp.text
    return resp, data["slot_id"]


async def main():
    image = "./test.jpg"
    results = await asyncio.gather(*(rq_img(image) for _ in range(4)))
    for res in results:
        print(f"slot={res[1]}")
        objprint(res[0]["content"])
        print("\n\n")


asyncio.run(main())

And the result:
Server log is here:
No updates. Short term, we will drop multimodal support from the server.
I'm using a custom model and observe different results. Batching did provide the image to each slot, but it interfered with the generation process. Slot 0 behaved as expected, but all other slots responded in Simplified Chinese (which is very unexpected). At first I thought it was gibberish, but then I realized that it translates correctly to nearly the expected output. So, in some way, batch generation is working; it just somehow disturbs the generation process.
Thanks for your information!
Thank you for the update. I will try fixing the issue in my own time and let you know if there are any changes. Thank you for your work on llama.cpp. It is amazing!
This issue was closed because it has been inactive for 14 days since being marked as stale.
Has multimodal support been re-introduced for the server?
When I try to do parallel inference on the llama.cpp server for multimodal, I get the correct output for slot 0, but not for the other slots. Does that mean that CLIP is only being loaded for one slot? I can see some CLIP layers failing to load.
Here is the llama.cpp server command that I use:
./server -m ../models/llava13b1_5/llava13b1_5_f16.gguf -c 40960 --n-gpu-layers 41 --port 8001 --mmproj ../models/llava13b1_5/llava13b1_5_mmproj_f16.gguf -np 10 -cb --host 0.0.0.0 --threads 24
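A minimal sketch for reproducing the per-slot difference (assuming the /completion endpoint and the slot_id and image_data payload fields used elsewhere in this thread, plus the server command above; test.jpg is a placeholder): send the same image to two explicit slots concurrently and compare the outputs.

import base64
import concurrent.futures
import requests

URL = "http://127.0.0.1:8001/completion"

with open("test.jpg", "rb") as f:
    IMG = base64.b64encode(f.read()).decode("utf-8")

def ask(slot_id):
    # Each slot gets its own image id so the [img-N] tag in the prompt
    # matches the entry in image_data.
    img_id = 10 + slot_id
    payload = {
        "prompt": f"USER: [img-{img_id}] Describe this image.\nASSISTANT:",
        "image_data": [{"id": img_id, "data": IMG}],
        "slot_id": slot_id,
        "n_predict": 128,
        "temperature": 0.1,
    }
    return slot_id, requests.post(URL, json=payload).json()["content"]

# Fire both requests at once so they land on different slots.
with concurrent.futures.ThreadPoolExecutor() as pool:
    for slot, content in pool.map(ask, [0, 1]):
        print(f"slot {slot}: {content}")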
The model I am using:
https://huggingface.co/mys/ggml_llava-v1.5-13b/tree/main
I am using the F16 model with the mmproj file.
Documentation reference:
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
My GPU specs
My CPU specs
Loading the llama.cpp server for llava and using slot 0 for inference works as expected.
When using the other slots, that is, parallel inferencing, the output is wrong:
Prompt:
The model_type parameter in my payload is only for a proxy server that reroutes all the requests.
The image looks like this: