Parallel requests are sent to a Ray Serve OpenAI Chat Completions API endpoint built by following this guide: Serve a Large Language Model with vLLM — Ray 2.41.0.
The model is qwen2-vl, and each request contains both text and image prompts.
Handling one request at a time works fine, but parallel requests fail whenever max_ongoing_requests >= 2.
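For reference, a minimal sketch of the kind of concurrent client that triggers this, using only the standard library. The endpoint URL, model name, and image URL below are placeholders, not taken from the actual deployment:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

# Hypothetical endpoint and model names -- adjust to your Serve deployment.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2-VL-7B-Instruct"

def build_payload(image_url: str, question: str) -> bytes:
    """OpenAI-style chat payload mixing an image part and a text part."""
    body = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
    }
    return json.dumps(body).encode()

def send(payload: bytes) -> str:
    req = request.Request(URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    payloads = [build_payload("https://example.com/cat.jpg",
                              "What is in the image?")
                for _ in range(4)]
    # Two or more requests in flight at once is enough to hit the error
    # when the deployment is configured with max_ongoing_requests >= 2.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for result in pool.map(send, payloads):
            print(result)
```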
The error stack trace is shown below:
ERROR 2025-01-23 00:22:21,963 vl_VLLMDeployment 4gtvteb2 e1d433cc-e551-4e5e-b10e-986dea9fe1ad /v1/chat/completions llm.py:128 - Error in generate()
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
hidden_or_intermediate_states = model_executable(
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1287, in forward
inputs_embeds = self._merge_multimodal_embeddings(
File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1237, in _merge_multimodal_embeddings
inputs_embeds[mask, :] = multimodal_embeddings
RuntimeError: shape mismatch: value tensor of shape [644, 3584] cannot be broadcast to indexing result of shape [322, 3584]
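A minimal sketch (plain Python, no torch) of the failing assignment in _merge_multimodal_embeddings, i.e. inputs_embeds[mask, :] = multimodal_embeddings. The numbers come straight from the trace: the mask selects 322 image-placeholder positions, but the value tensor carries 644 = 2 * 322 rows, which is exactly what you would get if the image embeddings of two batched requests were concatenated while the mask was built for only one of them. The helper below is illustrative, not vLLM's actual implementation:

```python
def masked_assign(embeds, mask, values):
    """Mimic tensor[mask, :] = values: row counts must match exactly."""
    selected = [i for i, m in enumerate(mask) if m]
    if len(values) != len(selected):
        raise RuntimeError(
            f"shape mismatch: value tensor of shape [{len(values)}, 3584] "
            f"cannot be broadcast to indexing result of shape "
            f"[{len(selected)}, 3584]")
    for i, row in zip(selected, values):
        embeds[i] = row
    return embeds

seq_len = 1000
mask = [i < 322 for i in range(seq_len)]   # 322 image-placeholder slots
one_request = [[0.0]] * 322                # image rows for a single request
two_requests = [[0.0]] * 644               # concatenated rows, 2 * 322

masked_assign([[0.0]] * seq_len, mask, one_request)   # succeeds
try:
    masked_assign([[0.0]] * seq_len, mask, two_requests)
except RuntimeError as e:
    print(e)  # reproduces the message from the trace
```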