Parallel requests are sent to a Ray Serve OpenAI Chat Completions API endpoint built by following this guide: Serve a Large Language Model with vLLM — Ray 2.41.0.
The model is qwen2-vl, and each request contains both text and image prompts.
Handling one request at a time works fine, but parallel requests fail whenever max_ongoing_requests >= 2.
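For reference, a minimal sketch of the kind of concurrent client that triggers this, using only the standard library. The endpoint URL, model name, and image URL below are placeholders, not taken from the actual deployment:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

# Hypothetical endpoint and model names -- adjust to your Serve deployment.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2-VL-7B-Instruct"

def build_payload(image_url: str, question: str) -> bytes:
    """OpenAI-style chat payload mixing an image part and a text part."""
    body = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
    }
    return json.dumps(body).encode()

def send(payload: bytes) -> str:
    req = request.Request(URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    payloads = [build_payload("https://example.com/cat.jpg",
                              "What is in the image?")
                for _ in range(4)]
    # Two or more requests in flight at once is enough to hit the error
    # when the deployment is configured with max_ongoing_requests >= 2.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for result in pool.map(send, payloads):
            print(result)
```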
The error stack trace is shown below:
ERROR 2025-01-23 00:22:21,963 vl_VLLMDeployment 4gtvteb2 e1d433cc-e551-4e5e-b10e-986dea9fe1ad /v1/chat/completions llm.py:128 - Error in generate()
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
hidden_or_intermediate_states = model_executable(
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1287, in forward
inputs_embeds = self._merge_multimodal_embeddings(
File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1237, in _merge_multimodal_embeddings
inputs_embeds[mask, :] = multimodal_embeddings
RuntimeError: shape mismatch: value tensor of shape [644, 3584] cannot be broadcast to indexing result of shape [322, 3584]
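A minimal sketch (plain Python, no torch) of the failing assignment in _merge_multimodal_embeddings, i.e. inputs_embeds[mask, :] = multimodal_embeddings. The numbers come straight from the trace: the mask selects 322 image-placeholder positions, but the value tensor carries 644 = 2 * 322 rows, which is exactly what you would get if the image embeddings of two batched requests were concatenated while the mask was built for only one of them. The helper below is illustrative, not vLLM's actual implementation:

```python
def masked_assign(embeds, mask, values):
    """Mimic tensor[mask, :] = values: row counts must match exactly."""
    selected = [i for i, m in enumerate(mask) if m]
    if len(values) != len(selected):
        raise RuntimeError(
            f"shape mismatch: value tensor of shape [{len(values)}, 3584] "
            f"cannot be broadcast to indexing result of shape "
            f"[{len(selected)}, 3584]")
    for i, row in zip(selected, values):
        embeds[i] = row
    return embeds

seq_len = 1000
mask = [i < 322 for i in range(seq_len)]   # 322 image-placeholder slots
one_request = [[0.0]] * 322                # image rows for a single request
two_requests = [[0.0]] * 644               # concatenated rows, 2 * 322

masked_assign([[0.0]] * seq_len, mask, one_request)   # succeeds
try:
    masked_assign([[0.0]] * seq_len, mask, two_requests)
except RuntimeError as e:
    print(e)  # reproduces the message from the trace
```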