Commit 8bddc36

DarkLight1337 and FerdinandZhong authored and committed
[Model] VLM2Vec, the first multimodal embedding model in vLLM (vllm-project#9303)
Signed-off-by: qishuai <ferdinandzhong@gmail.com>
1 parent c2121f2 · commit 8bddc36

File tree

16 files changed: +465 −261 lines

docs/source/models/supported_models.rst (+53 −26)
@@ -3,7 +3,7 @@
Supported Models
================

-vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
+vLLM supports a variety of generative Transformer models in `HuggingFace (HF) Transformers <https://huggingface.co/models>`_.
The following is the list of model architectures that are currently supported by vLLM.
Alongside each architecture, we include some popular models that use it.
@@ -19,7 +19,7 @@ Text Generation

  * - Architecture
    - Models
-    - Example HuggingFace Models
+    - Example HF Models
    - :ref:`LoRA <lora>`
    - :ref:`PP <distributed_serving>`
  * - :code:`AquilaForCausalLM`
@@ -280,7 +280,7 @@ Text Embedding

  * - Architecture
    - Models
-    - Example HuggingFace Models
+    - Example HF Models
    - :ref:`LoRA <lora>`
    - :ref:`PP <distributed_serving>`
  * - :code:`Gemma2Model`
@@ -303,7 +303,7 @@ Reward Modeling

  * - Architecture
    - Models
-    - Example HuggingFace Models
+    - Example HF Models
    - :ref:`LoRA <lora>`
    - :ref:`PP <distributed_serving>`
  * - :code:`Qwen2ForRewardModel`
@@ -316,86 +316,93 @@ Reward Modeling
As an interim measure, these models are supported via Embeddings API. See `this RFC <https://github.com/vllm-project/vllm/issues/8967>`_ for upcoming changes.

Multimodal Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following modalities are supported depending on the model:
+
+- **T**\ ext
+- **I**\ mage
+- **V**\ ideo
+- **A**\ udio

.. _supported_vlms:

Text Generation
---------------

.. list-table::
-  :widths: 25 25 25 25 5 5
+  :widths: 25 25 15 25 5 5
  :header-rows: 1

  * - Architecture
    - Models
-    - Modalities
-    - Example HuggingFace Models
+    - Inputs
+    - Example HF Models
    - :ref:`LoRA <lora>`
    - :ref:`PP <distributed_serving>`
  * - :code:`Blip2ForConditionalGeneration`
    - BLIP-2
-    - Image\ :sup:`E`
+    - T + I\ :sup:`E`
    - :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
    -
    - ✅︎
  * - :code:`ChameleonForConditionalGeneration`
    - Chameleon
-    - Image
+    - T + I
    - :code:`facebook/chameleon-7b` etc.
    -
    - ✅︎
  * - :code:`FuyuForCausalLM`
    - Fuyu
-    - Image
+    - T + I
    - :code:`adept/fuyu-8b` etc.
    -
    - ✅︎
  * - :code:`ChatGLMModel`
    - GLM-4V
-    - Image
+    - T + I
    - :code:`THUDM/glm-4v-9b` etc.
    -
    - ✅︎
  * - :code:`InternVLChatModel`
    - InternVL2
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
    - :code:`OpenGVLab/InternVL2-4B`, :code:`OpenGVLab/InternVL2-8B`, etc.
    -
    - ✅︎
  * - :code:`LlavaForConditionalGeneration`
    - LLaVA-1.5
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
    - :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
    -
    - ✅︎
  * - :code:`LlavaNextForConditionalGeneration`
    - LLaVA-NeXT
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
    - :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
    -
    - ✅︎
  * - :code:`LlavaNextVideoForConditionalGeneration`
    - LLaVA-NeXT-Video
-    - Video
+    - T + V
    - :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
    -
    - ✅︎
  * - :code:`LlavaOnevisionForConditionalGeneration`
    - LLaVA-Onevision
-    - Image\ :sup:`+` / Video
+    - T + I\ :sup:`+` + V
    - :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
    -
    - ✅︎
  * - :code:`MiniCPMV`
    - MiniCPM-V
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
    - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
    - ✅︎
    - ✅︎
  * - :code:`MllamaForConditionalGeneration`
    - Llama 3.2
-    - Image
+    - T + I
    - :code:`meta-llama/Llama-3.2-90B-Vision-Instruct`, :code:`meta-llama/Llama-3.2-11B-Vision`, etc.
    -
    -
@@ -407,43 +414,43 @@ Text Generation
    - ✅︎
  * - :code:`NVLM_D_Model`
    - NVLM-D 1.0
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
    - :code:`nvidia/NVLM-D-72B`, etc.
    -
    - ✅︎
  * - :code:`PaliGemmaForConditionalGeneration`
    - PaliGemma
-    - Image\ :sup:`E`
+    - T + I\ :sup:`E`
    - :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, etc.
    -
    - ✅︎
  * - :code:`Phi3VForCausalLM`
    - Phi-3-Vision, Phi-3.5-Vision
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
    - :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct` etc.
    -
    - ✅︎
  * - :code:`PixtralForConditionalGeneration`
    - Pixtral
-    - Image\ :sup:`+`
+    - T + I\ :sup:`+`
    - :code:`mistralai/Pixtral-12B-2409`
    -
    - ✅︎
  * - :code:`QWenLMHeadModel`
    - Qwen-VL
-    - Image\ :sup:`E+`
+    - T + I\ :sup:`E+`
    - :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc.
    -
    - ✅︎
  * - :code:`Qwen2VLForConditionalGeneration`
    - Qwen2-VL
-    - Image\ :sup:`E+` / Video\ :sup:`+`
+    - T + I\ :sup:`E+` + V\ :sup:`+`
    - :code:`Qwen/Qwen2-VL-2B-Instruct`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
    -
    - ✅︎
  * - :code:`UltravoxModel`
    - Ultravox
-    - Audio\ :sup:`E+`
+    - T + A\ :sup:`E+`
    - :code:`fixie-ai/ultravox-v0_3`
    -
    - ✅︎
@@ -455,6 +462,26 @@ Text Generation
For :code:`openbmb/MiniCPM-V-2`, the official repo doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630

+Multimodal Embedding
+--------------------
+
+.. list-table::
+  :widths: 25 25 15 25 5 5
+  :header-rows: 1
+
+  * - Architecture
+    - Models
+    - Inputs
+    - Example HF Models
+    - :ref:`LoRA <lora>`
+    - :ref:`PP <distributed_serving>`
+  * - :code:`Phi3VForCausalLM`
+    - Phi-3-Vision-based
+    - T + I
+    - :code:`TIGER-Lab/VLM2Vec-Full`
+    - 🚧
+    - ✅︎
+
----

If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
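
For reference, the "T + I" entries above map onto vLLM's offline API, where a text prompt and an image are passed together in a single request. Below is a minimal sketch (not part of this commit) using LLaVA-1.5 from the Text Generation table; the prompt template, max_model_len, and sampling settings are illustrative assumptions rather than values taken from this diff.

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# Sample image bundled with vLLM's assets.
image = ImageAsset("cherry_blossom").pil_image.convert("RGB")

# LLaVA-1.5 accepts "T + I": a text prompt plus an image via multi_modal_data.
llm = LLM(model="llava-hf/llava-1.5-7b-hf", max_model_len=4096)

prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)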
New file (+21 lines):

@@ -0,0 +1,21 @@
from vllm import LLM
from vllm.assets.image import ImageAsset

image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
prompt = "<|image_1|> Represent the given image with the following question: What is in the image"  # noqa: E501

# Create an LLM.
llm = LLM(
    model="TIGER-Lab/VLM2Vec-Full",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
    mm_processor_kwargs={"num_crops": 16},
)

# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = llm.encode({"prompt": prompt, "multi_modal_data": {"image": image}})

# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # list of 3072 floats
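
As a follow-up to this example (not part of the commit), the returned image embedding can be compared against a text-only query embedded with the same model. The sketch below assumes the model also accepts prompts without image input and reuses the llm and outputs objects defined above; the query string is purely illustrative.

import math

# Hypothetical text-only query embedded with the same model instance.
text_outputs = llm.encode("Represent the given sentence: cherry blossoms in spring")
text_emb = text_outputs[0].outputs.embedding
image_emb = outputs[0].outputs.embedding

# Cosine similarity between the image embedding and the text embedding.
dot = sum(a * b for a, b in zip(image_emb, text_emb))
norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
print(f"cosine similarity: {dot / norm:.4f}")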
