[Gaudi][Model] Qwen2.5-vl #870
base: habana_main
Conversation
Force-pushed from 0a20064 to ff97945
I did a clean clone of this branch and ran the Qwen2.5-VL pytests:
$ pip install -r requirements-hpu.txt; pip install -r requirements-hpu-qwen2_5_vl.txt; python setup.py develop
$ VLLM_SKIP_WARMUP=true pytest tests/models/decoder_only/vision_language/test_models.py -s -v -k "[qwen2_5"
INFO 02-27 17:31:46 __init__.py:199] Automatically detected platform hpu.
================================================================================================================================================ test session starts =================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /devops/sgohari/tests/jira/hs-4927/pr/vllm-fork
configfile: pyproject.toml
plugins: anyio-4.8.0, typeguard-4.3.0
collected 185 items / 173 deselected / 12 selected
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[qwen2_5_vl-test_case28] INFO 02-27 17:31:59 config.py:548] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-27 17:31:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev5293+gff97945) with config: model='Qwen/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 02-27 17:32:01 utils.py:2359] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter,scheduler_config not implemented in <vllm.worker.hpu_worker.HPUWorker object at 0x7fba9599ba90>
WARNING 02-27 17:32:01 hpu.py:84] Pin memory is not supported on HPU.
INFO 02-27 17:32:01 hpu.py:35] Using HPUAttention backend.
VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
VLLM_PROMPT_BS_BUCKET_MAX=2 (default:2)
VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
VLLM_DECODE_BS_BUCKET_MAX=2 (default:2)
VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_MAX=128 (default:128)
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 2], seq:[128, 128, 1024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 2], block:[128, 128, 128]
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 20
CPU RAM : 113320300 KB
------------------------------------------------------------------------------
INFO 02-27 17:32:05 config.py:2992] cudagraph sizes specified by model runner [] is overridden by config []
Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
INFO 02-27 17:32:06 loader.py:423] Loading weights on hpu...
INFO 02-27 17:32:06 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:31<00:31, 31.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 40.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 39.37s/it]
...
================================================================ 12 passed, 173 deselected, 59 warnings in 1558.82s (0:25:58) ================================================================
I will do more testing with image, video, and mixed prompts next.
Force-pushed from d4a8289 to 744875c
Thanks for the review, @michalkuligowski
@dsocek Adding Daniel to take a look here too.
@libinta FYI.
@michalkuligowski any more suggestions? I just synced with main and rebased the branch.
@@ -0,0 +1 @@
+transformers @ git+https://github.com/huggingface/transformers.git@6b550462139655d488d4c663086a63e98713c6b9
Let's not add a new requirements file per model. Why is a specific SHA required? I believe this should be documented in the README instead.
Qwen2.5-VL is officially supported starting from Transformers v4.49.0. However, our vllm-fork is currently out of date and supports only v4.48.3, which doesn't include Qwen2.5-VL, while the fork's code can't yet use 4.49 either.
File "/root/tf/qwen/vllm-fork-w2/vllm/model_executor/models/registry.py", line 370, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected. Please check the logs for more details.
For now, pinning this specific commit makes Qwen2.5-VL work without changing too much. Once we update the vllm-fork to the latest and Transformers to 4.49, all of this can go away.
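For reference, a quick way to try that pinned commit locally; this is just a sketch that reuses the same commit hash as the requirements file above:

```bash
# Install the pinned transformers commit directly (same ref as in the
# requirements-hpu-qwen2_5_vl.txt line shown above).
pip install "transformers @ git+https://github.com/huggingface/transformers.git@6b550462139655d488d4c663086a63e98713c6b9"
```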
@michalkuligowski FYI: we raised this error on the upstream vLLM repo, and they mentioned it's because of the vllm-fork version. vllm-project#12932 (comment)
@@ -278,7 +278,7 @@ def check_available_online(
     "Qwen2AudioForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2-Audio-7B-Instruct"),  # noqa: E501
     "Qwen2VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2-VL-2B-Instruct"),  # noqa: E501
     "Qwen2_5_VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2.5-VL-3B-Instruct",  # noqa: E501
-                                                          min_transformers_version="4.49"),  # noqa: E501
+                                                          min_transformers_version="4.48.9"),  # noqa: E501
Why is this decreased?
Please see the comment above about the Transformers version.
@@ -71,6 +71,7 @@
 from .vision import get_vit_attn_backend

 logger = init_logger(__name__)
+is_hpu = current_platform.is_hpu()
This is used in one place here, so I don't think you need to save it in a variable; that keeps the changes to the model file as small as possible.
We also need this for FusedSDPA, will update the code.
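For illustration, a minimal sketch of the two options being discussed. Only `is_hpu = current_platform.is_hpu()` comes from the diff above; the helper functions and the FusedSDPA selection logic are hypothetical placeholders, not code from this PR:

```python
from vllm.platforms import current_platform

# Cached once at module import, as in the diff above, because it is consulted
# in more than one place (e.g. when deciding whether FusedSDPA can be used).
is_hpu = current_platform.is_hpu()


def select_attn_backend() -> str:
    # Hypothetical helper: prefer the fused SDPA path on HPU, otherwise fall
    # back to the default attention implementation.
    return "fsdpa" if is_hpu else "default"


def is_hpu_inline() -> bool:
    # The reviewer's alternative: call the platform check inline at the single
    # point of use instead of keeping a module-level variable.
    return current_platform.is_hpu()
```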
vllm/worker/hpu_model_runner.py (Outdated)
@@ -223,6 +224,36 @@ def find_rope_layer(parent, path):
     return path_to_rope


+def make_mrope_positions_tensor_with_pad( \
Please move to utils.py
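For context, a rough sketch of what such a padding helper could look like once moved to utils.py. The signature, argument layout, and output shape here are assumptions for illustration, not the PR's actual implementation:

```python
from typing import List

import torch


def make_mrope_positions_tensor_with_pad(
        input_mrope_positions: List[List[List[int]]],  # per-sequence [3, seq_len]
        max_len: int,
        pad_value: int = 0) -> torch.Tensor:
    """Right-pad each of the three M-RoPE position rows (temporal, height,
    width) of every sequence to max_len and stack them into one tensor."""
    padded = [[row + [pad_value] * (max_len - len(row)) for row in seq]
              for seq in input_mrope_positions]
    # Shape: [num_seqs, 3, max_len]; the real helper may flatten differently.
    return torch.tensor(padded, dtype=torch.long, device='cpu')
```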
vllm/worker/hpu_model_runner.py (Outdated)
    dtype=torch.long,
    device='cpu')
if self.model_is_mrope:
    input_positions_tensor = \
Let's not add another variable. Also, can this if-else clause be simplified further?
vllm/worker/hpu_model_runner.py (Outdated)
if self.model_is_mrope:
    input_positions = None  # type: ignore
else:
    input_mrope_positions = None  # type: ignore

input_positions = torch.tensor(input_positions
                               or input_mrope_positions,
Can this be simplified to `input_mrope_positions if self.model_is_mrope else input_positions` in the torch.tensor call?
We mostly follow the CPU code, but I agree that it could be simplified. Just rebased and applied these changes.
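A minimal sketch of the suggested simplification, pulled out of the model runner into a standalone function for illustration. The variable names come from the diff above; the function wrapper itself is hypothetical:

```python
from typing import List, Optional

import torch


def build_positions_tensor(model_is_mrope: bool,
                           input_positions: Optional[List[int]],
                           input_mrope_positions: Optional[List[List[int]]]
                           ) -> torch.Tensor:
    # Pick the source list in one expression instead of nulling one variable
    # and relying on `or` inside the torch.tensor call.
    chosen = input_mrope_positions if model_is_mrope else input_positions
    return torch.tensor(chosen, dtype=torch.long, device='cpu')
```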
Fails in the rotary_embed layer in the view op.
Bypassing it with alternative PyTorch code; otherwise it was editing image_grid_thw to (0, 0, 0), etc.
Runs if we use enforce_eager: llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", enforce_eager=True)
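A minimal end-to-end sketch of that eager-mode workaround. Only the LLM call with enforce_eager=True comes from the comment above; the prompt and sampling parameters are illustrative:

```python
from vllm import LLM, SamplingParams

# Force eager execution to avoid the HPU-graph path that triggers the
# rotary_embed/view failure described above.
llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", enforce_eager=True)

outputs = llm.generate(["Describe a sunset over the ocean in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```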
Co-authored-by: Mohit Deopujari mohit.deopujari@intel.com
Co-authored-by: Jimin Ha jimin.ha@intel.com
Co-authored-by: Pallavi Jaini pallavi.jaini@intel.com
Co-authored-by: Deepak Narayana deepak.narayana@intel.com
Co-authored-by: Sayantan Sarkar sayantan.sarkar@intel.com
Co-authored-by: Gustavo Malkomes gustavo.malkomes@intel.com
Force-pushed from 3e2f0da to 5baa1ed
Initial enablement of Qwen2.5-VL for Gaudi HPU.
Based on vllm-project#12604; it FIXES: vllm-project#12486 and vllm-project#12532.
Introduces `HPU_DISABLE_TENSOR_CACHE` to set `disable_tensor_cache` in `htorch.hpu.wrap_in_hpu_graph`. It keeps the default value as `True` for all models, but we set it to `False` for MRoPE models such as Qwen2.5-VL.
Note: set `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=false` to run Qwen models; see README_GAUDI.
To install vLLM with Qwen2.5-VL enabled:
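A sketch of the install steps, mirroring the commands from the test log earlier in this thread (run inside a checkout of this branch):

```bash
pip install -r requirements-hpu.txt
pip install -r requirements-hpu-qwen2_5_vl.txt
python setup.py develop

# As noted above, needed to run Qwen models:
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=false
```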
--
Co-authored-by: Mohit Deopujari mohit.deopujari@intel.com
Co-authored-by: Jimin Ha jimin.ha@intel.com
Co-authored-by: Pallavi Jaini pallavi.jaini@intel.com
Co-authored-by: Deepak Narayana deepak.narayana@intel.com
Co-authored-by: Sayantan Sarkar sayantan.sarkar@intel.com
Co-authored-by: Iman Gohari s.m.iman.gohari@intel.com