
[Gaudi][Model] Qwen2.5-vl #870

Open

malkomes wants to merge 34 commits into base: habana_main

Conversation

@malkomes commented Feb 26, 2025

Initial enablement of Qwen2.5-VL for Gaudi HPU.
Based on vllm-project#12604. Fixes vllm-project#12486 and vllm-project#12532.

  • Introduces the flag HPU_DISABLE_TENSOR_CACHE to set disable_tensor_cache in htorch.hpu.wrap_in_hpu_graph. The default stays True for all models, but we set it to False for MRoPE models such as Qwen2.5-VL (see the sketch after this list).
  • Computes MRoPE positions and deltas in the HPU model runner.
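
A minimal sketch of how the flag could be wired in, assuming it is read from the environment and that htorch.hpu.wrap_in_hpu_graph accepts a disable_tensor_cache keyword; the helper name and variable names here are illustrative, not the PR's actual code:

    import os
    import habana_frameworks.torch as htorch
    import torch.nn as nn

    def maybe_wrap_in_hpu_graph(model: nn.Module) -> nn.Module:
        # Read the flag (default True). The note below uses the name
        # PT_HPUGRAPH_DISABLE_TENSOR_CACHE, so that is what is read here.
        disable = os.environ.get('PT_HPUGRAPH_DISABLE_TENSOR_CACHE',
                                 'true').lower() in ('1', 'true')
        # Wrap the model for HPU graph capture, forwarding the flag. For MRoPE
        # models such as Qwen2.5-VL the flag is expected to be False.
        return htorch.hpu.wrap_in_hpu_graph(model, disable_tensor_cache=disable)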

Note

Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE=false to run Qwen models; see README_GAUDI.
To install vLLM with Qwen2.5-VL enabled:

pip install -r requirements-hpu.txt; pip install -r requirements-hpu-qwen2_5_vl.txt ; python setup.py develop
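
For example, combining the environment variable from the note above with the test command shown later in this thread, the Qwen2.5-VL test subset could be run like this (a sketch, not an official command):

    PT_HPUGRAPH_DISABLE_TENSOR_CACHE=false VLLM_SKIP_WARMUP=true \
        pytest tests/models/decoder_only/vision_language/test_models.py -v -k "[qwen2_5"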

--
Co-authored-by: Mohit Deopujari mohit.deopujari@intel.com
Co-authored-by: Jimin Ha jimin.ha@intel.com
Co-authored-by: Pallavi Jaini pallavi.jaini@intel.com
Co-authored-by: Deepak Narayana deepak.narayana@intel.com
Co-authored-by: Sayantan Sarkar sayantan.sarkar@intel.com
Co-authored-by: Iman Gohari s.m.iman.gohari@intel.com

@imangohari1

I have clean-cloned this branch and tested the Qwen2.5-VL pytests.
All 12 tests pass; details below.

$ pip install -r requirements-hpu.txt; pip install -r requirements-hpu-qwen2_5_vl.txt ; python setup.py develop
$ VLLM_SKIP_WARMUP=true pytest tests/models/decoder_only/vision_language/test_models.py -s -v -k "[qwen2_5"
INFO 02-27 17:31:46 __init__.py:199] Automatically detected platform hpu.
================================================================================================================================================ test session starts =================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /devops/sgohari/tests/jira/hs-4927/pr/vllm-fork
configfile: pyproject.toml
plugins: anyio-4.8.0, typeguard-4.3.0
collected 185 items / 173 deselected / 12 selected                                                                                                                                                                                                                                                                   

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[qwen2_5_vl-test_case28] INFO 02-27 17:31:59 config.py:548] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-27 17:31:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev5293+gff97945) with config: model='Qwen/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
WARNING 02-27 17:32:01 utils.py:2359] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter,scheduler_config not implemented in <vllm.worker.hpu_worker.HPUWorker object at 0x7fba9599ba90>
WARNING 02-27 17:32:01 hpu.py:84] Pin memory is not supported on HPU.
INFO 02-27 17:32:01 hpu.py:35] Using HPUAttention backend.
VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
VLLM_PROMPT_BS_BUCKET_MAX=2 (default:2)
VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
VLLM_DECODE_BS_BUCKET_MAX=2 (default:2)
VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_MAX=128 (default:128)
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 2], seq:[128, 128, 1024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 2], block:[128, 128, 128]
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 20
CPU RAM       : 113320300 KB
------------------------------------------------------------------------------
INFO 02-27 17:32:05 config.py:2992] cudagraph sizes specified by model runner [] is overridden by config []
Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
INFO 02-27 17:32:06 loader.py:423] Loading weights on hpu...
INFO 02-27 17:32:06 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:31<00:31, 31.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 40.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 39.37s/it]
.
.
.
.
.
============================================================================================================================ 12 passed, 173 deselected, 59 warnings in 1558.82s (0:25:58) ==============================================

I will do more testing with image, video and mixed prompts next.
CC: @malkomes @jiminha

@malkomes (Author)

Thanks for the review, @michalkuligowski.
I think I addressed your comments; let me know if I missed anything.

@imangohari1

@dsocek Adding Daniel to take a look here too.

@jiminha commented Mar 3, 2025

@libinta FYI,

@malkomes added the New Model label (Issue or PR to enable a new model) on Mar 4, 2025
@malkomes (Author) commented Mar 4, 2025

@michalkuligowski any more suggestions? I just synced with main and rebased the branch.

@@ -0,0 +1 @@
transformers @ git+https://github.com/huggingface/transformers.git@6b550462139655d488d4c663086a63e98713c6b9

Let's not add a new requirements file per model. Why is a specific SHA required? I believe this should be added to the README instead.

Qwen2.5-VL is officially supported starting with Transformers v4.49.0. However, our vllm-fork is currently out of date and supports only v4.48.3, which does not include Qwen2.5-VL, and the fork code is too far behind to move to 4.49:

 File "/root/tf/qwen/vllm-fork-w2/vllm/model_executor/models/registry.py", line 370, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected. Please check the logs for more details.

For now, this specific commit works for Qwen2.5-VL without changing too much. Once we update the vllm-fork to the latest and Transformers to 4.49, all of this can go away.
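
If the pin moves to the README as suggested, the equivalent install step might look something like this (a sketch; the commit SHA is the one from the requirements file above):

    pip install git+https://github.com/huggingface/transformers.git@6b550462139655d488d4c663086a63e98713c6b9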

@michalkuligowski FYI: we raised this error on the upstream vLLM repo, and they mentioned it's because of the vllm-fork version. vllm-project#12932 (comment)

@@ -278,7 +278,7 @@ def check_available_online(
"Qwen2AudioForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2-Audio-7B-Instruct"), # noqa: E501
"Qwen2VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2-VL-2B-Instruct"), # noqa: E501
"Qwen2_5_VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2.5-VL-3B-Instruct", # noqa: E501
-    min_transformers_version="4.49"), # noqa: E501
+    min_transformers_version="4.48.9"), # noqa: E501


Why is this decreased?


Please see the comment above regarding the Transformers version.

@@ -71,6 +71,7 @@
from .vision import get_vit_attn_backend

logger = init_logger(__name__)
is_hpu = current_platform.is_hpu()

This is used in only one place here, so I don't think you need to save it in a variable; that keeps the changes to the model file as small as possible.

We also need this for FusedSDPA; will update the code.

@@ -223,6 +224,36 @@ def find_rope_layer(parent, path):
return path_to_rope


def make_mrope_positions_tensor_with_pad( \

Please move this to utils.py.
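
For reference, a rough sketch of what such a padding helper could look like once moved to utils.py (hypothetical signature and shapes inferred from the diff, not the PR's actual implementation):

    from typing import List

    import torch

    def make_mrope_positions_tensor_with_pad(
            input_positions: List[List[List[int]]],  # per sequence: [3, seq_len]
            max_len: int,
            pad: int = 0) -> torch.Tensor:
        """Pad each sequence's 3-axis MRoPE positions to max_len and stack them."""
        padded = []
        for pos in input_positions:
            seq_len = len(pos[0])
            # pad each of the three MRoPE axes (temporal, height, width)
            padded.append([axis + [pad] * (max_len - seq_len) for axis in pos])
        # resulting shape: [num_seqs, 3, max_len]
        return torch.tensor(padded, dtype=torch.long, device='cpu')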

dtype=torch.long,
device='cpu')
if self.model_is_mrope:
input_positions_tensor = \

Let's not add another variable. Also, can this if-else clause be simplified further?

Comment on lines 1375 to 1389
if self.model_is_mrope:
input_positions = None # type: ignore
else:
input_mrope_positions = None # type: ignore

input_positions = torch.tensor(input_positions
or input_mrope_positions,

Can this be simplified to input_mrope_positions if self.model_is_mrope else input_positions in the torch.tensor call?
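
A minimal sketch of that suggestion, reusing the variable names from the diff above:

    # inside the HPU model runner's prepare step (sketch)
    input_positions = torch.tensor(
        input_mrope_positions if self.model_is_mrope else input_positions,
        dtype=torch.long,
        device='cpu')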

@malkomes (Author) commented Mar 10, 2025

We mostly follow the CPU code, but I agree it could be simplified. Just rebased and applied these changes.

Labels: New Model (Issue or PR to enable a new model)
Projects: None yet

Successfully merging this pull request may close these issues:

[New Model]: Qwen2.5-VL

6 participants