Hi, I noticed that increasing the batch size, in training or inference, results in a non-linear increase in computation time (we would expect a linear increase to be an upper bound in some sense).
I first saw it in my own environment, then built another one, and in the end also tried it in Colab. Here is the Colab code:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import numpy as np
import time
import torch
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="float16", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", torch_dtype="float16")
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "resized_height": 256, "resized_width": 256, "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "image", "resized_height": 256, "resized_width": 256, "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
#messages = [messages1,messages1,messages1,messages1]
#messages = [messages1,messages1,messages1]
#messages = [messages1,messages1]
messages = [messages1]
# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
avg_list = []
for attempt in range(100):
    with torch.no_grad():
        # Synchronize before and after so the timer covers only the GPU work of this pass
        torch.cuda.synchronize()
        start = time.perf_counter()
        outputs = model(**inputs)  # single forward pass, not generate()
        torch.cuda.synchronize()
        end = time.perf_counter()
    avg_list.append(end - start)
    print(f"dotproduct time: {end - start}")
print("Avg of 100:", np.mean(avg_list))
Since Colab can be unpredictable in how it allocates resources, I did 10 runs of 100 iterations each and removed outliers. The results were:
Batch=4: 1.4s
Batch=3: 0.88s
Batch=2: 0.51s
Batch=1: 0.19s
We can see the non-linear increase. Furthermore, when the image size increases, the growth is much sharper.
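To make the growth concrete, here is a small sketch that turns the four timings above into slowdown factors relative to batch size 1 and sets them next to linear and quadratic scaling. The numbers are the ones reported in this issue; the function name `slowdown_table` is only illustrative.

```python
# Timings reported in this issue (seconds per forward pass), keyed by batch size.
measured = {1: 0.19, 2: 0.51, 3: 0.88, 4: 1.4}

def slowdown_table(timings):
    """Return rows of (batch, measured ratio, linear ratio, quadratic ratio)."""
    base = timings[1]
    rows = []
    for batch in sorted(timings):
        rows.append((batch, timings[batch] / base, float(batch), float(batch ** 2)))
    return rows

for batch, ratio, linear, quadratic in slowdown_table(measured):
    print(f"batch={batch}: measured x{ratio:.2f}, "
          f"linear x{linear:.0f}, quadratic x{quadratic:.0f}")
```

For every batch size above 1, the measured factor sits above the linear one and below the quadratic one, which is what you would expect if part of the forward pass scales with the square of the total sequence length.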
After debugging, I found that it happens in the visual module:
self.visual = Qwen2VisionTransformerPretrainedModel._from_config(config.vision_config)
Specifically, in the scaled dot-product attention call:
attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
This is probably because you are treating the batch of images as a single sequence and handling it with an attention mask. But sequence length is the biggest pain point in transformers, so why is this the implementation?
Am I missing something?
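A rough way to see the cost of this design is the pure-Python sketch below. It does not use the real model code. It assumes the mask over the concatenated sequence is block-diagonal (each image attends only to its own patches) and that a dense kernel still evaluates every entry of the full score matrix. The names `cu_seqlens` and `patches_per_image` and the tiny patch count of 8 are illustrative assumptions.

```python
def block_diagonal_mask(cu_seqlens):
    """Boolean mask over the concatenated sequence: True where attention is allowed."""
    total = cu_seqlens[-1]
    mask = [[False] * total for _ in range(total)]
    # Each consecutive pair of boundaries marks one image's patch range.
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask

def score_counts(num_images, patches_per_image):
    """Return (entries a dense kernel computes, entries the mask actually keeps)."""
    cu_seqlens = [i * patches_per_image for i in range(num_images + 1)]
    mask = block_diagonal_mask(cu_seqlens)
    computed = len(mask) * len(mask)        # full (total x total) score matrix
    useful = sum(sum(row) for row in mask)  # only the diagonal blocks
    return computed, useful

for num_images in (1, 2, 3, 4):
    computed, useful = score_counts(num_images, patches_per_image=8)
    print(f"images={num_images}: computed={computed}, useful={useful}")
```

Under these assumptions the computed entries grow with the square of the number of images while the useful entries grow only linearly, so most of the work at larger batch sizes is spent on positions the mask discards.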
I encountered the same issue and noticed that the VisionTransformer in Qwen2-VL treats batched images as a single sequence. This leads to significant increases in memory usage during training, especially with larger batch sizes and high-resolution images. To address this, I modified the VisionTransformer in Qwen2-VL to use batch-based processing logic. Everything seems to work normally now. I also tested it using your script, and the inference time looks much more reasonable. I hope this solution helps!