Inquiry About PatchMerger Implementation in Image Embedding #633

Closed
HeimingX opened this issue Jan 5, 2025 · 2 comments

@HeimingX commented Jan 5, 2025

Dear Qwen2-VL Team,

First, I want to thank you for your remarkable contributions to the open-source community with Qwen2-VL. The model’s outstanding performance and innovative design are truly impressive.

While exploring the implementation, I noticed that the PatchMerger module seems to reduce the token count by merging adjacent patches according to their order in the flattened patchify sequence:

import torch
from torch import nn
from torch.nn import LayerNorm


class PatchMerger(nn.Module):
    def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2) -> None:
        super().__init__()
        # Each output token is built from spatial_merge_size**2 input tokens.
        self.hidden_size = context_dim * (spatial_merge_size**2)
        self.ln_q = LayerNorm(context_dim, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden_size, self.hidden_size),
            nn.GELU(),
            nn.Linear(self.hidden_size, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each token, then fold every spatial_merge_size**2
        # consecutive tokens into one vector before projecting to `dim`.
        x = self.mlp(self.ln_q(x).view(-1, self.hidden_size))
        return x

https://github.com/huggingface/transformers/blob/e5fd865ebae062b7cf03a81b8c6affeb39f30bec/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L310
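
To make the merge ratio concrete, here is a small usage sketch of the module above; the hidden sizes are illustrative placeholders, not necessarily the real Qwen2-VL configuration:

import torch

# Assumed sizes for illustration only: 1280-d vision tokens merged 4-to-1
# and projected to a 3584-d language-model embedding space.
merger = PatchMerger(dim=3584, context_dim=1280, spatial_merge_size=2)

tokens = torch.randn(16, 1280)   # 16 patch embeddings from the vision encoder
merged = merger(tokens)          # view(-1, 4 * 1280) folds 4 tokens into 1
print(merged.shape)              # torch.Size([4, 3584])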

However, this approach doesn’t seem to account for the 2D spatial arrangement of patches in the original image. For example, the last patch of one row might be merged with the first patch of the next row.
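
As a toy illustration of the worry (assuming naive raster-order flattening with no reordering), on a grid six patches wide, a group of four consecutive tokens can straddle a row boundary:

import numpy as np

# Toy 4x6 patch grid flattened in raster order, grouped 4 tokens at a time.
grid_h, grid_w = 4, 6
groups = np.arange(grid_h * grid_w).reshape(-1, 4)

print(groups[1])  # [4 5 6 7]: tokens 4-5 end row 0, tokens 6-7 start row 1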

I’m curious if this design choice might lead to potential issues where merged patches could represent entirely different semantics, given their distant spatial relationship in the image. Would this have any impact on the model’s ability to encode accurate image representations?

I’d appreciate it if you could share the reasoning behind this implementation and whether alternative approaches preserving 2D spatial locality were considered.

Thank you again for this incredible project, and I look forward to your insights!

@chenghuaWang

If I'm not mistaken, the 2D case you mentioned is already handled during preprocessing: https://github.com/huggingface/transformers/blob/15bd3e61f8d3680ca472c9314ad07584d20f7b81/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L311
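
For what it's worth, here is a minimal sketch (a toy reimplementation, not the actual transformers code) of what that preprocessing reordering achieves: after the reshape/transpose in the image processor, every group of spatial_merge_size**2 consecutive tokens covers one 2x2 spatial block, so the PatchMerger's view(-1, hidden_size) merges true spatial neighbors.

import numpy as np

# Toy 4x4 grid of patch indices in raster (row-major) order.
grid_h, grid_w, merge_size = 4, 4, 2
raster = np.arange(grid_h * grid_w).reshape(grid_h, grid_w)

# Group rows and columns into merge_size blocks, then flatten so each
# block's tokens become consecutive in the sequence.
blocks = raster.reshape(grid_h // merge_size, merge_size,
                        grid_w // merge_size, merge_size)
sequence = blocks.transpose(0, 2, 1, 3).reshape(-1)

print(sequence[:8])  # [0 1 4 5 2 3 6 7]: {0, 1, 4, 5} is the top-left 2x2 block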

@HeimingX (Author)

Thanks a lot for resolving my concerns!
