Inquiry About PatchMerger Implementation in Image Embedding #633

Closed
HeimingX opened this issue Jan 5, 2025 · 2 comments

@HeimingX commented Jan 5, 2025

Dear Qwen2-VL Team,

First, I want to thank you for your remarkable contributions to the open-source community with Qwen2-VL. The model’s outstanding performance and innovative design are truly impressive.

While exploring the implementation, I noticed that the PatchMerger module seems to reduce the token count by merging adjacent patches according to their order in the flattened patchify sequence:

import torch
from torch import nn
from torch.nn import LayerNorm


class PatchMerger(nn.Module):
    def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2) -> None:
        super().__init__()
        # Each output token is built from spatial_merge_size**2 input tokens.
        self.hidden_size = context_dim * (spatial_merge_size**2)
        self.ln_q = LayerNorm(context_dim, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden_size, self.hidden_size),
            nn.GELU(),
            nn.Linear(self.hidden_size, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each token, then fold every spatial_merge_size**2
        # consecutive tokens into one vector before projecting to `dim`.
        x = self.mlp(self.ln_q(x).view(-1, self.hidden_size))
        return x

https://github.com/huggingface/transformers/blob/e5fd865ebae062b7cf03a81b8c6affeb39f30bec/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L310
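
To make the merge ratio concrete, here is a small usage sketch of the module above; the hidden sizes are illustrative placeholders, not necessarily the real Qwen2-VL configuration:

import torch

# Assumed sizes for illustration only: 1280-d vision tokens merged 4-to-1
# and projected to a 3584-d language-model embedding space.
merger = PatchMerger(dim=3584, context_dim=1280, spatial_merge_size=2)

tokens = torch.randn(16, 1280)   # 16 patch embeddings from the vision encoder
merged = merger(tokens)          # view(-1, 4 * 1280) folds 4 tokens into 1
print(merged.shape)              # torch.Size([4, 3584])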

However, this approach doesn’t seem to account for the 2D spatial arrangement of patches in the original image. For example, the last patch of one row might be merged with the first patch of the next row.
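
As a toy illustration of the worry (assuming naive raster-order flattening with no reordering), on a grid six patches wide, a group of four consecutive tokens can straddle a row boundary:

import numpy as np

# Toy 4x6 patch grid flattened in raster order, grouped 4 tokens at a time.
grid_h, grid_w = 4, 6
groups = np.arange(grid_h * grid_w).reshape(-1, 4)

print(groups[1])  # [4 5 6 7]: tokens 4-5 end row 0, tokens 6-7 start row 1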

I’m curious if this design choice might lead to potential issues where merged patches could represent entirely different semantics, given their distant spatial relationship in the image. Would this have any impact on the model’s ability to encode accurate image representations?

I’d appreciate it if you could share the reasoning behind this implementation and whether alternative approaches preserving 2D spatial locality were considered.

Thank you again for this incredible project, and I look forward to your insights!

@chenghuaWang

If I'm not mistaken, the 2D case you mentioned is already handled during preprocessing: https://github.com/huggingface/transformers/blob/15bd3e61f8d3680ca472c9314ad07584d20f7b81/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L311
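
For what it's worth, here is a minimal sketch (a toy reimplementation, not the actual transformers code) of what that preprocessing reordering achieves: after the reshape/transpose in the image processor, every group of spatial_merge_size**2 consecutive tokens covers one 2x2 spatial block, so the PatchMerger's view(-1, hidden_size) merges true spatial neighbors.

import numpy as np

# Toy 4x4 grid of patch indices in raster (row-major) order.
grid_h, grid_w, merge_size = 4, 4, 2
raster = np.arange(grid_h * grid_w).reshape(grid_h, grid_w)

# Group rows and columns into merge_size blocks, then flatten so each
# block's tokens become consecutive in the sequence.
blocks = raster.reshape(grid_h // merge_size, merge_size,
                        grid_w // merge_size, merge_size)
sequence = blocks.transpose(0, 2, 1, 3).reshape(-1)

print(sequence[:8])  # [0 1 4 5 2 3 6 7]: {0, 1, 4, 5} is the top-left 2x2 block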

@HeimingX (Author)

Thanks a lot for resolving my concerns!
