
about detr code problem (transformer part) #44

Open
darewolf007 opened this issue Mar 27, 2024 · 1 comment
darewolf007 commented Mar 27, 2024

Hello, in your detr code you use the transformer to get an output of shape [bs, hidden_dim, feature_dim]; the code is

self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

the transformer code is

hs = self.decoder(tgt, memory, memory_key_padding_mask=mask, pos=pos_embed, query_pos=query_embed)
hs = hs.transpose(1, 2)
return hs

Based on my understanding, your code only selects the first decoder layer's output as the feature for predicting the action. However, in the original detr code the transformer output is:

hs = self.decoder(tgt, memory, memory_key_padding_mask=mask, pos=pos_embed, query_pos=query_embed)
return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)

The original detr code uses the same feature-processing code:

hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
outputs_class = self.class_embed(hs)

I would like to ask why only the first-layer output is chosen as the feature. Would selecting the seventh layer's output be a better choice? Thank you!
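To make the indexing difference concrete, here is a minimal shape-only sketch (numpy stands in for torch tensors; the layer/batch/query/hidden sizes are illustrative assumptions, not values from the repo). In upstream DETR the forward returns a tuple, so `[0]` picks the whole stack of decoder-layer outputs; in a variant that returns `hs` alone, the same `[0]` instead indexes the first decoder layer:

```python
import numpy as np

# Hypothetical shapes: 7 decoder layers, batch 2, 100 queries, hidden dim 256
num_layers, bs, num_queries, hidden = 7, 2, 100, 256
hs = np.zeros((num_layers, bs, num_queries, hidden))  # stacked intermediate decoder outputs
memory = np.zeros((bs, hidden, 15, 20))               # encoder memory (illustrative shape)

# Upstream DETR: forward returns a tuple, so [0] selects the FULL hs stack
detr_out = (hs, memory)
print(detr_out[0].shape)    # (7, 2, 100, 256) -> all decoder layers

# Variant returning hs alone: the same [0] now selects only decoder layer 0
variant_out = hs
print(variant_out[0].shape)  # (2, 100, 256) -> first layer only
```

So the identical call-site expression `self.transformer(...)[0]` means two different things depending on what the transformer's forward returns.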

@ka2hyeon

I think the authors made a mistake when they cherry-picked the original DETR code.
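If it is indeed a cherry-picking slip, one possible fix (my sketch under that assumption, not an official patch) is to restore upstream DETR's tuple return, so that `self.transformer(...)[0]` again yields the full stack of decoder-layer outputs, from which the deepest layer can be taken with `[-1]`:

```python
import numpy as np

# Illustrative shapes only; none of these values come from the repo
num_layers, bs, num_queries, hidden = 7, 2, 100, 256
hs = np.zeros((num_layers, bs, num_queries, hidden))
memory = np.zeros((bs, hidden, 15, 20))

def forward_fixed(hs, memory):
    # Mirror upstream DETR's return signature: (decoder outputs, encoder memory)
    return hs, memory

out = forward_fixed(hs, memory)[0]
print(out.shape)      # (7, 2, 100, 256): all decoder layers, as in upstream DETR
print(out[-1].shape)  # (2, 100, 256): deepest (7th) layer, per the questioner's suggestion
```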
