
ReLU in residual connections? #7

Closed
ibagur opened this issue Feb 8, 2023 · 4 comments

@ibagur

ibagur commented Feb 8, 2023

Hi,

I am using part of your code for a particular implementation of a transformer architecture I need as part of my master's thesis research in RL. I noticed in the original paper (Parisotto et al., 2019) that they reorder the LayerNorms, placing them at the input of both the multi-head attention and the feed-forward sub-modules. I saw that you also implement this in your code via the config["layer_norm"] setting. However, the paper also mentions, I quote: "Because the layer norm reordering causes a path where two linear layers are applied in sequence, we apply a ReLU activation to each sub-module output before the residual connection (see Appendix C for equations)." In those equations they indeed apply a ReLU to both the multi-head attention and the feed-forward sub-module outputs before performing the residual connection. I did not see that specific step in your code (just the standard residual connection), so I wonder whether there is a particular reason for that, or maybe I am missing something (I'm still quite a novice with these implementations). In any case, congratulations on your great work; it is helping me a lot to understand the inner workings of such architectures. Thanks!
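
For clarity, here is a minimal sketch of what I understand the paper to describe: the layer norm reordering with a ReLU applied to each sub-module output before the residual connection. This is just illustrative PyTorch using a standard nn.MultiheadAttention; the class and parameter names are my own and not the ones from your repository:

import torch
import torch.nn as nn
import torch.nn.functional as F


class PreLNBlockWithReLU(nn.Module):
    # Illustrative pre-LN block: LayerNorm moved to the sub-module inputs,
    # plus a ReLU on each sub-module output before the residual connection
    # (Parisotto et al., 2019, Appendix C).
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )

    def forward(self, x):
        # Multi-head attention sub-module: pre-LN, then ReLU before the skip connection
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)
        x = x + F.relu(attn_out)
        # Feed-forward sub-module: pre-LN, then ReLU before the skip connection
        x = x + F.relu(self.ff(self.norm2(x)))
        return x

The two x + F.relu(...) lines are the step I could not find in your implementation.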

@MarcoMeter
Owner

Thanks for bringing up this issue. GTrXL is still a work in progress. We will investigate this ReLU detail.

@MarcoMeter
Owner

MarcoMeter commented Feb 13, 2023

I took some time to train TrXL with pre-layer norm plus the missing ReLU activations on Mortar Mayhem Grid and Mystery Path Grid. I did not observe any gains in performance.

# Attention sub-module: GRU gate or skip connection
if self.use_gtrxl:
    # Forward GRU gating
    h = self.gate1(query, attention)
else:
    if self.layer_norm == "pre":
        attention = F.relu(attention)
    # Skip connection
    h = attention + query
# Feed-forward sub-module: GRU gate or skip connection
if self.use_gtrxl:
    # Forward GRU gating
    out = self.gate2(h, forward)
else:
    if self.layer_norm == "pre":
        forward = F.relu(forward)
    # Skip connection
    out = forward + h

Here is a rough plot of the training performance in Mystery Path Grid. Each experiment was run for 4 different seeds.
[Plot: mpg_trxl_gtrxl — training performance on Mystery Path Grid]

@MarcoMeter
Owner

I'm closing this issue for now. We decided not to add the ReLU activations, and we don't have the time to investigate this further right now.

@ibagur
Author

ibagur commented Feb 15, 2023

Hi,

Many thanks for your feedback and the very interesting results, much appreciated. I might take a look at this in my own research, but from what I can see it does not seem to have much effect (apart from adding a bit more computation to the model).
