rand = torch.randn(inp.shape, device = x.device) ---> creates a tensor of random values drawn from a standard normal distribution, N(0, 1)
rand[:, 0] = -torch.finfo(rand.dtype).max # first token should not be masked out ---> pins the first <bos> position to the most negative representable value so topk can never select it
num_mask = min(int(seq * self.mask_prob), seq - 1) ---> to mask tokens at a mask_prob rate, we can just pick a fixed mask_prob fraction of the positions, and the count should never exceed seq - 1
indices = rand.topk(num_mask, dim = -1).indices ---> the positions with the num_mask largest random values are chosen to be masked (so, shouldn't this be a uniform distribution according to the paper?)
mask = ~torch.zeros_like(inp).scatter(1, indices, 1.).bool() ---> builds a boolean mask that is False at the chosen positions and True everywhere else
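Putting the lines above together, here is a minimal self-contained sketch of this masking step (the batch/sequence sizes and the mask_prob value are made up for illustration, and inp.device stands in for x.device since x isn't shown here):

```python
import torch

mask_prob = 0.15                              # assumed value, for illustration only
batch, seq = 2, 10                            # assumed toy shapes
inp = torch.randint(0, 1000, (batch, seq))    # stand-in token ids

# random score per position, drawn from a standard normal distribution
rand = torch.randn(inp.shape, device=inp.device)

# pin the first (<bos>) position to the most negative value so topk never picks it
rand[:, 0] = -torch.finfo(rand.dtype).max

# fixed number of positions to mask per sequence, capped at seq - 1
num_mask = min(int(seq * mask_prob), seq - 1)

# the num_mask highest-scoring positions in each row are the ones to mask
indices = rand.topk(num_mask, dim=-1).indices

# boolean mask: False at the masked-out positions, True everywhere else
mask = ~torch.zeros_like(inp).scatter(1, indices, 1.).bool()
print(mask)
```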
I am still trying to understand the code, and I have two questions:
(1) If mask_prob is already asserted to be < 1. earlier in the code, won't seq - 1 always be at least as large as int(seq * self.mask_prob)? In other words, is the min(..., seq - 1) clamp ever actually needed?
(2) Masking with a probability value suggests the model might sometimes get to see more than (1 - mask_prob) of the tokens, but here the ratio is forced to be the same for every sequence? And does using a normal versus a uniform distribution make any big difference? (A quick check of that last point is sketched below.)
Thanks!
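On the normal-vs-uniform part of question (2): since the noise values are i.i.d. and continuous, topk over them should select every position with equal probability whichever distribution is used, so the choice shouldn't matter. A throwaway sketch of how one could check this empirically (the toy sizes are assumptions, not from the library):

```python
import torch

seq, num_mask, trials = 8, 2, 100_000   # assumed toy sizes

def selection_freq(noise):
    # how often each position lands in the top-k, pooled over all trials
    idx = noise.topk(num_mask, dim=-1).indices.flatten()
    return torch.bincount(idx, minlength=seq).float() / trials

# per-position masking frequency under normal vs uniform noise;
# both should come out close to num_mask / seq = 0.25 at every position
print(selection_freq(torch.randn(trials, seq)))
print(selection_freq(torch.rand(trials, seq)))
```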
I think these are good questions :) The current code makes sure that a fixed percentage (mask_prob) of tokens is masked for every sequence, whereas the original paper seems to mask tokens probabilistically, so that mask_prob of the tokens are masked only in expectation and the number of masked tokens can vary from sequence to sequence.
My hunch is that the latter is a more general augmentation than the former, and so may be more robust, but I haven't tested this hypothesis.
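To make the contrast concrete, here is a rough sketch of the probabilistic variant described above, where each non-<bos> token is masked independently with probability mask_prob, so the count varies per sequence (this is not the library's implementation; the sizes and mask_prob value are made up):

```python
import torch

mask_prob = 0.15                              # assumed value, for illustration only
batch, seq = 2, 10                            # assumed toy shapes
inp = torch.randint(0, 1000, (batch, seq))    # stand-in token ids

# independent Bernoulli(mask_prob) draw per token: True = selected for masking
drop = torch.rand(inp.shape, device=inp.device) < mask_prob

# never mask the first (<bos>) position
drop[:, 0] = False

# same orientation as the snippet in the issue: True = kept, False = masked out
mask = ~drop

# the number of masked tokens now varies per sequence,
# but averages mask_prob * (seq - 1) in expectation
print(drop.sum(dim=-1))
```

With this variant the model sometimes sees more and sometimes fewer than (1 - mask_prob) of the tokens, which is exactly the per-sequence variability the fixed top-k version removes.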
From the paper
x-transformers/x_transformers/autoregressive_wrapper.py, lines 274 to 280 (commit 90cef69)