
[Minor; noob question] Uniform distribution instead of normal #232

p0p4k opened this issue Jan 21, 2024 · 1 comment

p0p4k commented Jan 21, 2024

From the paper:
[image: excerpt from the paper]

    if self.mask_prob > 0.:
        rand = torch.randn(inp.shape, device = x.device)
        rand[:, 0] = -torch.finfo(rand.dtype).max # first token should not be masked out
        num_mask = min(int(seq * self.mask_prob), seq - 1)
        indices = rand.topk(num_mask, dim = -1).indices
        mask = ~torch.zeros_like(inp).scatter(1, indices, 1.).bool()
        kwargs.update(self_attn_kv_mask = mask)

I am still trying to understand the code:

rand = torch.randn(inp.shape, device = x.device) ---> creates a tensor of random numbers drawn from a standard normal distribution (mean 0, std 1)
rand[:, 0] = -torch.finfo(rand.dtype).max # first token should not be masked out ---> makes the first <bos> token unmaskable; it is set to the most negative representable value, so it can never land in the topk
num_mask = min(int(seq * self.mask_prob), seq - 1) ---> we want to mask each token with probability mask_prob == we can just randomly pick a mask_prob fraction of the tokens, and the count should never exceed seq - 1 tokens
indices = rand.topk(num_mask, dim = -1).indices ---> the positions holding the num_mask largest random values are chosen to be masked (so shouldn't this be a uniform distribution, according to the paper? a quick check of this is sketched below)
mask = ~torch.zeros_like(inp).scatter(1, indices, 1.).bool() ---> creates a boolean mask that is False at the masked positions and True everywhere else
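
On the normal-vs-uniform part, here is a small self-contained sketch (not from the repo; batch, selection_freq, etc. are just illustrative names). The point it tries to show: topk over any i.i.d. continuous noise should pick positions uniformly at random, so swapping torch.randn for torch.rand should not change which tokens end up masked:

    import torch

    torch.manual_seed(0)
    batch, seq, mask_prob = 10000, 8, 0.25
    num_mask = min(int(seq * mask_prob), seq - 1)   # 2 positions per sequence here

    def selection_freq(noise):
        # same trick as the snippet above: first token can never reach the topk
        noise[:, 0] = -torch.finfo(noise.dtype).max
        indices = noise.topk(num_mask, dim=-1).indices
        # how often each position gets picked for masking, across the batch
        return torch.bincount(indices.flatten(), minlength=seq).float() / batch

    print(selection_freq(torch.randn(batch, seq)))  # normal noise
    print(selection_freq(torch.rand(batch, seq)))   # uniform noise
    # both should be ~0 at position 0 and ~num_mask / (seq - 1) ~ 0.286 elsewhere,
    # since with i.i.d. continuous noise every size-num_mask subset is equally likely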

I have 2 questions:
(1) Given that mask_prob is already asserted to be < 1. earlier in the code, will int(seq * self.mask_prob) ever be bigger than seq - 1? In other words, is the min actually needed? (A quick check is sketched below.)
(2) We are masking with a probability value, so doesn't that mean the model should sometimes get to see more than a (1 - mask_prob) fraction of the tokens? But here we force the exact ratio on every sequence. And then, does using normal vs uniform noise make any big difference?
Thanks!
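
For (1), a quick brute-force check one could run (illustrative only, not part of the repo): since seq * mask_prob < seq whenever mask_prob < 1, int(seq * mask_prob) should never exceed seq - 1, which would make the min just a safety guard:

    # does int(seq * p) ever exceed seq - 1 for p < 1? (brute force over a small grid)
    violations = [
        (seq, p)
        for seq in range(1, 2000)
        for p in (i / 1000 for i in range(1000))    # p in {0.000, ..., 0.999}
        if int(seq * p) > seq - 1
    ]
    print(violations)   # prints [] here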

@shuishida

I think these are good questions :) The current code makes sure that a fixed percentage (mask_prob) of tokens is masked for every sequence, whereas the original paper seems to probabilistically mask mask_prob% of tokens in expectation (so the number of masked tokens can vary from sequence to sequence).

My hunch is that the latter is a more general augmentation compared to the former, so it may be more robust, but I haven't tested this hypothesis.
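
For comparison, a minimal sketch (my own reading, not from the repo or the paper's code) of that "in expectation" variant: each token is dropped independently with probability mask_prob, so the number of masked tokens varies per sequence while the first token stays visible:

    import torch

    batch, seq, mask_prob = 4, 10, 0.15
    inp = torch.randint(0, 100, (batch, seq))       # dummy token ids

    if mask_prob > 0.:
        # each position is masked independently with probability mask_prob,
        # so only the *expected* fraction of masked tokens equals mask_prob
        mask = torch.rand(inp.shape, device=inp.device) >= mask_prob
        mask[:, 0] = True                           # first token should not be masked out
        # True = token stays visible; the per-row count of False entries varies
        print(mask, (~mask).sum(dim=-1))

The True/False convention matches the scatter-based mask above (True outside the masked positions), so it could in principle be passed the same way; whether the fixed-count or the per-token Bernoulli version trains better is exactly the open question in this thread.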
