A Transformer implementation for images (Vision Transformer). The steps are:
- Patch embedding
- Attention module
- Multi-layer perceptron
- Transformer block
- Final Vision Transformer
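The steps above can be sketched as a minimal PyTorch implementation. The module names, image size, and hyperparameters below are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into patches and project each patch to an embedding."""
    def __init__(self, img_size=32, patch_size=4, in_chans=3, embed_dim=64):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

class TransformerBlock(nn.Module):
    """Pre-norm block: attention and MLP, each with a residual connection."""
    def __init__(self, embed_dim=64, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class ViT(nn.Module):
    """Minimal ViT: patch embedding + [CLS] token + blocks + linear head."""
    def __init__(self, img_size=32, patch_size=4, num_classes=10,
                 embed_dim=64, depth=2, num_heads=4):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        self.blocks = nn.Sequential(*[TransformerBlock(embed_dim, num_heads)
                                      for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.blocks(x)
        return self.head(self.norm(x)[:, 0])  # classify from the [CLS] token

model = ViT()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```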
Pretraining ViT on the ImageNet dataset and testing it on a new image.
Additional topics:
- How a linear layer behaves when applied to tensors with 3 or more dimensions.
- Basic properties of layer normalization.
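Both additional topics can be demonstrated in a few lines. The shapes below are illustrative:

```python
import torch
import torch.nn as nn

# 1) nn.Linear on a 3-dim tensor: the layer acts on the last dimension
# only; all leading dimensions are treated as batch dimensions.
linear = nn.Linear(64, 128)
tokens = torch.randn(2, 10, 64)  # (batch, sequence, features)
out = linear(tokens)
print(out.shape)                 # torch.Size([2, 10, 128])

# Equivalent to flattening the leading dims, applying the layer,
# and reshaping back:
flat = linear(tokens.reshape(-1, 64)).reshape(2, 10, 128)
print(torch.allclose(out, flat))  # True

# 2) LayerNorm normalizes each token's feature vector independently:
# after normalization the last dimension has (approximately) zero mean
# and unit variance, before the learnable affine scale/shift applies
# (which is initialized to the identity).
ln = nn.LayerNorm(64)
normed = ln(tokens)
print(normed.mean(dim=-1).abs().max())        # per-token mean ~ 0
print(normed.var(dim=-1, unbiased=False))     # per-token variance ~ 1
```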
Inspired by https://www.youtube.com/watch?v=ovB0ddFtzzA