huggingface · sezan92 · Dec 24, 2024
@@ -4,6 +4,10 @@
 
 As the Transformers architecture scaled well in Natural Language Processing, the same architecture was applied to images by creating small patches of the image and treating them as tokens. The result was a Vision Transformer (ViT). Before we get started with transfer learning / fine-tuning concepts, let's compare Convolutional Neural Networks (CNNs) with Vision Transformers.
 
+## Vision Transformer: A Summary
+
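+To summarize, a Vision Transformer splits every image into a 2D grid of fixed-size patches and treats each patch as a token. The model is then trained on the resulting sequence of patch tokens, just as a language model is trained on a sequence of word tokens.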
+
 ### CNN vs Vision Transformers: Inductive Bias
 
 Inductive bias is a term used in machine learning to describe the set of assumptions that a learning algorithm uses to make predictions. In simpler terms, inductive bias is like a shortcut that helps a machine learning model make educated guesses based on the information it has seen so far.