Interactive lesson | ~18 min | Intermediate
Vision Transformers
Vision Transformers treat images as token sequences. Instead of sliding filters over pixels, they split an image into patches and let attention connect them.
ViT | Patch embeddings | DeiT
Mental model
An image becomes a sentence of patches.
ViTs serve as modern vision backbones and power multimodal encoders, segmentation systems, and image-language models.
Detail capture: balanced, 69% modeled signal
Compute cost: balanced, 66% modeled signal
Robustness: balanced, 54% modeled signal
Concept pipeline
Build the idea in four moves
Interactive lab
Balance visual detail against compute.
Patch
Split the image into fixed-size tiles.
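The patch step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `patchify`, the 224x224 RGB input, and the patch size of 16 are assumptions chosen for the example, not values taken from this lesson.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per patch.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # View the image as a grid of tiles: (rows, P, cols, P, C).
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Bring the two grid axes to the front: (rows, cols, P, P, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each tile into one vector, in row-major patch order.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Assumed example input: a 224x224 RGB image and 16x16 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each row of `tokens` is one patch. In a full ViT, a learned linear projection would typically map each row to the model's embedding width before attention is applied; that step is omitted here.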
Focus lens
The part that usually clicks late
Patch size
Smaller patches preserve detail but increase token count.
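A quick calculation makes the trade-off concrete. The sketch below assumes a square 224x224 image and counts non-overlapping patches; the helper name `token_stats` and the sizes tried are illustrative assumptions. Because standard self-attention compares every token with every other token, the number of token pairs grows with the square of the token count.

```python
def token_stats(image_size, patch_size):
    """Return (token_count, attention_pairs) for a square image.

    Hypothetical helper for illustration. attention_pairs is the number
    of query-key pairs in one full self-attention map (token_count ** 2).
    """
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    tokens = (image_size // patch_size) ** 2
    return tokens, tokens * tokens

# Assumed image size of 224; halving the patch size quadruples the tokens.
for patch_size in (32, 16, 8):
    tokens, pairs = token_stats(224, patch_size)
    print(f"patch {patch_size:>2}: {tokens:>3} tokens, {pairs:>6} attention pairs")
# patch 32:  49 tokens,   2401 attention pairs
# patch 16: 196 tokens,  38416 attention pairs
# patch  8: 784 tokens, 614656 attention pairs
```

Halving the patch size multiplies the token count by 4 and the attention pairs by 16, which is why finer patches buy detail at a steep compute cost.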
Knowledge check
What is the core ViT move?
Next horizon
Where this topic is headed
Masked autoencoders
DINO-style self-supervision
Vision-language encoders