How attention actually works — the math, made visual
Every time you use ChatGPT, Google Translate, or a code assistant, attention is running under the hood. It's the mechanism that lets a model figure out which words matter most to each other. This lesson breaks it down into eight digestible steps with live interactive visualizations.
Why attention exists
Older models like RNNs read a sentence one word at a time, left to right, carrying a single "memory" forward. That works until the sentence gets long — by the time the model reaches the last word, it has mostly forgotten the first one.
Attention solves this by letting every word look at every other word at once. Instead of a single narrow memory, each word gets a custom summary of the entire sentence, weighted by relevance. The word "she" can look directly back at "Alice" in a single step, even if they're 10 words apart.
The one-sentence version
Attention lets each word ask: "Of all the other words around me, which ones should I pay the most attention to right now?" — and then blend their information into its own representation.
Live attention flow
Click a word to see where it sends its attention
"she" pays the most attention to "a" (68.9%) and the least to "question" (0.5%)
In an RNN, information passes through a bottleneck: every piece of context must be compressed into a single hidden-state vector at each step. If the sentence is 50 words long, the model at position 50 is trying to recall position 1 through a chain of 49 lossy transformations.
Attention bypasses that chain entirely. Position 50 can read position 1 in a single matrix multiplication. There's no information bottleneck — every word has a direct line to every other word. This is also why transformers train much faster: all positions can be computed in parallel instead of sequentially.
The trade-off is compute: attention costs O(n²) per layer (every word compared to every word), while RNNs are O(n). Modern variants like FlashAttention and sparse attention mitigate this, and it remains a good trade for most practical sequence lengths.
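The "direct line" above is literally one matrix multiplication. Here's a minimal single-head sketch in NumPy (the 50-token sentence, dimensions, and random vectors are illustrative stand-ins for real embeddings):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # every word vs. every word: O(n^2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V                        # blend values by relevance

rng = np.random.default_rng(0)
n, d = 50, 16                                 # 50 tokens, 16-dim vectors
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

out = attention(Q, K, V)
print(out.shape)                              # (50, 16)
```

Note that position 50 reads position 1 through the same single `Q @ K.T` product as its immediate neighbor: there is no chain of steps to decay through, and every row is computed in parallel.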
Quick check
In the sentence "Alice emailed Bob because she had a question", which word most needs to attend to a distant word to be understood correctly?
The one formula you need
Despite its fame, the core attention math fits on a single line. Everything else in a transformer is just making this line run faster, in parallel, or on longer sequences. Click each part of the formula to see what it does:
Click any part to explore it
Tap Q, K, V, softmax, or any symbol above
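For reference, here is that single line written out (the standard scaled dot-product attention formula):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```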
Q — the Question
Each word generates a query: "I need context about X." Think of it as a search query typed into Google — it describes what you're looking for.
K — the Label
Each word also generates a key: "Here's what I'm about." The search engine matches queries against these labels to produce a relevance score.
V — the Content
When a match is found, you don't get the label back — you get the value. Like clicking a search result and reading the actual page.
√dₖ — the Volume Knob
Without this, big vectors produce enormous dot products and softmax collapses to a hard argmax. Scaling keeps things smooth and trainable.
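You can see the problem numerically. Dot products of random d-dimensional vectors have variance roughly d, so raw scores blow up as vectors get bigger; dividing by √d keeps them at a trainable scale (a quick sketch with illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
for d in (4, 64, 1024):
    q = rng.normal(size=(10_000, d))
    k = rng.normal(size=(10_000, d))
    raw = (q * k).sum(axis=1)                 # 10,000 sample dot products
    scaled = raw / np.sqrt(d)
    print(f"d={d:5d}  std(raw)={raw.std():7.2f}  std(raw/sqrt(d))={scaled.std():.2f}")
```

The raw standard deviation grows like √d (about 2, 8, and 32 here), while the scaled scores stay near 1 regardless of dimension — so softmax never sees the huge inputs that would collapse it into an argmax.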
If we simply picked the highest-scoring key (argmax), attention would be "hard" — each word would look at exactly one other word. The model couldn't express uncertainty or blend information from multiple sources.
Softmax gives us a smooth distribution. A word can pay 60% attention to one word, 25% to another, and 15% to a third. This soft blending is differentiable (crucial for gradient descent) and much more expressive — the model can learn nuanced relationships, not just binary choices.
An additional benefit: even low-scoring words contribute a small amount, which acts as a form of regularization and helps the model stay robust to small input changes.
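The hard-vs-soft difference is easy to see with toy numbers (the four scores below are made up for illustration):

```python
import numpy as np

scores = np.array([2.0, 1.2, 0.5, -1.0])   # relevance scores for 4 candidate words

# Hard attention: argmax picks exactly one word, all-or-nothing.
hard = np.zeros_like(scores)
hard[scores.argmax()] = 1.0

# Soft attention: softmax spreads weight, still favouring high scores.
soft = np.exp(scores) / np.exp(scores).sum()

print(hard)               # [1. 0. 0. 0.]
print(soft.round(3))      # ~ [0.581, 0.261, 0.130, 0.029]
print(soft.sum())         # weights form a probability distribution (sum to 1)
```

The soft weights still favour the top word, but the runners-up keep contributing — and, unlike argmax, the whole thing is differentiable.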
Quick check
What would happen if you removed the scaling factor (√dₖ) from the attention formula?
A friendlier way to think about it
Imagine you're at a conference with dozens of people. You can't talk to everyone equally, so you scan their name badges (keys), decide who's most relevant to what you need (query), and then actually listen to those people (values). The more relevant they are, the more attention you give them. By the end, you walk away with a blended summary weighted toward the most useful conversations.
1. Broadcast a query
"I'm the word she — who around me resolves what I refer to?"
2. Score the matches
Alice scores high (noun, female) while because scores low (conjunction, no entity).
3. Blend the results
The output for she is now mostly composed of information from Alice. Pronoun resolved.
Raw embeddings encode everything about a word — its meaning, part of speech, position, etc. But the question "who should I attend to?" is different from "what information do I carry?"
Separate projections let the model decouple these roles. The word "she" might project a query that emphasizes "looking for a female noun" while its value projection emphasizes "3rd person singular pronoun." Without separate projections, matching and retrieval would be stuck using the same features.
This is also why multi-head attention is so powerful — each head can learn completely different Q, K, V projections, so one head might match on semantics while another matches on syntax or position.
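A sketch of what "separate projections per head" means mechanically (random matrices stand in for learned weights; token count, dimensions, and head count are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, d, heads = 8, 32, 4                 # 8 tokens, model dim 32, 4 heads
d_h = d // heads                       # each head works in d/heads = 8 dims
X = rng.normal(size=(n, d))            # token embeddings

outputs = []
for h in range(heads):
    # Each head gets its OWN projections, so the features used for
    # matching (Q, K) are decoupled from the content retrieved (V).
    Wq, Wk, Wv = (rng.normal(size=(d, d_h)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d_h))
    outputs.append(weights @ V)        # (n, d_h) per head

multi = np.concatenate(outputs, axis=-1)   # heads side by side: (n, d)
print(multi.shape)                         # (8, 32)
```

With trained weights, each head's `Wq`/`Wk` pair would learn its own notion of "relevant" — one head's queries might match on semantics, another's on syntax or position.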
See it compute in real time
This is the hands-on part. Step through the entire attention computation one stage at a time. Pick a sentence, click on a word to set it as the "query" (the word doing the looking), and watch the numbers flow from raw scores to final output.
Tip: try the preset sentences — each highlights something different. "Who does she mean?" shows pronoun resolution. "Which bank?" shows word disambiguation from context.
Step 1 — Break it into pieces
Before the model can do anything, the sentence needs to be chopped into tokens. Real models use clever sub-word splitting (so "unhappily" might become "un" + "happi" + "ly"), but plain whitespace works fine for seeing the math.
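For the sentence used in this lesson, whitespace tokenization is a one-liner (the sub-word fragments in the comment are illustrative, not an actual BPE vocabulary):

```python
sentence = "Alice emailed Bob because she had a question"
tokens = sentence.split()              # naive whitespace tokenization
print(tokens)
print(len(tokens))                     # 8

# Real models use sub-word vocabularies, so rare words decompose:
# "unhappily" -> ["un", "happi", "ly"]
```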
Tokens (8)
Quick check
After softmax, each row of the attention matrix sums to 1. What does a single row represent?
How production transformers go further
The walkthrough above shows the essential computation. Real models like GPT-4 or Llama wrap it with a few extra pieces that make training stable and expressive:
Multi-head attention
Instead of one attention pass, run 8, 16, or even 128 in parallel. Each "head" can specialize: one might track syntax, another coreference, another positional patterns. You explored this in Step 8 of the walkthrough.
Residual connections
The output of each layer is added back to its input: x + Attention(x). This "skip connection" lets gradients flow directly through 100+ layers and prevents the model from losing the original signal.
Feed-forward network (MLP)
After attention mixes information between words, a per-word MLP processes each position independently. Research suggests this is where much of the model's factual knowledge gets stored.
Positional encoding
Attention by itself doesn't know word order — "dog bites man" and "man bites dog" produce the same scores. Positional encodings (sinusoidal, learned, or rotary) inject order information so the model knows where each word sits.
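These pieces compose into the familiar block structure. Here's a minimal sketch of one layer — positional encoding added to the embeddings, a residual around attention, a residual around the per-token MLP — with random weights and illustrative sizes (real blocks also apply layer normalization around each sub-layer, omitted here for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sinusoidal_positions(n, d):
    """Classic sin/cos positional encoding: injects word order."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(3)
n, d = 8, 32
X = rng.normal(size=(n, d)) + sinusoidal_positions(n, d)  # order goes in here

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

h = X + attention(X, Wq, Wk, Wv)          # residual: x + Attention(x)
h = h + np.maximum(h @ W1, 0) @ W2        # residual around per-token ReLU MLP
print(h.shape)                            # (8, 32)
```

Note the MLP line touches each row (each token) independently — attention is the only place information moves *between* positions.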
Each head produces its own output matrix (one row per token). These are concatenated side by side into a wide matrix, then multiplied by a learned output projection matrix WO that maps the result back to the model's dimension. So if you have 8 heads each producing d/8 dimensions, concatenation gives 8 × (d/8) = d columns, and WO maps that d back to d.
This means the model can learn which combinations of head outputs are most useful. Some heads might contribute more to certain types of tokens, and WO blends their signals accordingly.
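The concatenate-then-project step, in NumPy (random matrices stand in for the heads' outputs and the learned WO; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, heads = 8, 64, 8
d_h = d // heads                               # each head outputs d/8 = 8 dims

head_outputs = [rng.normal(size=(n, d_h)) for _ in range(heads)]
concat = np.concatenate(head_outputs, axis=-1) # 8 x (d/8) = d columns wide
W_O = rng.normal(size=(d, d))                  # learned output projection
out = concat @ W_O                             # blends the heads' signals

print(concat.shape, out.shape)                 # (8, 64) (8, 64)
```

Because every output dimension of `out` is a weighted mix of *all* heads' outputs, training can learn which head combinations matter for which kinds of tokens.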
Quick check
Why does attention treat "dog bites man" the same as "man bites dog" without positional encoding?
Open playground
Done with the guided lesson? This playground gives you full control. Try your own sentences, tweak the number of attention heads, adjust embedding dimensions, and compare how different heads focus on different patterns.
Key takeaway
Attention is fundamentally a soft lookup table. Queries search, keys get matched, values get retrieved. Softmax makes it differentiable so gradient descent can learn which lookups are useful. Everything else — multi-head, residual connections, positional encoding — is engineering to make this core idea work at scale.