Positional Encoding: How transformers know where they are
Transformers process all tokens in parallel, but they need to know positions. Explore four encoding methods, from sinusoidal waves and learned lookup tables to rotations and linear biases, and see how each one shapes the attention mechanism.
Why transformers need position info
A transformer processes all tokens in parallel through self-attention. There's no sequential scanning like in RNNs. This parallelism is a huge win for speed, but there's a problem: without position information, the model sees a bag of words.
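Without position information, the sequences "Alice ate the cake" and "cake the ate Alice" are indistinguishable: self-attention is permutation-equivariant, so each token receives the same output representation in either order. The model can't distinguish word order. The classic fix is to inject position information into the embeddings before attention.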
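The order-blindness described above can be checked numerically. The following is a minimal sketch, not a real transformer layer: it assumes a single attention head with identity query/key/value projections, and the toy embedding values are made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product scores of this token against every token
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        out.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out

# Hypothetical toy embeddings; no position information is added anywhere
emb = {
    "Alice": [1.0, 0.0, 0.5],
    "ate": [0.0, 1.0, 0.2],
    "the": [0.3, 0.3, 0.3],
    "cake": [0.9, 0.1, 0.8],
}
forward = ["Alice", "ate", "the", "cake"]
backward = list(reversed(forward))

out_fwd = dict(zip(forward, self_attention([emb[t] for t in forward])))
out_bwd = dict(zip(backward, self_attention([emb[t] for t in backward])))

# Each token ends up with the same vector regardless of word order
same_outputs = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for t in forward
    for a, b in zip(out_fwd[t], out_bwd[t])
)
print(same_outputs)
```

Reordering the input only reorders the output rows, so nothing downstream can recover which word came first.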
The core challenge
How do you encode position in a way that (1) is efficient, (2) generalizes to longer sequences, and (3) helps the model learn relative positions?
Position is geometry
Different encoding methods encode position differently: some add vectors, some rotate, some bias attention directly.
Quick check
Why can't a transformer know word order without position encodings?
RNNs process sequences token by token, carrying a hidden state forward. This gives them built-in position awareness. But RNNs have a critical flaw: they're inherently sequential, so their computation can't be parallelized across time steps.
Parallelism matters: An RNN processing a 1000-token sequence must do 1000 sequential steps. A transformer can process all 1000 tokens at once through parallel matrix operations. This is a major reason transformers dominate: on parallel hardware they train far faster.
Transformers pay for this speed with a tradeoff: we have to explicitly tell them where each token is. That trade is well worth it, which is why we're here learning about positional encodings.
Step-by-step guided lesson
The Problem
Why transformers need position awareness
The core issue: In a transformer, all tokens are processed in parallel through self-attention. There's no inherent notion of "order" or "position" — the model sees a bag of words.
Without position information:
The sequences "Alice ate the cake" and "cake the ate Alice" would give every token the same output representation in either order. The model can't distinguish word order or structure.
The solution: Add information about each token's position to its embedding before attention. Different methods do this differently. Let's explore them.
Quick check
What is the key advantage of sinusoidal positional encodings over learned encodings?
Four methods: detailed comparison
Sinusoidal (2017)
The original transformer approach
PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
✓ Pros
- Defined for any position, so there is no hard length limit (quality beyond the training length can still degrade)
- Encodes relative positions naturally
- No learned parameters
✗ Cons
- Somewhat arbitrary (why 10000?)
- Fixed frequency patterns
Used in: Original Transformer (BERT and GPT-2 use learned position embeddings; T5 uses learned relative position biases)
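As a rough illustration of the sinusoidal formula in this card, here is a minimal dependency-free sketch. The function name `sinusoidal_pe` and its parameters are illustrative choices, not taken from any library.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Return the sinusoidal encoding for one position as a list of length d_model."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sin/cos can be paired")
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe[2 * i] = math.sin(angle)      # even index: sine
        pe[2 * i + 1] = math.cos(angle)  # odd index: cosine
    return pe

# Position 0 is all (sin 0, cos 0) pairs
print(sinusoidal_pe(0, 8))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

In a model, the vector for each position would be added to the token embedding at that position before the first transformer layer.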
Learned (Parametric)
Position as simple lookup table
One embedding vector per position (up to max length). Trained via backprop like any other parameter.
✓ Pros
- Task-specific optimization
- Simple implementation
✗ Cons
- Fails on sequences longer than max
- Can't transfer to new lengths
- Extra parameters (memory)
Used in: BERT, RoBERTa, GPT-2 (mostly historical now)
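The lookup-table idea, and its hard length limit, can be sketched in a few lines. This is an illustrative stand-in only: the class name is hypothetical, and random initial values take the place of weights that training would learn.

```python
import random

class LearnedPositionalEmbedding:
    """A position lookup table: one vector per position, up to max_len."""

    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        # Random init stands in for values that backprop would learn
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(max_len)
        ]

    def __call__(self, pos):
        if not 0 <= pos < len(self.table):
            raise IndexError(f"position {pos} has no embedding (max_len={len(self.table)})")
        return self.table[pos]

pos_emb = LearnedPositionalEmbedding(max_len=512, d_model=8)
vec = pos_emb(511)   # last trained position: fine
# pos_emb(512) would raise IndexError: the table simply has no such row
```

The failure on position 512 is the "fails on sequences longer than max" drawback listed above: there is nothing to look up.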
RoPE (Rotary Position Embeddings)
Rotations in 2D planes
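Apply 2D rotation matrices to pairs of dimensions in the query and key vectors. Position is encoded as a rotation angle, so relative positions become differences in rotation angle and the query-key dot product depends only on the offset between the two positions.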
✓ Pros
- Defined for any position; extends to longer sequences well with techniques such as position interpolation
- Encodes relative positions naturally
- No parameters, fully deterministic
- Works with many attention variants
✗ Cons
- Requires dimension to be even
- Slightly complex to implement
Used in: LLaMA, Code Llama, Falcon, PaLM, modern SOTA models
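The relative-position property of RoPE can be verified with a small sketch. It assumes adjacent dimensions are paired, (x0, x1), (x2, x3), and so on, and reuses the base 10000 frequency schedule; some implementations pair dimensions differently. The names `rope` and `dot` are illustrative.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions of x by an angle proportional to pos."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]  # standard 2D rotation
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Made-up query and key vectors
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))   # positions 105 and 102, offset 3

# Same offset, same attention score (up to floating-point error)
print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

Because rotations preserve length, the vectors keep their norms; only the angle between query and key changes with their relative position.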
ALiBi (Attention with Linear Biases)
Distance-based attention bias
No position embeddings at all. Instead, add a distance-based bias directly to attention scores: bias(i, j) = −slope × |i − j|. Different slopes per head.
✓ Pros
- Simplest to implement
- Extrapolates well beyond the training length
- No position embeddings = fewer params
- Defined for arbitrarily long sequences
✗ Cons
- Less expressive than RoPE
- Linear distance metric (not learned)
Used in: BLOOM, MPT
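A minimal sketch of the ALiBi bias follows, using the formula from this card. The slope schedule 2^(-8(h+1)/H) for head h assumes the number of heads H is a power of two; the function names are illustrative.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence 2^(-8/H), 2^(-16/H), ... (H a power of two)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * |i - j|, to be added to attention logits before softmax."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])  # 0.5 0.00390625

bias = alibi_bias(4, slopes[0])
print(bias[0])  # [-0.0, -0.5, -1.0, -1.5]
```

Heads with steep slopes focus on nearby tokens, while heads with shallow slopes can still attend far back. With causal masking only the entries where j ≤ i are ever used.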
Quick check
Which positional encoding method was specifically designed to extrapolate to longer sequences than training?
Interactive playground
Compare encoding methods side-by-side. Test with different sequence lengths, embedding dimensions, and positions. Watch how similarity and distance change.
Example playground readout (positions 2 and 8, 64-dimensional encodings):
- Cosine similarity: 0.736 (1 = identical, 0 = orthogonal, −1 = opposite)
- Euclidean distance: 4.109 (how far apart in embedding space)
- Position distance: 6 (tokens apart)
What to observe
- Similarity by distance: Increase the distance between pos1 and pos2. Similarity generally falls as positions move apart, though periodic encodings can oscillate rather than decrease monotonically
- Dimension size: Higher dimensions give more capacity to separate positions, but may slow convergence
- Long sequences: Try a sequence length above 512. A learned table has no entry past its maximum length, while sinusoidal, RoPE, and ALiBi encodings remain defined at any position
- Method comparison: Switch between methods to compare how they encode position information
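The similarity and distance metrics can be reproduced offline. The sketch below assumes the sinusoidal formula with base 10000 and 64 dimensions; under that assumption, positions 2 and 8 give values close to the readout of 0.736 and 4.109 shown for the playground. Helper names are illustrative.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    freqs = [base ** (-2 * i / d_model) for i in range(d_model // 2)]
    # Interleave sin and cos for each frequency
    return [f(pos * w) for w in freqs for f in (math.sin, math.cos)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

p2, p8 = sinusoidal_pe(2, 64), sinusoidal_pe(8, 64)
sim = cosine_similarity(p2, p8)
dist = euclidean_distance(p2, p8)
print(round(sim, 3), round(dist, 3))

# The similarity depends only on the offset: positions 102 and 108 give the same value
sim_shifted = cosine_similarity(sinusoidal_pe(102, 64), sinusoidal_pe(108, 64))
```

That the shifted pair matches is the relative-position property of sinusoidal encodings in action.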
Quick check
If you increase the sequence length in the playground with RoPE encoding, what happens to the cosine similarity between distant positions?
Deep dives
Absolute: The encoding itself represents position. PE(pos) is different for every position. Example: learned embeddings, sinusoidal encodings.
Relative: What matters is the distance between positions rather than their absolute indices. The model learns patterns like "word 3 positions away typically means...".
Most modern methods (RoPE, ALiBi) naturally encode relative positions. This helps them extrapolate: relations between nearby tokens look the same anywhere in the sequence, so what the model learns about them carries over to new positions.
In sinusoidal encoding (which RoPE is based on), the base frequency 10000 is somewhat arbitrary. It was chosen empirically in the original transformer paper.
The idea: with base 10000 and model dimension d=512, dimension pair i has wavelength 2π × 10000^(2i/d). The shortest wavelength is 2π × 10000^(0/512) = 2π ≈ 6.3 positions (the fastest-oscillating pair), and the longest is 2π × 10000^(510/512) ≈ 60,600 positions (the slowest-oscillating pair).
This creates a nice spread of frequencies: some dimensions capture ultra-long-range patterns, others capture fine-grained local patterns. The 10000 value works well in practice, though other bases are possible.
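The spread of wavelengths follows directly from the sinusoidal formula: dimension pair i completes one cycle every 2π × base^(2i/d) positions. A small sketch, with an illustrative helper name:

```python
import math

def wavelength(i, d_model, base=10000.0):
    """Positions per full cycle for dimension pair i (covering indices 2i and 2i+1)."""
    return 2 * math.pi * base ** (2 * i / d_model)

d_model = 512
shortest = wavelength(0, d_model)                 # 2π, about 6.28 positions
longest = wavelength(d_model // 2 - 1, d_model)   # about 60,600 positions

print(round(shortest, 2), round(longest))
```

The wavelengths form a geometric progression between these two extremes, so low-index pairs resolve neighboring tokens and high-index pairs change slowly across tens of thousands of positions.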
The problem: If you train a model on sequences ≤ 2048 tokens, can it handle 4096-token sequences?
Learned embeddings: No. They have no embedding vector for position 4096.
Sinusoidal/RoPE/ALiBi: In principle, yes. The formulas are defined for any position. In practice, attention patterns learned during training might not generalize to positions the model never saw, so quality can degrade without further adaptation. Still, these methods give you a fighting chance.
Modern techniques like position interpolation (rescaling position indices into the trained range) push this further: LLaMA models trained on 2048 tokens have been extended to contexts of up to 32k with little accuracy loss, typically after a short fine-tuning run.
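Position interpolation was proposed for extending the context window of LLaMA-family models: it rescales the positions fed to RoPE so that a longer sequence is compressed into the position range seen during training. Code Llama takes a related route, raising the RoPE base from 10,000 to 1,000,000 during long-context fine-tuning.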
Basic idea: If trained on positions 0–2048, interpolate new positions into that range. Instead of using raw position m, use m × (training_length / new_length).
This tricks the model into thinking it's still within its training distribution, while still encoding the actual position implicitly. Remarkably effective.
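The rescaling rule above is a one-liner. The sketch below uses a hypothetical helper name and assumes the scaled (fractional) position is then fed to RoPE in place of the raw integer index.

```python
def interpolated_position(m, train_len, new_len):
    """Map raw position m in a sequence of length new_len into the trained range."""
    if new_len <= train_len:
        return float(m)  # already within the trained range: leave unchanged
    return m * (train_len / new_len)

train_len, new_len = 2048, 4096
scaled = [interpolated_position(m, train_len, new_len) for m in (0, 1, 2048, 4095)]
print(scaled)  # [0.0, 0.5, 1024.0, 2047.5]
```

Every position in the 4096-token sequence now maps below 2048, at the cost of neighboring tokens being only half a position apart, which is one reason a short fine-tuning run usually helps.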
In-context learning (ICL) is when a model learns from examples in its context window. Position encoding is crucial: examples early in the context should influence the model differently than examples right before the actual task.
Different position encodings have different implicit biases. Some methods (like ALiBi) inherently favor recent tokens. Others (like RoPE) are more symmetric. This affects how well the model can utilize examples.
Recent research shows that certain position encodings can significantly improve ICL performance.
Quick check
How does position interpolation allow LLaMA to handle sequences much longer than its training length?
Best practices and takeaways
For new projects
- Default to RoPE: It's the current standard in open LLMs. Strong generalization, elegant, used in LLaMA, Falcon, PaLM.
- Consider ALiBi if: You want the simplest implementation, have tight parameter budgets, or specifically need to support very long sequences without fine-tuning.
- Avoid learned embeddings if: You might encounter sequences longer than training. They don't generalize.
Implementation tips
- RoPE: Apply rotations after projection to Q and K, before attention. Ensure dimension is even; handle odd dimensions by padding or special casing.
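- ALiBi: Compute the bias matrix once, add it to attention logits before softmax. Use a different slope per head: 2^(-(h+1)×8/H) for head h = 0, ..., H−1, where H is the number of heads (the formula assumes H is a power of two).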
- Sinusoidal: Pre-compute PE matrix, add to embeddings before transformer layers. Cache it if possible for speed.
Key properties to remember
- ✓ Position encoding must be deterministic (no randomness) so inference is reproducible
- ✓ Relative position info should be learnable or at least captured in the representation
- ✓ The method should generalize to longer sequences (ideally without retraining)
- ✓ Implementation should be efficient (parallelizable, no sequential loops)
Quick check
If a transformer model uses RoPE and is trained on 4k-token sequences, why can it often work on 8k-token sequences without retraining?
Next steps
Now that you understand positional encoding, the natural next step is to dive deeper into how it integrates with attention.