Vamsi Krishna Sankarayogi — Technologist at Heart

Why transformers need position info

A transformer processes all tokens in parallel through self-attention. There's no sequential scanning like in RNNs. This parallelism is a huge win for speed, but there's a problem: without position information, the model sees a bag of words.

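The sequences "Alice ate the cake" and "cake the ate Alice" would produce the same attention weights between any two words, so each word ends up with the same representation in both orderings. The model can't distinguish word order. So transformers inject position information, classically by adding it to the embeddings before attention.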
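The bag-of-words behavior can be checked with a tiny sketch. It assumes a single attention head with identity Q/K/V projections and made-up 3-dimensional embeddings (both are simplifications for illustration, not part of any real model), and shows that each word gets the same output vector in both orderings.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    # Single-head self-attention with identity Q/K/V projections
    # (a simplification; real layers use learned projection matrices).
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Toy embeddings invented for this example; no position information is added.
emb = {
    "Alice": [1.0, 0.0, 0.5],
    "ate": [0.0, 1.0, 0.2],
    "the": [0.3, 0.3, 0.3],
    "cake": [0.5, 0.1, 1.0],
}
order_a = ["Alice", "ate", "the", "cake"]
order_b = ["cake", "the", "ate", "Alice"]
out_a = dict(zip(order_a, self_attention([emb[w] for w in order_a])))
out_b = dict(zip(order_b, self_attention([emb[w] for w in order_b])))

for word in order_a:
    same = all(math.isclose(x, y, abs_tol=1e-9)
               for x, y in zip(out_a[word], out_b[word]))
    print(word, "has the same output in both orderings:", same)
```

Shuffling the input only shuffles the output rows, which is why some explicit position signal is needed.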

The core challenge

How do you encode position in a way that (1) is efficient, (2) generalizes to longer sequences, and (3) helps the model learn relative positions?

🧭

Position is geometry

Different encoding methods encode position differently: some add vectors, some rotate, some bias attention directly.

Quick check

Why can't a transformer know word order without position encodings?

RNNs process sequences token-by-token, carrying a hidden state forward. This gives them built-in position awareness. But RNNs have a critical flaw: they're inherently sequential, so the computation can't be parallelized across time steps.

Parallelism matters: An RNN processing a 1000-token sequence must do 1000 sequential steps. A transformer can process all 1000 tokens at once through parallel matrix operations. This is a major reason transformers dominate: on parallel hardware they are far faster to train.

The price of that speed is that we have to explicitly tell transformers where each token is. That trade is absolutely worth it, which is why we're here learning about positional encodings.

Step-by-step guided lesson

The Problem

Why transformers need position awareness

The core issue: In a transformer, all tokens are processed in parallel through self-attention. There's no inherent notion of "order" or "position" — the model sees a bag of words.

The solution: Add information about each token's position to its embedding before attention. Different methods do this differently. Let's explore them.

Quick check

What is the key advantage of sinusoidal positional encodings over learned encodings?

Four methods: detailed comparison

〰️

Sinusoidal (2017)

The original transformer approach

PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

✓ Pros

Defined for any position, so it can be evaluated on longer sequences
Encodes relative positions naturally
No learned parameters

✗ Cons

Somewhat arbitrary (why 10000?)
Fixed frequency patterns

Used in: Original Transformer
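A minimal sketch of the sinusoidal formula above, in plain Python. The function name sinusoidal_pe and the helper dot are hypothetical names chosen for this example. It also illustrates the relative-position property: the dot product between two encodings depends only on how far apart the positions are.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    # Even index 2i gets sin(pos / base^(2i/d)), odd index 2i+1 gets the cos.
    pe = [0.0] * d_model
    for even_idx in range(0, d_model, 2):
        angle = pos / (base ** (even_idx / d_model))
        pe[even_idx] = math.sin(angle)
        if even_idx + 1 < d_model:
            pe[even_idx + 1] = math.cos(angle)
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 64
# Same offset (7 positions) at two different places in the sequence.
near_start = dot(sinusoidal_pe(3, d_model), sinusoidal_pe(10, d_model))
further_on = dot(sinusoidal_pe(50, d_model), sinusoidal_pe(57, d_model))
print(near_start, further_on)  # the two values agree up to rounding
```

The two dot products match because each sin/cos pair contributes cos(offset × frequency), which does not depend on the absolute position.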

🧠

Learned (Parametric)

Position as simple lookup table

One embedding vector per position (up to max length). Trained via backprop like any other parameter.

✓ Pros

Task-specific optimization
Simple implementation

✗ Cons

Fails on sequences longer than max
Can't transfer to new lengths
Extra parameters (memory)

Used in: BERT, RoBERTa, GPT-2 (mostly historical now)
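The length limit of a learned table can be sketched as follows. LearnedPositionTable is a hypothetical class for illustration: the random initial values stand in for parameters that backprop would normally train, and no training is shown here.

```python
import random

class LearnedPositionTable:
    # One vector per position, up to max_len. In a real model these values
    # would be trained; here they are just randomly initialised.
    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        self.max_len = max_len
        self.d_model = d_model
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                      for _ in range(max_len)]

    def num_parameters(self):
        return self.max_len * self.d_model

    def lookup(self, pos):
        if not 0 <= pos < self.max_len:
            raise IndexError(f"no embedding for position {pos} (max_len={self.max_len})")
        return self.table[pos]

table = LearnedPositionTable(max_len=512, d_model=64)
print(table.num_parameters())      # extra parameters spent on positions
vec = table.lookup(511)            # last position seen in training: fine
try:
    table.lookup(512)              # one past the table: nothing to look up
except IndexError as err:
    print("lookup failed:", err)
```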

🔄

RoPE (Rotary Position Embeddings)

Rotations in 2D planes

Apply 2D rotation matrices to pairs of query and key dimensions. Position is encoded as a rotation angle, so the dot product between a rotated query and a rotated key depends only on the difference of their angles, that is, on their relative position.

✓ Pros

Defined for any position; extends to longer contexts well with techniques such as position interpolation
Encodes relative positions naturally
No parameters, fully deterministic
Works with many attention variants

✗ Cons

Requires dimension to be even
Slightly complex to implement

Used in: LLaMA, Code Llama, Falcon, PaLM, modern SOTA models
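A minimal sketch of the rotation idea, assuming adjacent dimensions are paired as (0, 1), (2, 3), and so on (some implementations pair the first half of the vector with the second half instead). The name rope and the toy vectors are hypothetical. The check at the end shows that the query-key dot product depends only on the relative offset.

```python
import math

def rope(x, pos, base=10000.0):
    # Rotate each adjacent pair (x[i], x[i+1]) by an angle proportional to pos.
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors, invented for this example.
q = [0.3, -1.2, 0.7, 0.1]
k = [1.0, 0.4, -0.5, 0.9]

# Offset of 3 positions, placed early and late in the sequence.
score_early = dot(rope(q, 5), rope(k, 2))
score_late = dot(rope(q, 105), rope(k, 102))
print(score_early, score_late)  # equal up to rounding
```

Because rotations preserve length, the vectors keep their norms; only the angle between query and key changes with relative position.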

📊

ALiBi (Attention with Linear Biases)

Distance-based attention bias

No position embeddings at all. Instead, add a distance-based bias directly to attention scores: bias(i, j) = −slope × |i − j|. Different slopes per head.

✓ Pros

Simplest to implement
Extrapolates well to sequences longer than those seen in training
No position embeddings = fewer params
Works for arbitrarily long sequences

✗ Cons

Less expressive than RoPE
Linear distance metric (not learned)

Used in: BLOOM, MPT
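A small sketch of the bias described above. The function names are hypothetical, and the slope schedule 2^(-8(h+1)/H) for head h = 0, ..., H-1 is the geometric sequence commonly used when the number of heads H is a power of two.

```python
def alibi_slopes(num_heads):
    # Geometric sequence of slopes, one per head (h is zero-indexed).
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * |i - j|, added to the attention logits before softmax.
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
print(slopes)  # 1/2, 1/4, ..., 1/256 for 8 heads

bias = alibi_bias(seq_len=4, slope=slopes[0])
for row in bias:
    print(row)
```

Heads with steep slopes focus on nearby tokens, while heads with shallow slopes can still look far back.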

Quick check

Which positional encoding method was specifically designed to extrapolate to longer sequences than training?

Interactive playground

Compare encoding methods side-by-side. Test with different sequence lengths, embedding dimensions, and positions. Watch how similarity and distance change.

Positional Encoding Playground

Compare encoding methods, test extrapolation, explore properties.

Controls: sequence length (16), embedding dimension (64), encoding method (sinusoidal, RoPE, or ALiBi), and two test positions (position 1 = 2, position 2 = 8).

Readouts for these settings: cosine similarity 0.736 (1 = identical, 0 = orthogonal, −1 = opposite), Euclidean distance 4.109 (how far apart in embedding space), position distance 6 (tokens apart).

[Interactive charts: encodings for position 2 and position 8, each showing the first 16 of 64 dimensions.]

What to observe

• Similarity by distance: Increase the distance between pos1 and pos2. Similarity should generally decrease, though not strictly monotonically, because the encodings are periodic
• Dimension size: Higher dimensions = more expressiveness but potentially slower convergence
• Long sequences: Try seq length > 512. A learned table has no entry beyond its maximum length, while the sinusoidal and RoPE formulas stay defined at any position
• Method comparison: Switch between methods to compare how they encode position information
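The similarity and distance readouts can be approximated offline. The sketch below assumes the sinusoidal method with base 10000 and the playground's default settings (dimension 64, positions 2 and 8); the widget's exact implementation is not shown on this page, so the numbers may differ slightly.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    # sin on even indices, cos on odd indices, as in the sinusoidal formula.
    pe = [0.0] * d_model
    for even_idx in range(0, d_model, 2):
        angle = pos / (base ** (even_idx / d_model))
        pe[even_idx] = math.sin(angle)
        if even_idx + 1 < d_model:
            pe[even_idx + 1] = math.cos(angle)
    return pe

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

pe_a = sinusoidal_pe(2, 64)
pe_b = sinusoidal_pe(8, 64)
cos_sim = cosine_similarity(pe_a, pe_b)
dist = euclidean_distance(pe_a, pe_b)
print(f"cosine similarity: {cos_sim:.3f}")
print(f"euclidean distance: {dist:.3f}")
print(f"position distance: {abs(8 - 2)}")
```

For sinusoidal encodings every vector has the same norm, so the two readouts carry the same information: the squared distance equals d × (1 − cosine similarity).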

Quick check

If you increase the sequence length in the playground with RoPE encoding, what happens to the cosine similarity between distant positions?

Deep dives

Absolute: The encoding itself represents position. PE(pos) is different for every position. Example: learned embeddings, sinusoidal encodings.

Relative: What matters is the distance between two positions rather than their absolute indices. The model learns patterns like "a word 3 positions away typically means...".

Most modern methods (RoPE, ALiBi) encode relative positions directly. This helps them handle longer sequences: relative structure learned at one place in the sequence carries over to other places.

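In sinusoidal encoding (whose frequency schedule RoPE reuses), the base 10000 is somewhat arbitrary. The original transformer paper uses it without a detailed derivation.

The idea: with base 10000 and model dimension d=512, the wavelength of the dimension pair at index 2i is 2π × 10000^(2i/d). The shortest wavelength is 2π × 10000^(0/512) = 2π ≈ 6.28 positions (a full cycle roughly every six tokens), and the longest is 2π × 10000^(510/512) ≈ 2π × 9647 ≈ 60,600 positions (only a small fraction of a cycle across a typical sequence).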

This creates a nice spread of frequencies: some dimensions capture ultra-long-range patterns, others capture fine-grained local patterns. The 10000 value works well in practice, though other bases are possible.
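The spread of wavelengths can be checked numerically. This is a small sketch; wavelength is a hypothetical helper, and pair_index plays the role of i in the sinusoidal formula.

```python
import math

def wavelength(pair_index, d_model, base=10000.0):
    # Number of positions for one full cycle of the sin/cos pair at index 2i.
    return 2 * math.pi * base ** (2 * pair_index / d_model)

d_model = 512
shortest = wavelength(0, d_model)                  # fastest-changing pair
longest = wavelength(d_model // 2 - 1, d_model)    # slowest-changing pair

print(f"shortest wavelength: {shortest:.2f} positions")
print(f"longest wavelength:  {longest:.0f} positions")
```

A larger base stretches the long end of this range, which is one reason other bases are sometimes used for long-context models.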

The problem: If you train a model on sequences ≤ 2048 tokens, can it handle 4096-token sequences?

Learned embeddings: No. They have no embedding vector for position 4096.

Sinusoidal/RoPE/ALiBi: Yes, in the sense that the formulas work for any position. In practice, attention patterns learned during training might not generalize perfectly to positions the model has never seen, but these methods give you a fighting chance.

Techniques like position interpolation (rescaling position indices so they stay inside the trained range) push this further: LLaMA models trained on 2048 tokens have been extended to contexts of up to 32k with a short fine-tuning run and little accuracy loss.

Position interpolation came out of work on extending the context window of LLaMA models: it rescales the positions fed to RoPE to compress the position space, allowing the model to handle longer sequences. Code Llama takes a related route and changes the RoPE base frequency instead.

Basic idea: If trained on positions 0–2048, interpolate new positions into that range. Instead of using raw position m, use m × (training_length / new_length).

This tricks the model into thinking it's still within its training distribution, while still encoding the actual position implicitly. Remarkably effective.
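The rescaling rule can be sketched in a few lines. The function names are hypothetical, and the example assumes RoPE angles of the form pos / base^(i/d); real implementations usually combine this rescaling with a short fine-tuning run.

```python
def interpolated_position(m, train_len, new_len):
    # Map raw position m in [0, new_len) into the trained range [0, train_len).
    if new_len <= train_len:
        return float(m)  # already inside the trained range, no rescaling
    return m * (train_len / new_len)

def rope_angles(pos, d, base=10000.0):
    # One rotation angle per dimension pair; pos may be fractional.
    return [pos / (base ** (i / d)) for i in range(0, d, 2)]

train_len, new_len = 2048, 8192
raw_position = 8191
scaled = interpolated_position(raw_position, train_len, new_len)
print(scaled)                      # falls inside the trained range
print(rope_angles(scaled, d=8))    # angles the model has effectively seen
```

Neighboring tokens end up closer together in position space (here 0.25 apart instead of 1), which is the resolution the model gives up in exchange for the longer range.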

In-context learning (ICL) is when a model learns from examples in its context window. Position encoding is crucial: examples early in the context should influence the model differently than examples right before the actual task.

Different position encodings have different implicit biases. Some methods (like ALiBi) inherently favor recent tokens. Others (like RoPE) are more symmetric. This affects how well the model can utilize examples.

Research in this area suggests that the choice of position encoding can noticeably affect ICL performance.
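The recency bias of ALiBi can be made concrete with a toy case. The sketch assumes causal attention, a single head with slope 0.5, and content scores that are all equal (set to zero), so the only thing shaping the weights is the distance penalty. The function name is hypothetical.

```python
import math

def alibi_attention_weights(query_pos, slope):
    # Causal attention from query_pos back to positions 0..query_pos,
    # with all content scores equal so only the ALiBi bias matters.
    scores = [-slope * (query_pos - j) for j in range(query_pos + 1)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = alibi_attention_weights(query_pos=7, slope=0.5)
for j, w in enumerate(weights):
    print(f"key position {j}: weight {w:.3f}")
```

The weights grow steadily toward the most recent token, so an in-context example far back in the window starts at a disadvantage unless its content score makes up the difference.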

Quick check

How does position interpolation allow LLaMA to handle sequences much longer than its training length?

Best practices and takeaways

For new projects

Default to RoPE: It's the current standard in open LLMs. Good generalization, elegant, used in LLaMA, Falcon, PaLM.
Consider ALiBi if: You want the simplest implementation or specifically need to support very long sequences without fine-tuning.
Avoid learned embeddings if: You might encounter sequences longer than the training maximum. They have no vector for unseen positions.

Implementation tips

RoPE: Apply rotations to Q and K after projection, before computing attention scores. Ensure the head dimension is even; handle odd dimensions by padding or special casing.
ALiBi: Compute the bias matrix once, add it to attention logits before softmax. Different slopes per head: 2^(-(h+1)×8/H) for head h = 0, ..., H−1 (assuming H is a power of two).
Sinusoidal: Pre-compute PE matrix, add to embeddings before transformer layers. Cache it if possible for speed.

Key properties to remember

✓ Position encoding must be deterministic (no randomness) so inference is reproducible
✓ Relative position info should be learnable or at least captured in the representation
✓ The method should generalize to longer sequences (ideally without retraining)
✓ Implementation should be efficient (parallelizable, no sequential loops)

Quick check

If a transformer model uses RoPE and is trained on 4k-token sequences, why can it often work on 8k-token sequences without retraining?

Positional Encoding: How transformers know where they are
