Vamsi Krishna Sankarayogi — Technologist at Heart

Autoregressive generation is expensive

Every large language model generates text token by token. For each new token:

1.Run a forward pass through the entire 7B-to-70B parameter model
2.Compute attention: each token looks at all previous tokens
3.At token 500, you must recompute Key and Value for tokens 0-499, then do it all again at token 501

The bottleneck

Recomputing K and V for the same tokens over and over. Without caching, this scales as O(n²) per forward pass. At sequence length 4096, you're doing 16 million redundant attention operations per new token.

Generation Step: 1 / 8

Use KV Cache

Current token

token_1

Computing query, key, value projections...

KV Cache

Cached tokens:0

Memory:0.00 MB

Cache contents

t1

...

Computation breakdown

K/V reuse (cache hits):0

New K/V (recomputed):1

Relative compute:1.0x

What's happening

Cache hit: We're reusing the 0 cached K/V pairs from previous tokens. Only the new token's query needs to be computed and matched against all cached keys. This scales linearly, not quadratically.

Summary at step 1

Sequence length

1

Cache size

0.0 MB

Speed gain

0%

?

Quick check

Why does recomputing K and V for every token get more expensive as the sequence grows?

The full attention matrix has shape (seq_len, seq_len) — quadratic space! For a 4K context, that's 16 million entries. You'd still run out of memory.

K and V are much smaller: (seq_len, d_model). For a 4K context and 768-dimensional embeddings, that's just 4K × 768 = 3M entries. Linear in sequence length, not quadratic.

The trick: attention is decomposable. When a new token arrives, you don't recompute attention from scratch. You compute its query Q, match it against cached K, and blend cached V. The old attention scores (between old tokens) don't need to change.

Step-by-step walkthrough

Follow the 6-step guided lesson to understand how KV cache works, from autoregressive generation to speculative decoding to real-world batching:

Step 1 of 6

Autoregressive generation

How LLMs generate text one token at a time

Large language models don't generate entire outputs at once. Instead, they use autoregressive decoding: predict one token, feed it back as input, predict the next token, and repeat.

Each step requires a forward pass through the entire model, including the attention layers. For a model with billions of parameters, even one forward pass is expensive.

The problem: as the sequence grows, the cost doesn't just scale linearly. Attention is O(n²) because each new token must attend to all previous tokens.

Key points

•Output at step N depends on tokens 0...N-1
•Must recompute K and V for all past tokens each step
•Quadratic scaling is costly at long sequences

Watch token generation unfold. Observe how each step requires processing all previous tokens.

Generation Step: 1 / 8

Use KV Cache

Current token

token_1

Computing query, key, value projections...

KV Cache

Cached tokens:0

Memory:0.00 MB

Cache contents

t1

...

Computation breakdown

K/V reuse (cache hits):0

New K/V (recomputed):1

Relative compute:1.0x

What's happening

Cache hit: We're reusing the 0 cached K/V pairs from previous tokens. Only the new token's query needs to be computed and matched against all cached keys. This scales linearly, not quadratically.

Summary at step 1

Sequence length

1

Cache size

0.0 MB

Speed gain

0%

Step 1 / 6

?

Quick check

In the KV cache, what gets stored and reused?

The memory-speed tradeoff: always worth it

KV cache trades GPU memory (linear in sequence) for compute time (quadratic reduction). See the dramatic memory savings at longer sequences:

Without KV Cache

With KV Cache

Memory savings at key lengths

Seq length: 17

0% saved

102.0KB → 102.0KB

Seq length: 33

0% saved

198.0KB → 198.0KB

Seq length: 49

0% saved

294.0KB → 294.0KB

Why cache helps

Without the KV cache, memory grows quadratically: you must store the full attention matrix for each token. With caching, you only store K and V vectors (linear growth). At sequence length 64, you save about 0% of memory, and the difference gets even more dramatic at longer sequences. This is why KV cache is essential for long-context models.

Without KV Cache

Memory: Attention matrix per token (quadratic)
Compute: O(n²) per new token
Result: Infeasible for long sequences

With KV Cache

Memory: K, V vectors (linear)
Compute: O(n) per new token
Result: Fast inference at 4K+ context

?

Quick check

At sequence length 4096 with embedding size 768, roughly how much GPU memory does the KV cache use (per batch)?

vLLM and TGI use paged attention: the KV cache is split into fixed-size blocks (pages), like virtual memory in an OS. This enables:

•Memory efficiency: Free pages can be reused across different requests without fragmentation
•Sharing: Multiple requests can share cached prefixes (e.g., a system prompt)
•Preemption: Low-priority requests can have their cache paged out temporarily

This is a major reason production inference engines are so much faster than naive implementations.

Going further: speculative decoding

Even with KV cache, generating one token at a time is slow. Modern inference engines use speculative decoding: a small draft model proposes multiple candidates, and the main model verifies them in parallel using KV cache. This can yield 1.5-3× speedups.

Draft Model Size

Smaller = faster draft but fewer accepted tokens

Number of candidates: 4

Acceptance rate: 70%

How many draft candidates the main model accepts

1

Draft Phase

Draft model (tiny — ~5% of main model) rapidly generates 4 candidate tokens:

c0

candidate_0

100%

c1

candidate_1

85%

c2

candidate_2

70%

c3

candidate_3

55%

2

Verification Phase

Main model evaluates all 4 candidates in parallel with KV cache to reject/accept:

Theoretical speedup

Without speculation

4×

Need 4 separate main model forward passes

With speculation

3.8×

1 draft + 1 main pass (parallel)

Overall speedup: 0.95×

Actual speedup depends on how many candidates are accepted. More accepted = closer to 3.8× speedup. Fewer accepted = more wasted draft passes.

How it works

1.Draft: Small, fast model generates 4 candidate next tokens using KV cache
2.Verify: Large main model processes all candidates in a single batched forward pass with KV cache
3.Accept: Keep candidates the main model agrees with; reroll rejected ones

The KV cache is crucial here — it lets the main model verify multiple candidates with minimal overhead.

Draft Model

Small, fast (1-2B params). Generates candidate tokens in milliseconds. Quality is secondary.

Batch Verify

All candidates processed in one forward pass with KV cache. Rejects candidates main model disagrees with.

Speedup

Typically 1.5-3×, depending on draft quality and main model speed.

?

Quick check

Why does speculative decoding require KV cache to work well?

If the main model disagrees with the draft model, you fall back: sample one token from the main model's distribution and continue. This is rare when the draft model is well-tuned.

There's a tradeoff: larger draft models accept more candidates but are slower. Smaller draft models are fast but waste compute on rejected candidates. In practice, a 7B draft + 70B main is common.

Importantly, speculative decoding preserves the main model's distribution: you always get the same output as if you ran the main model autoregressively, just faster.

Real-world inference: batching and serving

Production systems don't serve one user at a time. They batch hundreds of requests and share compute. KV cache enables this efficiently:

Per-request cache

Each user's request has its own K/V cache. No cache interference between users.

Batched forward pass

One forward pass processes multiple users' queries simultaneously. Amortizes model overhead.

Continuous batching

Requests enter/leave the batch as they arrive. No waiting for a full batch to form.

Example: 32-batch serving 7B model

Throughput (KV cache):~3000 tokens/sec

Per-token latency:~10ms

GPU memory:~20GB (model + batch KV cache)

Requests served/sec:~60 (batch size 32 × 2 Hz)

?

Quick check

Why is continuous batching better than waiting for a full batch in production?

Standard implementation: each request gets a fixed-size cache buffer. If one request is shorter, that buffer space is wasted. Paged cache divides memory into blocks (e.g., 16-token pages).

Benefits:

•No waste: Only allocate pages a request actually needs
•Prefix sharing: Multiple requests with the same prefix (system prompt) share cache pages
•Preemption: Low-priority requests can release pages to high-priority ones

vLLM's paged attention is a major reason it's 10-20× faster than naive implementations.

Interactive playground

Adjust parameters and see real-world throughput, latency, and memory estimates. Understand how different configurations affect inference performance.

Key takeaway

KV cache transforms LLM inference from quadratic to near-linear by caching and reusing key and value vectors. This single optimization enables fast generation, long-context processing, and high-throughput serving.

Understanding KV cache is essential for reasoning about inference performance, memory constraints, and system design. Every production LLM deployment relies on it. Next, explore vLLM, Text Generation Inference, or llama.cpp source code to see how real systems implement this at scale.

KV Cache & Inference — why LLMs generate fast

Autoregressive generation is expensive

Step-by-step walkthrough

Autoregressive generation

The memory-speed tradeoff: always worth it

Going further: speculative decoding

Real-world inference: batching and serving

Interactive playground

Finished this lesson?