KV Cache & Inference — why LLMs generate fast
Generating text from a billion-parameter model feels instant on your laptop. The secret: KV cache. This lesson reveals how reusing cached keys and values transforms inference from quadratic to near-linear, and how production systems leverage it to serve hundreds of users concurrently.
Autoregressive generation is expensive
Every large language model generates text token by token. For each new token:
- 1.Run a forward pass through the entire 7B-to-70B parameter model
- 2.Compute attention: each token looks at all previous tokens
- 3.At token 500, you must recompute Key and Value for tokens 0-499, then do it all again at token 501
The bottleneck
Recomputing K and V for the same tokens over and over. Without caching, this scales as O(n²) per forward pass. At sequence length 4096, you're doing 16 million redundant attention operations per new token.
Current token
Computing query, key, value projections...
KV Cache
Cache contents
Computation breakdown
What's happening
Cache hit: We're reusing the 0 cached K/V pairs from previous tokens. Only the new token's query needs to be computed and matched against all cached keys. This scales linearly, not quadratically.
Summary at step 1
Sequence length
1
Cache size
0.0 MB
Speed gain
0%
Quick check
Why does recomputing K and V for every token get more expensive as the sequence grows?
The full attention matrix has shape (seq_len, seq_len) — quadratic space! For a 4K context, that's 16 million entries. You'd still run out of memory.
K and V are much smaller: (seq_len, d_model). For a 4K context and 768-dimensional embeddings, that's just 4K × 768 = 3M entries. Linear in sequence length, not quadratic.
The trick: attention is decomposable. When a new token arrives, you don't recompute attention from scratch. You compute its query Q, match it against cached K, and blend cached V. The old attention scores (between old tokens) don't need to change.
Step-by-step walkthrough
Follow the 6-step guided lesson to understand how KV cache works, from autoregressive generation to speculative decoding to real-world batching:
Step 1 of 6
Autoregressive generation
How LLMs generate text one token at a time
Large language models don't generate entire outputs at once. Instead, they use autoregressive decoding: predict one token, feed it back as input, predict the next token, and repeat.
Each step requires a forward pass through the entire model, including the attention layers. For a model with billions of parameters, even one forward pass is expensive.
The problem: as the sequence grows, the cost doesn't just scale linearly. Attention is O(n²) because each new token must attend to all previous tokens.
Key points
- •Output at step N depends on tokens 0...N-1
- •Must recompute K and V for all past tokens each step
- •Quadratic scaling is costly at long sequences
Watch token generation unfold. Observe how each step requires processing all previous tokens.
Current token
Computing query, key, value projections...
KV Cache
Cache contents
Computation breakdown
What's happening
Cache hit: We're reusing the 0 cached K/V pairs from previous tokens. Only the new token's query needs to be computed and matched against all cached keys. This scales linearly, not quadratically.
Summary at step 1
Sequence length
1
Cache size
0.0 MB
Speed gain
0%
Quick check
In the KV cache, what gets stored and reused?
The memory-speed tradeoff: always worth it
KV cache trades GPU memory (linear in sequence) for compute time (quadratic reduction). See the dramatic memory savings at longer sequences:
Memory savings at key lengths
Seq length: 17
0% saved
102.0KB → 102.0KB
Seq length: 33
0% saved
198.0KB → 198.0KB
Seq length: 49
0% saved
294.0KB → 294.0KB
Why cache helps
Without the KV cache, memory grows quadratically: you must store the full attention matrix for each token. With caching, you only store K and V vectors (linear growth). At sequence length 64, you save about 0% of memory, and the difference gets even more dramatic at longer sequences. This is why KV cache is essential for long-context models.
Without KV Cache
- Memory: Attention matrix per token (quadratic)
- Compute: O(n²) per new token
- Result: Infeasible for long sequences
With KV Cache
- Memory: K, V vectors (linear)
- Compute: O(n) per new token
- Result: Fast inference at 4K+ context
Quick check
At sequence length 4096 with embedding size 768, roughly how much GPU memory does the KV cache use (per batch)?
vLLM and TGI use paged attention: the KV cache is split into fixed-size blocks (pages), like virtual memory in an OS. This enables:
- •Memory efficiency: Free pages can be reused across different requests without fragmentation
- •Sharing: Multiple requests can share cached prefixes (e.g., a system prompt)
- •Preemption: Low-priority requests can have their cache paged out temporarily
This is a major reason production inference engines are so much faster than naive implementations.
Going further: speculative decoding
Even with KV cache, generating one token at a time is slow. Modern inference engines use speculative decoding: a small draft model proposes multiple candidates, and the main model verifies them in parallel using KV cache. This can yield 1.5-3× speedups.
Smaller = faster draft but fewer accepted tokens
How many draft candidates the main model accepts
Draft Phase
Draft model (tiny — ~5% of main model) rapidly generates 4 candidate tokens:
candidate_0
candidate_1
candidate_2
candidate_3
Verification Phase
Main model evaluates all 4 candidates in parallel with KV cache to reject/accept:
Theoretical speedup
Without speculation
4×
Need 4 separate main model forward passes
With speculation
3.8×
1 draft + 1 main pass (parallel)
Overall speedup: 0.95×
Actual speedup depends on how many candidates are accepted. More accepted = closer to 3.8× speedup. Fewer accepted = more wasted draft passes.
How it works
- 1.Draft: Small, fast model generates 4 candidate next tokens using KV cache
- 2.Verify: Large main model processes all candidates in a single batched forward pass with KV cache
- 3.Accept: Keep candidates the main model agrees with; reroll rejected ones
The KV cache is crucial here — it lets the main model verify multiple candidates with minimal overhead.
Draft Model
Small, fast (1-2B params). Generates candidate tokens in milliseconds. Quality is secondary.
Batch Verify
All candidates processed in one forward pass with KV cache. Rejects candidates main model disagrees with.
Speedup
Typically 1.5-3×, depending on draft quality and main model speed.
Quick check
Why does speculative decoding require KV cache to work well?
If the main model disagrees with the draft model, you fall back: sample one token from the main model's distribution and continue. This is rare when the draft model is well-tuned.
There's a tradeoff: larger draft models accept more candidates but are slower. Smaller draft models are fast but waste compute on rejected candidates. In practice, a 7B draft + 70B main is common.
Importantly, speculative decoding preserves the main model's distribution: you always get the same output as if you ran the main model autoregressively, just faster.
Real-world inference: batching and serving
Production systems don't serve one user at a time. They batch hundreds of requests and share compute. KV cache enables this efficiently:
Per-request cache
Each user's request has its own K/V cache. No cache interference between users.
Batched forward pass
One forward pass processes multiple users' queries simultaneously. Amortizes model overhead.
Continuous batching
Requests enter/leave the batch as they arrive. No waiting for a full batch to form.
Example: 32-batch serving 7B model
Quick check
Why is continuous batching better than waiting for a full batch in production?
Standard implementation: each request gets a fixed-size cache buffer. If one request is shorter, that buffer space is wasted. Paged cache divides memory into blocks (e.g., 16-token pages).
Benefits:
- •No waste: Only allocate pages a request actually needs
- •Prefix sharing: Multiple requests with the same prefix (system prompt) share cache pages
- •Preemption: Low-priority requests can release pages to high-priority ones
vLLM's paged attention is a major reason it's 10-20× faster than naive implementations.
Interactive playground
Adjust parameters and see real-world throughput, latency, and memory estimates. Understand how different configurations affect inference performance.
Key takeaway
KV cache transforms LLM inference from quadratic to near-linear by caching and reusing key and value vectors. This single optimization enables fast generation, long-context processing, and high-throughput serving.
Understanding KV cache is essential for reasoning about inference performance, memory constraints, and system design. Every production LLM deployment relies on it. Next, explore vLLM, Text Generation Inference, or llama.cpp source code to see how real systems implement this at scale.