Mixture of Experts — scaling models through sparse routing
Most large language models use dense forward passes: every token touches every parameter. Mixture of Experts (MoE) changes this with a router network that sends each token to only a few specialized experts. This unlocks massive scale—GPT-3–sized capacity with far less compute. Learn how routing, gating, and load balancing work together to make it practical.
Why Mixture of Experts?
Modern language models are massive. GPT-3 has 175 billion parameters, and in a dense model every forward pass uses all of them for every token. During training, the backward pass then computes gradients for all of them as well. This is why training costs millions of dollars.
Mixture of Experts sidesteps this. Instead of routing every token through every parameter, we add a router network that learns to send each token to only a few specialized experts. A token might go to the "code expert" and "reasoning expert", skipping the "poetry expert."
The result: models can scale toward 1 trillion parameters while keeping compute costs reasonable. Google's Switch Transformer uses this approach and reports up to 7x faster pre-training than a dense T5 baseline with the same compute per token.
The tradeoff
Dense models: every parameter is active for every token, so compute is predictable but grows in step with model size. MoE: only a small fraction of parameters activates per token, but routing adds complexity and requires careful load balancing to avoid bottlenecks.
Dense: all parameters active (100% parameter utilization)
MoE: only selected experts active (12.5% parameter utilization with 2 of 16 experts)
The compute savings depend on the ratio of experts to top-k selections. With 128 experts and top-k=2, each token activates only 2/128 ≈ 1.6% of the expert parameters. That's a potential 64x reduction in expert compute.
In practice, it's less. Routing adds overhead, synchronization costs increase, and not all experts are equally loaded. Even with these costs, the gains can still be several-fold.
Switch Transformer, for example, reports reaching the same quality up to 7x faster in pre-training than a dense T5 baseline with the same FLOPs per token.
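The ratio arithmetic above is easy to check yourself. The sketch below uses two hypothetical helper names (`active_expert_fraction`, `ideal_expert_speedup`) and counts only expert parameters. It ignores attention layers, routing overhead, and communication, so treat the speedup as an upper bound rather than a measured result.

```python
def active_expert_fraction(num_experts: int, top_k: int) -> float:
    """Fraction of expert parameters that a single token activates."""
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    return top_k / num_experts


def ideal_expert_speedup(num_experts: int, top_k: int) -> float:
    """Upper bound on expert-layer FLOP savings (no overhead modeled)."""
    return num_experts / top_k


# Illustrative configurations only.
for n, k in [(8, 2), (16, 2), (128, 2)]:
    frac = active_expert_fraction(n, k)
    bound = ideal_expert_speedup(n, k)
    print(f"{n} experts, top-{k}: {frac:.2%} active, at most {bound:.0f}x fewer expert FLOPs")
```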
Quick check
What's the main advantage of Mixture of Experts over dense models?
The router: How tokens get dispatched
The core of MoE is the router network—a small learned neural network that decides which experts receive each token. Let's see how it works.
Interactive figure: Router: Token → Expert Selection. Stages shown: input token, gating network scores, softmax probabilities, top-2 selection (dispatch to experts). With 2 of 8 experts activated, sparsity is 75%.
How it works: The router network computes scores for each expert. These scores go through softmax to become probabilities. The top-2 experts receive the token. This sparse routing saves computation by only activating 25% of the experts.
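Raw scores can be negative, vary wildly, and don't sum to 1. Softmax normalizes them to probabilities: all in [0, 1] and summing to 1. Softmax preserves the ranking of the raw scores, so top-k selection picks the same experts either way. What the probabilities add is a meaningful gate weight for each selected expert when their outputs are combined.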
Softmax is differentiable everywhere, so gradients flow back to the router weights during training. This lets the router learn which features of the input should activate which experts.
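Here is a minimal sketch of the scores → softmax → top-k pipeline in plain Python. The scores are made-up values for 8 experts (in a real router they would come from a learned projection of the token), and renormalizing the selected gates so they sum to 1 is one common design choice, not the only one.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(scores, k=2):
    """Return the k highest-probability experts and their gate weights."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    # Assumption: renormalize so the chosen experts' gates sum to 1.
    norm = sum(probs[i] for i in chosen)
    gates = {i: probs[i] / norm for i in chosen}
    return chosen, gates


# Hypothetical gating scores for one token and 8 experts.
scores = [1.2, -0.3, 0.5, 2.1, -1.0, 0.0, 0.9, -0.7]
experts, gates = route_top_k(scores, k=2)
print("selected experts:", experts)
print("gate weights:", gates)
```

Only the selected experts run. The gate weights are what you would use to mix their outputs.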
Quick check
In the router, what does top-k selection mean?
Expert load: Distribution and balance
As tokens flow through the MoE layer, some experts get more traffic than others. This creates a subtle but critical problem: load imbalance can destroy efficiency.
Interactive figure: Expert Grid, token flow and load distribution across experts E0 to E7. It tracks tokens per expert over a 16-token routing sequence and reports the most loaded expert, the least loaded expert, and the balance ratio.
What you're seeing: Each token (left) is routed to 2 experts (right). The bars below show the cumulative load on each expert. Notice how some experts get more traffic than others—this is why load balancing is crucial to avoid bottlenecks.
If the router routes 90% of tokens to 2 experts and 10% to the other 6, you have a bottleneck. Those 2 overloaded experts can't process all tokens efficiently. You hit memory bandwidth limits, context switching overhead, and synchronization delays.
Meanwhile, the 6 underutilized experts sit idle. You paid for their parameters but aren't using them. This defeats the purpose of sparse routing.
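To make imbalance concrete, you can count how many tokens land on each expert. This is a small sketch with made-up top-2 assignments. The `balance_ratio` definition (most loaded divided by least loaded) is an assumption for illustration, not a standard metric.

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Count tokens per expert from per-token tuples of expert indices."""
    counts = Counter(e for experts in assignments for e in experts)
    return [counts.get(e, 0) for e in range(num_experts)]


def balance_ratio(load):
    """Most-loaded / least-loaded expert; infinite if any expert is idle."""
    lo = min(load)
    return float("inf") if lo == 0 else max(load) / lo


# Hypothetical routing of 6 tokens to 2 of 4 experts each.
assignments = [(0, 1), (0, 1), (0, 2), (0, 1), (3, 1), (0, 1)]
load = expert_load(assignments, num_experts=4)
print("tokens per expert:", load)
print("balance ratio:", balance_ratio(load))
```

A ratio of 1.0 means perfectly even load. The larger it gets, the more the busiest expert becomes the bottleneck.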
The solution: auxiliary loss. During training, we penalize the router for producing imbalanced routings. This incentivizes the router to spread load evenly, ensuring all experts pull their weight.
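One common formulation is the Switch Transformer-style loss, N · Σ f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The sketch below assumes top-1 routing and uses plain Python lists. The function name and the `lam = 0.01` weight are illustrative, and a real implementation would run inside an autodiff framework so that gradients reach the router.

```python
def load_balance_loss(router_probs, top1_expert, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    router_probs: one list of per-expert probabilities per token.
    top1_expert:  index of the expert each token was dispatched to.
    """
    num_tokens = len(router_probs)
    # f[i]: fraction of tokens actually sent to expert i.
    f = [0.0] * num_experts
    for e in top1_expert:
        f[e] += 1.0 / num_tokens
    # p[i]: average probability the router assigned to expert i.
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Perfectly balanced routing over 4 experts.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)
# Collapsed routing: every token goes to expert 0 with probability 1.
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=4)

lam = 0.01        # weight of the balance term (illustrative)
loss_task = 2.0   # placeholder task loss, not a real measurement
loss_total = loss_task + lam * collapsed
print(balanced, collapsed, loss_total)
```

The loss is smallest when routing is uniform and grows as tokens pile onto a few experts, which is exactly the pressure you want on the router.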
Quick check
What's the load balancing auxiliary loss designed to do?
Expert capacity: The hard constraint
Each expert can only process so many tokens before running out of memory. We call this the expert capacity. When capacity is exceeded, we have to drop or reroute excess tokens.
Capacity is computed as:
capacity = ceil((total_tokens / num_experts) * capacity_factor)
The capacity_factor (usually 1.25) adds headroom for imbalance. If you set it to 1.0, each expert gets exactly its fair share, but if load is imbalanced, you'll overflow frequently.
A higher capacity_factor (e.g., 1.5 or 2.0) prevents overflow but wastes compute and memory on expert slots that stay empty.
Example calculation
With capacity_factor = 1.25 and a fair share of total_tokens / num_experts = 32 tokens per expert:
capacity = ceil(32 * 1.25) = 40
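The capacity formula translates directly into a small helper. In this sketch, 256 tokens and 8 experts are illustrative values chosen so that the fair share is 32 tokens per expert. The function name is hypothetical.

```python
import math


def expert_capacity(total_tokens, num_experts, capacity_factor=1.25):
    """Max tokens one expert may accept: ceil(fair_share * capacity_factor)."""
    fair_share = total_tokens / num_experts
    return math.ceil(fair_share * capacity_factor)


# Illustrative batch: 256 tokens spread over 8 experts.
for factor in (1.0, 1.25, 2.0):
    cap = expert_capacity(256, 8, factor)
    print(f"capacity_factor={factor}: capacity={cap}")
```

Note that the result is rounded up, so a fractional fair share never leaves an expert with less room than the formula promises.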
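When an expert receives more tokens than its capacity, the excess tokens overflow. In practice, you either:
1. Drop them (the token skips the expert and passes through the layer's residual connection unchanged)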
2. Reroute them to the next-best expert (might cascade)
3. Increase capacity_factor dynamically during training
Most systems drop overflowing tokens. This is simple, but a dropped token gets no expert computation in that layer, so the router is trained (via the auxiliary loss) to keep overflow rare. A smarter approach is expert adaptive load balancing, which adjusts capacity dynamically based on observed imbalance.
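The drop strategy can be sketched as a first-come, first-served dispatch. This assumes top-1 routing and processes tokens in sequence order. The function name and the assignments are made up for illustration, and real systems do this with batched tensor operations rather than a Python loop.

```python
def dispatch_with_capacity(top1_expert, num_experts, capacity):
    """Assign tokens to experts in order; overflow tokens are dropped."""
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token_idx, e in enumerate(top1_expert):
        if len(buckets[e]) < capacity:
            buckets[e].append(token_idx)
        else:
            # Expert e is full: this token skips the expert layer.
            dropped.append(token_idx)
    return buckets, dropped


# 8 tokens, 4 experts, capacity_factor 1.25 -> capacity = ceil(2 * 1.25) = 3.
assignments = [0, 0, 0, 1, 0, 2, 0, 1]
buckets, dropped = dispatch_with_capacity(assignments, num_experts=4, capacity=3)
print("tokens per expert:", buckets)
print("dropped tokens:", dropped)
```

Because the order decides who gets dropped, tokens late in the sequence are the ones that lose out when an expert is popular.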
Quick check
If you have 100 tokens and 4 experts with capacity_factor=1.5, what's the capacity per expert?
Step-by-step: From dense to sparse
Dense models: Every token sees every parameter
Traditional neural networks process all tokens through all layers.
Dense computation: Full matrix multiply (nonlinearities omitted)
output = token @ W_1 @ W_2 @ ... @ W_n
In a dense model such as a standard Transformer:
- Every token attends to every other token (attention)
- Every token passes through every feed-forward layer
- All parameters are always active
This is compute-intensive but straightforward. All tokens get the same capacity and compute budget.
Key takeaway
MoE = sparse routing + learned specialization + load balancing. Each token only activates a small subset of experts, scaling model capacity without proportional compute increase.
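Putting the pieces together, here is a toy MoE layer for a single token in plain Python. Everything in it is illustrative: the router weights are hand-picked, the "experts" are simple scaling functions standing in for feed-forward networks, and the gates are renormalized over the selected experts (one design choice among several).

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_scaling_expert(scale):
    """Toy stand-in for an expert FFN: multiplies the input by a constant."""
    def expert(x):
        return [scale * v for v in x]
    return expert


def moe_forward(x, router_weights, experts, k=2):
    """Route one token vector x to its top-k experts and mix their outputs."""
    # Router score per expert: dot product of x with that expert's weights.
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in router_weights]
    probs = softmax(scores)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:  # only the selected experts are evaluated
        gate = probs[i] / norm
        y = experts[i](x)
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out, top


experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
out, top = moe_forward([2.0, 1.0], router_weights, experts, k=2)
print("selected experts:", top)
print("output:", out)
```

Experts 2 and 3 are never called for this token, which is where the compute savings come from.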
Playground: Build your own MoE configuration
Use the sliders below to design an MoE architecture. See how each parameter affects throughput, sparsity, and balance. Try the presets to understand the tradeoffs.
MoE Playground: Configure & experiment
Configurable parameters:
- Number of experts: more experts = more specialization, but higher routing overhead
- Top-k: higher k = more compute per token, but more capacity
- Capacity factor: higher = more tolerance for load imbalance, but wasted compute
- Batch size: larger batches amortize routing overhead
Reported metrics:
- Capacity per expert: tokens each expert can process
- Sparsity: experts inactive per token
- Est. throughput: tokens/sec (relative)
- Balance metric
Insights (for a configuration of 8 experts, top-k=2, capacity_factor=1.25):
- Increasing experts scales model size (8x) without full compute increase
- Top-k=2 means each token uses 25% of experts
- Capacity_factor=1.25 provides 25% headroom for load imbalance
- A balance metric of 0.000 indicates good load distribution
Real-world MoE systems
Switch Transformer
Experts: up to 2,048
Top-k: 1
Up to 7x pre-training speedup over a dense T5 baseline
Massive scale with top-k=1 (extreme sparsity)
Key takeaways
Sparse routing is the core
A learned router sends each token to only top-k experts, activating ~1–10% of parameters per token.
Load balancing is critical
Without an auxiliary loss, routers collapse: all tokens go to the same few experts, creating bottlenecks.
Capacity prevents overflow
Each expert has a fixed capacity. Overflow tokens are dropped or rerouted. Capacity_factor tunes the tradeoff.
Scaling is the payoff
MoE enables trillion-parameter models whose compute per token is comparable to that of a much smaller dense model.
Communication costs matter
Distributing tokens across experts requires efficient all-to-all communication. This can be 20% of total cost.
Training is harder
Top-k routing is a discrete choice, so it is not differentiable on its own. Techniques like straight-through estimators help.
Final knowledge checks
Quick check
Which is NOT a benefit of Mixture of Experts?
Quick check
What does a capacity_factor of 1.5 mean?
Quick check
In the training objective loss_total = loss_task + 0.01 * loss_balance, what does λ=0.01 control?
Quick check
What happens if you increase num_experts but keep top-k the same?
Quick check
True or False: Switch Transformer uses top-k=1, selecting only one expert per token.
Go deeper
Sparse routing is discrete: either a token goes to an expert or it doesn't. But gradients need to flow back through the router to update its weights. This is tricky because argmax() is not differentiable.
One solution is the straight-through estimator (STE): the forward pass uses the hard top-k selection, while the backward pass treats the selection as if it were a soft (differentiable) operation such as softmax. This lets gradients flow to non-selected experts too, encouraging the router to explore new routings.
Alternative: use Gumbel-softmax to sample top-k stochastically, which is differentiable. But STE is simpler and works well in practice.
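For intuition about the stochastic option, here is a sketch of the sampling half only, using the Gumbel-top-k trick: add Gumbel noise to each logit and keep the k largest. This is not the full Gumbel-softmax relaxation, which also needs a temperature-controlled softmax and an autodiff framework to produce gradients. The function name and logits are hypothetical.

```python
import math
import random


def gumbel_top_k(logits, k, rng):
    """Sample k distinct experts by perturbing logits with Gumbel noise."""
    noisy = []
    for i, logit in enumerate(logits):
        # rng.random() is in [0, 1); clamp away from 0 to keep log() finite.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel, i))
    noisy.sort(reverse=True)
    return [i for _, i in noisy[:k]]


rng = random.Random(0)  # seeded so the run is repeatable
logits = [1.2, -0.3, 0.5, 2.1]
picks = gumbel_top_k(logits, k=2, rng=rng)
print("sampled experts:", picks)
```

Experts with higher logits are picked more often, but lower-scoring experts still get sampled occasionally, which is the exploration effect described above.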
During training, we use STE to relax the discrete decision. But during inference, we want deterministic, efficient routing. We simply take the hardmax: send each token to the top-k experts with no randomness.
This is deterministic and reproducible. But it means inference routing is different from training routing, which can hurt accuracy slightly. Some systems add a small temperature or noise during training to match inference routing better.
This is called the train-test mismatch problem in MoE systems and is an active area of research.
In a distributed MoE system, tokens for expert i might be on GPU j, but expert i is on GPU k. You need to send the tokens from j to k, process them, and send results back. This is all-to-all communication, which can be very expensive.
For a system with 16 GPUs and 128 experts (8 experts per GPU), each token might need to cross multiple GPU boundaries. The bandwidth cost can exceed the computation cost!
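You can get a rough feel for the traffic by counting how many token-to-expert dispatches leave their source GPU. This sketch assumes a simple contiguous placement, where expert e lives on GPU e // experts_per_gpu, and uses made-up token positions and routings. It counts messages only and does not model bandwidth or latency.

```python
def count_remote_dispatches(token_gpu, token_experts, experts_per_gpu):
    """Return (remote, total) dispatch counts under contiguous placement."""
    remote = 0
    total = 0
    for src_gpu, experts in zip(token_gpu, token_experts):
        for e in experts:
            total += 1
            # Assumed placement: expert e is hosted on GPU e // experts_per_gpu.
            if e // experts_per_gpu != src_gpu:
                remote += 1
    return remote, total


# 16 GPUs x 8 experts each = 128 experts; four hypothetical tokens, top-2 each.
token_gpu = [0, 0, 1, 15]
token_experts = [(3, 64), (0, 7), (8, 9), (127, 5)]
remote, total = count_remote_dispatches(token_gpu, token_experts, experts_per_gpu=8)
print(f"{remote} of {total} dispatches cross a GPU boundary")
```

Each remote dispatch also implies a return trip for the expert's output, so the real traffic is roughly double the count.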
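Mitigations: expert locality (assign experts to GPUs strategically so that tokens travel less) and expert parallelism (place different experts on different GPUs so that each GPU stores and runs only its own experts). Gradient checkpointing is often used alongside these, but it saves activation memory rather than communication.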
MoE can be combined with other sparsity and compression techniques:
• Sparse attention (Longformer, BigBird): not all tokens attend to all positions
• Pruning: remove low-weight edges in experts
• Quantization: store experts in lower precision
The limiting factor is usually the all-to-all communication, which none of these techniques remove. Attention sparsity cuts the cost of the attention layers, and pruning plus quantization can reduce expert memory. Some large systems combine MoE with attention sparsity.