Interactive · ~18 min · Advanced

Mixture of Experts — scaling models through sparse routing

Most large language models use dense forward passes: every token touches every parameter. Mixture of Experts (MoE) changes this with a router network that sends each token to only a few specialized experts. This unlocks massive scale—GPT-3–sized capacity with far less compute. Learn how routing, gating, and load balancing work together to make it practical.

Why Mixture of Experts?

Modern language models are massive. GPT-3 has 175 billion parameters. In a dense model, every forward pass uses all of them for every token, and every training step computes gradients for all of them. This is a large part of why training costs millions of dollars.

Mixture of Experts sidesteps this. Instead of routing every token through every parameter, we add a router network that learns to send each token to only a few specialized experts. A token might go to the "code expert" and "reasoning expert", skipping the "poetry expert."

The result: scale to 1 trillion parameters while keeping compute costs reasonable. Google's Switch Transformer uses this approach and reports up to 7x faster pre-training than dense T5 baselines given the same compute budget.

The tradeoff

Dense models: Every parameter is always active, but compute is predictable. MoE: Only a small fraction activates per token, but routing adds complexity and requires careful load balancing to avoid bottlenecks.

Dense: All parameters active

100% parameter utilization

MoE: Only selected experts active

12.5% parameter utilization (2 of 16)

The compute savings depend on the ratio of top-k selections to experts. With 128 experts and top-k=2, each token activates only 2/128 ≈ 1.6% of the expert parameters. That's a potential 64x reduction in expert compute compared with running all 128 experts.
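The ratio arithmetic can be checked with a few lines of Python. This is only a sketch: the function name is an illustrative assumption, and it counts expert parameters only, ignoring attention layers, the router, and any communication overhead.

```python
def active_expert_fraction(num_experts: int, top_k: int) -> float:
    """Fraction of expert parameters a single token activates."""
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    return top_k / num_experts


for num_experts, top_k in [(8, 2), (16, 2), (128, 2)]:
    fraction = active_expert_fraction(num_experts, top_k)
    # The upper bound on expert-compute savings is the inverse of the fraction.
    print(f"{num_experts} experts, top-{top_k}: "
          f"{fraction:.2%} active, up to {1 / fraction:.0f}x less expert compute")
```

The printed ratio is an upper bound on savings in the expert layers, not a measured end-to-end speedup.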

In practice, it's less. Routing adds overhead, synchronization costs increase, and not all experts are equally loaded. But even with these costs, you get 5-10x efficiency gains.

Real systems like Switch Transformer report up to 7x faster pre-training than dense T5 baselines at the same FLOPs per token, while maintaining similar or better accuracy.

Quick check

What's the main advantage of Mixture of Experts over dense models?

The router: How tokens get dispatched

The core of MoE is the router network—a small learned neural network that decides which experts receive each token. Let's see how it works.

[Interactive figure: Router: Token → Expert Selection. Panels: input token (position 5), gating network scores, softmax probabilities, top-2 selection (dispatch to experts). Readouts: experts activated 2 / 8, sparsity 75%, top expert score.]

How it works: The router network computes scores for each expert. These scores go through softmax to become probabilities. The top-2 experts receive the token. This sparse routing saves computation by only activating 25% of the experts.

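Raw scores can be negative, vary wildly, and don't sum to 1. Softmax normalizes them to probabilities: all in [0, 1] and summing to 1. Softmax preserves the ordering of the scores, so top-k picks the same experts either way; the probabilities matter because they become the gate weights used to combine the selected experts' outputs.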

Softmax is differentiable everywhere, so gradients flow back to the router weights during training. This lets the router learn which features of the input should activate which experts.
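The score → softmax → top-k pipeline described above can be sketched in plain Python. The score values are made up for illustration, and renormalizing the selected probabilities so the gate weights sum to 1 is an assumption of this sketch; some systems use the raw softmax probabilities instead.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(scores, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    probs = softmax(scores)
    # Sort expert indices by probability, highest first, and keep the first k.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize the selected probabilities so the gate weights sum to 1.
    selected_mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / selected_mass) for i in chosen]


# Hypothetical raw router scores for one token over 8 experts.
scores = [0.1, 2.0, -1.3, 0.4, 1.5, -0.2, 0.0, -2.1]
routing = route_top_k(scores, k=2)
print(routing)
```

Here experts 1 and 4 have the highest raw scores, so they are the two selected; the remaining six experts do no work for this token.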

Quick check

In the router, what does top-k selection mean?

Expert load: Distribution and balance

As tokens flow through the MoE layer, some experts get more traffic than others. This creates a subtle but critical problem: load imbalance can destroy efficiency.

[Interactive figure: Expert Grid: Token flow & load distribution. Shows expert utilization for experts E0 to E7 (brightness = more tokens), the token routing sequence (16 tokens), and the load distribution, with readouts for most loaded expert, least loaded expert, and balance ratio.]

What you're seeing: Each token (left) is routed to 2 experts (right). The bars below show the cumulative load on each expert. Notice how some experts get more traffic than others—this is why load balancing is crucial to avoid bottlenecks.

If the router routes 90% of tokens to 2 experts and 10% to the other 6, you have a bottleneck. Those 2 overloaded experts can't process all tokens efficiently. You hit memory bandwidth limits, context switching overhead, and synchronization delays.

Meanwhile, the 6 underutilized experts sit idle. You paid for their parameters but aren't using them. This defeats the purpose of sparse routing.

The solution: auxiliary loss. During training, we penalize the router for producing imbalanced routings. This incentivizes the router to spread load evenly, ensuring all experts pull their weight.
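One common formulation of this penalty, used in the Switch Transformer paper, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert and sums over experts. The sketch below assumes top-1 routing and plain Python lists; the function and variable names are illustrative, not taken from any library.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style balance term: num_experts * sum_i(f_i * P_i).

    router_probs: per-token list of softmax probabilities over experts.
    assignments:  per-token index of the expert the token was sent to (top-1).
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i.
    p = [sum(tok[i] for tok in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced toy batch: 4 tokens, 4 experts, uniform probabilities.
uniform = [[0.25] * 4 for _ in range(4)]
balanced = load_balance_loss(uniform, [0, 1, 2, 3], 4)

# Collapsed toy batch: every token strongly prefers expert 0.
peaked = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]
collapsed = load_balance_loss(peaked, [0, 0, 0, 0], 4)

aux_weight = 0.01  # small coefficient applied before adding to the task loss
print(balanced, collapsed, aux_weight * collapsed)
```

The term is smallest when routing is uniform and grows as routing collapses onto a few experts, which is the pressure the router needs. The small coefficient keeps the penalty from overwhelming the task loss.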

Quick check

What's the load balancing auxiliary loss designed to do?

Expert capacity: The hard constraint

Each expert can only process so many tokens before running out of memory. We call this the expert capacity. When capacity is exceeded, we have to drop or reroute excess tokens.

Capacity is computed as:

capacity = ceil((total_tokens / num_experts) * capacity_factor)

The capacity_factor (usually 1.25) adds headroom for imbalance. If you set it to 1.0, each expert gets exactly its fair share, but if load is imbalanced, you'll overflow frequently.

Higher capacity_factor (e.g., 1.5 or 2.0) wastes compute on idle experts but prevents overflow.

Example calculation

Total tokens: 256
Number of experts: 8
Fair share: 32 tokens

With capacity_factor = 1.25:

Capacity: 40 tokens

(32 * 1.25 = 40)
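The capacity formula translates directly into code. A minimal sketch, assuming the function name `expert_capacity` (not from any particular library):

```python
import math


def expert_capacity(total_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """capacity = ceil((total_tokens / num_experts) * capacity_factor)"""
    return math.ceil((total_tokens / num_experts) * capacity_factor)


print(expert_capacity(256, 8, 1.25))  # worked example: 40 tokens per expert
print(expert_capacity(256, 8, 1.0))   # fair share with no headroom: 32
```

The `ceil` matters when the fair share times the factor is not a whole number: capacity is rounded up so no expert gets a fractional slot.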

When tokens overflow an expert's capacity, you can:
1. Drop them (treated as ignored during loss computation)
2. Reroute them to the next-best expert (might cascade)
3. Increase capacity_factor dynamically during training

Most systems drop overflowing tokens: a dropped token skips the expert and is carried forward unchanged by the residual connection. It's simple but wastes some tokens, and the auxiliary loss pushes the router to make overflow rare. A smarter approach is adaptive load balancing, which adjusts capacity dynamically based on observed imbalance.
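Token dropping can be sketched as a first-come, first-served fill of each expert's buffer. This is a simplified illustration, assuming top-1 routing and processing tokens in sequence order; real systems do this with batched tensor operations, and all names here are hypothetical.

```python
def dispatch_with_capacity(assignments, num_experts, capacity):
    """Assign tokens to experts in order; tokens that exceed capacity are dropped.

    assignments: per-token index of the preferred expert (top-1 for simplicity).
    Returns (kept, dropped), where kept maps expert -> list of token indices.
    """
    kept = {expert: [] for expert in range(num_experts)}
    dropped = []
    for token, expert in enumerate(assignments):
        if len(kept[expert]) < capacity:
            kept[expert].append(token)
        else:
            # Overflow: this token skips the expert layer entirely.
            dropped.append(token)
    return kept, dropped


# 8 tokens, 4 experts, capacity 2: expert 0 is oversubscribed.
kept, dropped = dispatch_with_capacity([0, 0, 0, 1, 2, 0, 3, 1], num_experts=4, capacity=2)
print(kept)
print(dropped)
```

Four tokens ask for expert 0 but only two fit, so the later two (tokens 2 and 5) are dropped while expert 2 and expert 3 run below capacity.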

Quick check

If you have 100 tokens and 4 experts with capacity_factor=1.5, what's the capacity per expert?

Step-by-step: From dense to sparse

Dense models: Every token sees every parameter

Traditional neural networks process all tokens through all layers.

Dense computation: Full matrix multiply

output = token @ W_1 @ W_2 @ ... @ W_n

In a dense model like a standard Transformer:

  • Every token attends to every other token (attention)
  • Every token passes through every feed-forward layer
  • All parameters are always active

This is compute-intensive but straightforward. All tokens get the same capacity and compute budget.

Key takeaway

MoE = sparse routing + learned specialization + load balancing. Each token only activates a small subset of experts, scaling model capacity without proportional compute increase.
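To make the takeaway concrete, here is a toy MoE layer for a single token in plain Python. It is a sketch under several assumptions: the router is a single linear layer, the experts are arbitrary callables (here they just scale their input), and the selected gate weights are renormalized to sum to 1.

```python
import math


def moe_forward(token, router_weights, experts, k=2):
    """Run one token through a toy MoE layer.

    token:          list of floats (the token's hidden vector).
    router_weights: one weight vector per expert; score = dot(weights, token).
    experts:        list of callables, each mapping a vector to a vector.
    """
    scores = [sum(w * x for w, x in zip(weights, token)) for weights in router_weights]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    output = [0.0] * len(token)
    for i in top:
        gate = probs[i] / mass
        expert_out = experts[i](token)  # only the selected experts ever run
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output, top


# Four hypothetical experts that just scale their input by different factors.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_forward([0.5, 2.0], router_weights, experts, k=2)
print(chosen, output)
```

Only two of the four experts are called, yet the layer holds the parameters of all four: capacity grows with the number of experts while per-token compute grows with k.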

Playground: Build your own MoE configuration

Use the sliders below to design an MoE architecture. See how each parameter affects throughput, sparsity, and balance. Try the presets to understand the tradeoffs.

[Interactive playground: MoE Playground: Configure & experiment. Quick presets plus four sliders, shown with their default values.]

Number of experts: 8. More experts = more specialization, but higher routing overhead
Top-k: 2. Higher k = more compute per token, but more capacity
Capacity factor: 1.25. Higher = more tolerance for load imbalance, but wasted compute
Batch size: 64. Larger batches amortize routing overhead

Readouts: capacity per expert (tokens each expert can process), sparsity (experts inactive per token), estimated throughput (relative tokens/sec), and balance metric.

Configuration summary

Total experts: 8
Experts/token: 2
Total params (experts): ~8x
Active params/token: ~25%
Batch size: 64
Cap factor: 1.25x

💡 Insights

  • Increasing experts scales model size (8x) without full compute increase
  • Top-k=2 means each token uses 25% of experts
  • Capacity_factor=1.25 provides 25% headroom for load imbalance
  • A balance metric near 0 indicates good load distribution

Real-world MoE systems

Switch Transformer: up to 2,048 experts, top-k 1, 7x efficiency gain. Massive scale with top-k=1 (extreme sparsity).

GLaM: 64 experts per MoE layer, top-k 2, 5x efficiency gain. Trillion-scale (1.2T parameters) with balanced routing.

Mixtral 8x7B: 8 experts, top-k 2, 2.7x efficiency gain. Open-weight MoE with about 13B of its 47B parameters active per token.

GShard: up to 2,048 experts, top-k 2, 6x efficiency gain. Expert-wise sharding for large-scale training.

ST-MoE: 64 experts, top-k 2, 8x efficiency gain. Stable training with expert routing.

LLaMA-MoE: 8–16 experts, top-k 2, 1.5x efficiency gain. MoE applied to open-source LLaMA.

Key takeaways

Sparse routing is the core

A learned router sends each token to only top-k experts, activating ~1–10% of parameters per token.

Load balancing is critical

Without an auxiliary loss, routers collapse: all tokens go to the same few experts, creating bottlenecks.

Capacity prevents overflow

Each expert has a fixed capacity. Overflow tokens are dropped or rerouted. Capacity_factor tunes the tradeoff.

Scaling is the payoff

MoE enables 1 trillion-parameter models with compute budgets similar to dense 100B-parameter models.

Communication costs matter

Distributing tokens across experts requires efficient all-to-all communication. This can be 20% of total cost.

Training is harder

Top-k routing is a discrete choice, which is not differentiable on its own. Techniques like straight-through estimators help.

Final knowledge checks

Quick check

Which is NOT a benefit of Mixture of Experts?

Quick check

What does a capacity_factor of 1.5 mean?

Quick check

In the auxiliary loss loss_total = loss_task + λ * loss_balance with λ=0.01, what does λ control?

Quick check

What happens if you increase num_experts but keep top-k the same?

Quick check

True or False: Switch Transformer uses top-k=1, selecting only one expert per token.

Go deeper

Sparse routing is discrete: either a token goes to an expert or it doesn't. But gradients need to flow back through the router to update its weights. This is tricky because argmax() is not differentiable.

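One solution is the straight-through estimator (STE): use the hard top-k selection in the forward pass, but during the backward pass treat it as if it were a soft (differentiable) operation. This lets gradients flow to non-selected experts too, encouraging the router to explore new routings. Many MoE layers also multiply each selected expert's output by its gate probability, which gives the router a gradient path through the experts it did select.

Alternative: use Gumbel-softmax to sample top-k stochastically, which gives a differentiable relaxation of the discrete choice. But STE is simpler and works well in practice.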

During training, we use STE to relax the discrete decision. But during inference, we want deterministic, efficient routing. We simply take the hardmax: send each token to the top-k experts with no randomness.

This is deterministic and reproducible. But it means inference routing is different from training routing, which can hurt accuracy slightly. Some systems add a small temperature or noise during training to match inference routing better.

This is called the train-test mismatch problem in MoE systems and is an active area of research.

In a distributed MoE system, tokens for expert i might be on GPU j, but expert i is on GPU k. You need to send the tokens from j to k, process them, and send results back. This is all-to-all communication, which can be very expensive.

For a system with 16 GPUs and 128 experts (8 experts per GPU), each token might need to cross multiple GPU boundaries. The bandwidth cost can exceed the computation cost!

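Solutions: expert locality (assign experts to GPUs strategically so tokens travel less) and expert parallelism (place different experts on different GPUs so each expert processes its tokens locally). Gradient checkpointing is often used alongside these, but it saves memory rather than bandwidth.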
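To get a feel for how much traffic crosses GPU boundaries, the sketch below counts remote dispatches for a handful of tokens. It assumes a contiguous placement where expert e lives on GPU e // experts_per_gpu; the placement rule, function name, and token data are all illustrative assumptions.

```python
def cross_gpu_fraction(token_gpus, token_experts, experts_per_gpu):
    """Fraction of (token, expert) dispatches that must leave the token's GPU.

    token_gpus:    GPU index where each token currently lives.
    token_experts: for each token, the list of experts it was routed to.
    Assumes contiguous placement: expert e lives on GPU e // experts_per_gpu.
    """
    total = 0
    remote = 0
    for gpu, routed in zip(token_gpus, token_experts):
        for expert in routed:
            total += 1
            if expert // experts_per_gpu != gpu:
                remote += 1
    return remote / total


# 16 GPUs x 8 experts per GPU = 128 experts, with top-2 routing.
token_gpus = [0, 0, 1, 15]
token_experts = [[3, 40], [7, 0], [8, 127], [120, 5]]
fraction = cross_gpu_fraction(token_gpus, token_experts, experts_per_gpu=8)
print(fraction)
```

With uniformly random routing over 16 GPUs, most dispatches would be remote, which is why placement strategy and routing locality have such a large effect on step time.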

MoE can also be combined with other sparsity and compression techniques:
Sparse attention (Longformer, BigBird): not all tokens attend to all positions
Pruning: remove low-weight edges in experts
Quantization: store experts in lower precision

The limiting factor is usually the all-to-all communication, which none of these techniques removes. Attention sparsity reduces cost in the attention layers, while pruning and quantization reduce expert memory. All three can be used together with MoE in the same model.
