Distributed Training
Distributed training splits model work across many accelerators. The art is keeping devices busy while moving as little data as possible.
Mental model
Training at scale is choreography between compute and communication.
Large models typically combine data parallelism, tensor parallelism, pipeline parallelism, and memory sharding so that they fit in device memory and train efficiently.
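To make the "fit" part concrete, here is a rough back-of-the-envelope sketch of model-state memory per device. It assumes mixed-precision training with Adam, counted as 2 bytes of parameters, 2 bytes of gradients, and 12 bytes of optimizer state per parameter; that accounting, the 7-billion-parameter model size, and the helper name model_state_gb are illustrative assumptions, not figures from this lesson. Activations and communication buffers are ignored.

```python
# Assumed per-parameter cost for mixed-precision Adam (illustrative accounting):
# 2 bytes fp16 parameters + 2 bytes fp16 gradients + 12 bytes fp32 optimizer state.
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(num_params, num_shards=1):
    """Model-state memory per device in GB (1 GB = 1e9 bytes).

    num_shards=1 means every device holds a full copy (plain data parallelism).
    num_shards=N means parameters, gradients, and optimizer state are all
    split evenly across N devices (full sharding).
    """
    return num_params * BYTES_PER_PARAM / num_shards / 1e9


num_params = 7_000_000_000  # hypothetical model size

replicated = model_state_gb(num_params)
sharded = model_state_gb(num_params, num_shards=8)

print(f"replicated on every device: {replicated:.1f} GB")
print(f"fully sharded over 8 devices: {sharded:.1f} GB")
```

Under these assumptions a full replica does not fit on a single accelerator with tens of gigabytes of memory, while an eight-way shard does, which is why sharding is usually the first lever for memory fit.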
Modeled signals (all rated balanced): memory fit 72%, throughput 58%, scaling efficiency 59%.
Concept pipeline
Build the idea in four moves
Shard
Split data, parameters, gradients, or layers across devices.
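The sharding move can be sketched without any framework. The example below splits a flat list of values (standing in for parameters or gradients) into one contiguous shard per rank and then reassembles it, which is what an all-gather does conceptually. The function names shard and all_gather are illustrative and run in a single process; they are not a real library API.

```python
def shard(values, world_size):
    """Split a flat list into world_size contiguous, near-equal shards."""
    base, extra = divmod(len(values), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional element each.
        size = base + (1 if rank < extra else 0)
        shards.append(values[start:start + size])
        start += size
    return shards


def all_gather(shards):
    """Rebuild the full list by concatenating the shards in rank order."""
    return [v for s in shards for v in s]


params = list(range(10))          # stand-in for a flattened parameter tensor
shards = shard(params, world_size=4)

for rank, local in enumerate(shards):
    print(f"rank {rank} holds {local}")

restored = all_gather(shards)
```

Each rank stores only its own slice between steps; the full list is materialized only when a computation needs it.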
Focus lens
The part that usually clicks late
DDP
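Replicate the full model on every device, give each replica a different data shard, and average the gradients across replicas so that every copy applies the same update.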
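A minimal single-process sketch of that idea, assuming a one-parameter linear model y_hat = w * x trained with mean squared error. The data, learning rate, and helper names (grad_mse, all_reduce_mean) are made up for illustration; a real setup would run one process per device and use a collective all-reduce instead of a Python loop.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error with respect to w for y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per replica."""
    return sum(values) / len(values)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
world_size = 2
w = 0.0      # every replica starts from the same weight
lr = 0.01

# Give each replica an equal-sized, disjoint shard of the batch.
shard_size = len(xs) // world_size
x_shards = [xs[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]
y_shards = [ys[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]

# Each replica computes a gradient on its own shard only.
local_grads = [grad_mse(w, x_shards[r], y_shards[r]) for r in range(world_size)]

# Averaging the local gradients gives every replica the same global gradient.
avg_grad = all_reduce_mean(local_grads)
full_grad = grad_mse(w, xs, ys)  # single-device reference on the whole batch

# Every replica applies the identical update, so the copies stay in sync.
replica_weights = [w - lr * avg_grad for _ in range(world_size)]

print("local gradients:", local_grads)
print("averaged gradient:", avg_grad, "full-batch gradient:", full_grad)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so the replicas behave like one device training on the whole batch. Unequal shards would need a weighted average to keep that equivalence.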
Knowledge check
What does FSDP mainly shard?