GPU Architecture for ML
GPUs accelerate ML by running thousands of simple operations in parallel. Performance depends as much on memory movement as arithmetic.
Mental model
Keep the many tiny workers fed without waiting on memory.
Understanding warps, tensor cores, bandwidth, and occupancy helps explain why kernels are fast or painfully slow.
Tensor-core use
balanced69% modeled signal
Bandwidth pressure
balanced66% modeled signal
Throughput
balanced55% modeled signal
Concept pipeline
Build the idea in four moves
Interactive lab
Tune a matrix kernel for throughput.
Threads
Group work into warps that execute together.
Focus lens
The part that usually clicks late
Occupancy
Enough active warps hide latency.
Tensor-core use
69
Bandwidth pressure
66
Throughput
55
Knowledge check
Why do fused kernels help?
Next horizon