Interactive lesson · ~18 min · Intermediate

GPU Architecture for ML

GPUs accelerate ML by running thousands of simple operations in parallel. Performance depends as much on how quickly data moves through memory as on raw arithmetic throughput.

CUDA cores · Tensor cores · Bandwidth

Mental model

Keep the many tiny workers fed without waiting on memory.

Understanding warps, tensor cores, bandwidth, and occupancy helps explain why kernels are fast or painfully slow.
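
One way to make the "keep the workers fed" idea concrete is a roofline estimate: a kernel can run no faster than the compute peak, and no faster than memory bandwidth multiplied by the FLOPs it performs per byte moved. The sketch below is illustrative only. The device numbers (100 TFLOP/s, 1 TB/s) and the function name `attainable_flops` are assumptions made up for this example, not figures from the lesson or from any specific GPU.

```python
def attainable_flops(peak_flops: float,
                     bandwidth_bytes_per_s: float,
                     intensity_flops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity_flops_per_byte)


# Hypothetical device (assumed values, not a real part).
PEAK = 100e12   # 100 TFLOP/s
BW = 1e12       # 1 TB/s

# FP32 elementwise add: 1 FLOP per 12 bytes (two 4-byte loads, one 4-byte store).
add_intensity = 1 / 12

# FP16 square matmul with ideal reuse: 2*N^3 FLOPs over three N x N matrices
# of 2-byte elements, each touched once.
n = 4096
matmul_intensity = (2 * n**3) / (3 * n**2 * 2)

add_bound = attainable_flops(PEAK, BW, add_intensity)
matmul_bound = attainable_flops(PEAK, BW, matmul_intensity)

print(f"elementwise add bound: {add_bound / 1e12:.3f} TFLOP/s (memory-bound)")
print(f"matmul bound:          {matmul_bound / 1e12:.3f} TFLOP/s (compute-bound)")
```

Under these assumed numbers the elementwise add is limited by bandwidth long before it reaches the compute peak, while the large matmul can in principle reach it.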

Tensor-core use: balanced (69% modeled signal)

Bandwidth pressure: balanced (66% modeled signal)

Throughput: balanced (55% modeled signal)

Concept pipeline

Build the idea in four moves

Interactive lab

Tune a matrix kernel for throughput.

Threads

Group work into warps that execute together.
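
A small sketch of how a block of threads maps onto warps. It assumes the 32-thread warp size used by NVIDIA GPUs; the helper names `warps_per_block` and `lane_utilization` are hypothetical and exist only for this illustration.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs (assumed for this sketch)


def warps_per_block(threads_per_block: int) -> int:
    """A partially filled warp still occupies a full warp slot."""
    return math.ceil(threads_per_block / WARP_SIZE)


def lane_utilization(threads_per_block: int) -> float:
    """Fraction of warp lanes that carry a real thread."""
    return threads_per_block / (warps_per_block(threads_per_block) * WARP_SIZE)


for tpb in (100, 128, 256):
    print(tpb, warps_per_block(tpb), f"{lane_utilization(tpb):.2%}")
```

Block sizes that are multiples of the warp size avoid idle lanes in the last warp, which is one reason such sizes are the usual starting point.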

Tile size: 60 (small to large)

Memory reuse: 72 (poor to strong)

Precision: 50 (FP32 to FP8)
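
The three controls interact through memory traffic. As a rough sketch, assume a square N x N matmul computed in T x T tiles, where each element of the two inputs is re-read from global memory N/T times and the output is written once in the same precision. That is a simplified textbook model, not the lab's actual formula, and the names below are hypothetical.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "fp8": 1}


def tiled_matmul_traffic_bytes(n: int, tile: int, dtype: str) -> int:
    """Global-memory bytes moved under the simplified tiling model."""
    assert n % tile == 0, "this sketch assumes the tile size divides n"
    loads = 2 * n**3 // tile   # A and B elements, each read n/tile times
    stores = n * n             # output written once
    return (loads + stores) * BYTES_PER_ELEMENT[dtype]


def arithmetic_intensity(n: int, tile: int, dtype: str) -> float:
    """FLOPs per byte: 2*n^3 multiply-adds over the modeled traffic."""
    return 2 * n**3 / tiled_matmul_traffic_bytes(n, tile, dtype)


n = 1024
for tile in (16, 32, 64):
    for dtype in ("fp32", "fp8"):
        print(tile, dtype, f"{arithmetic_intensity(n, tile, dtype):.1f} FLOP/byte")
```

In this model the intensity is roughly the tile size divided by the bytes per element, so larger tiles (more reuse) and narrower precision both reduce bandwidth pressure. Real kernels are also limited by how much shared memory and how many registers a larger tile consumes.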

Focus lens

The part that usually clicks late

Occupancy

With enough active warps, the scheduler can switch to a warp that is ready to run while others wait on memory, which hides latency.
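
Occupancy can be estimated as the number of warps resident on a streaming multiprocessor divided by the maximum it supports, where the resident count is capped by whichever resource runs out first. The per-SM limits below are assumed placeholder values, and the sketch ignores allocation granularity, so treat it as an approximation rather than a replacement for a vendor occupancy calculator.

```python
import math


def occupancy(threads_per_block: int,
              regs_per_thread: int,
              smem_per_block: int,
              *,
              max_warps: int = 64,        # assumed per-SM warp limit
              max_blocks: int = 32,       # assumed per-SM block limit
              regs_per_sm: int = 65536,   # assumed register file size
              smem_per_sm: int = 98304    # assumed shared memory in bytes
              ) -> float:
    """Fraction of the SM's warp slots that can be resident at once."""
    warps = math.ceil(threads_per_block / 32)
    by_warps = max_warps // warps
    by_regs = regs_per_sm // (regs_per_thread * warps * 32)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    blocks = min(max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps / max_warps


print(occupancy(256, 32, 0))       # limited only by warp slots
print(occupancy(256, 128, 0))      # register-limited
print(occupancy(256, 32, 49152))   # shared-memory-limited
```

A kernel that uses many registers or a lot of shared memory per block leaves fewer warps available to cover memory stalls, which is the trade-off against larger tiles.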


Knowledge check

Why do fused kernels help?
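
One way to reason about this question is to count global-memory traffic. In the simplified model below, each unfused elementwise kernel reads its input from global memory and writes its output back, while a fused kernel reads once, keeps intermediates in registers, and writes once. The function name and the model itself are assumptions for illustration.

```python
def elementwise_chain_traffic(num_elements: int,
                              num_ops: int,
                              bytes_per_element: int = 4,
                              fused: bool = False) -> int:
    """Global-memory bytes moved by a chain of unary elementwise ops."""
    per_pass = 2 * num_elements * bytes_per_element  # one read plus one write
    return per_pass if fused else per_pass * num_ops


elements = 1_000_000
unfused = elementwise_chain_traffic(elements, num_ops=3)
fused = elementwise_chain_traffic(elements, num_ops=3, fused=True)
print(f"unfused: {unfused / 1e6:.0f} MB, fused: {fused / 1e6:.0f} MB")
```

For memory-bound operations the saving in traffic translates fairly directly into time, and fusion also replaces several kernel launches with one.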

Next horizon

Where this topic is headed

Triton kernels

FP8 training

Memory-bound profiling

Lesson outline: Threads, Memory, Tensor cores, Kernel