Interactive lesson · ~18 min · Intermediate

GPU Architecture for ML

GPUs accelerate ML by running thousands of simple operations in parallel. Performance depends as much on memory movement as on arithmetic.

CUDA cores · Tensor cores · Bandwidth

Mental model

Keep the many tiny workers fed without waiting on memory.

Understanding warps, tensor cores, bandwidth, and occupancy helps explain why kernels are fast or painfully slow.
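The balance between arithmetic and memory movement can be sketched with a simple roofline-style bound: attainable throughput is the lower of peak compute and bandwidth times arithmetic intensity (FLOPs per byte moved). The device numbers below are hypothetical round values, not the specs of any real GPU, and the function name `attainable_flops` is introduced here only for illustration.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline-style bound: the lower of the compute and memory limits."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)

# Hypothetical device numbers, chosen for round arithmetic only.
PEAK = 100e12  # 100 TFLOP/s of peak compute
BW = 1e12      # 1 TB/s of memory bandwidth

# Elementwise fp32 add: 1 FLOP per 12 bytes moved (two loads, one store).
elementwise = attainable_flops(PEAK, BW, 1 / 12)

# Large matmul with heavy data reuse: assume 200 FLOPs per byte moved.
matmul = attainable_flops(PEAK, BW, 200.0)

print(f"elementwise: {elementwise / PEAK:.4%} of peak (memory-bound)")
print(f"matmul:      {matmul / PEAK:.0%} of peak (compute-bound)")
```

Under these assumed numbers the elementwise kernel cannot reach even 0.1% of peak compute no matter how well it is written, which is the sense in which the workers sit idle waiting on memory.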

Tensor-core use: balanced (69% modeled signal)

Bandwidth pressure: balanced (66% modeled signal)

Throughput: balanced (55% modeled signal)

Concept pipeline

Build the idea in four moves

Interactive lab

Tune a matrix kernel for throughput.

Threads

Group work into warps that execute together.
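As a rough illustration of how threads are grouped, the sketch below maps a linear thread index within a one-dimensional block to its warp and its lane inside that warp. It assumes the 32-thread warp width used on NVIDIA GPUs; the helper names `warp_and_lane` and `warps_per_block` are hypothetical and exist only for this example.

```python
WARP_SIZE = 32  # warp width on NVIDIA GPUs

def warp_and_lane(thread_idx):
    """Return (warp id, lane id) for a linear thread index in a 1D block."""
    return divmod(thread_idx, WARP_SIZE)

def warps_per_block(threads_per_block):
    """Warps needed for a block; the last warp may be only partly filled."""
    return -(-threads_per_block // WARP_SIZE)  # ceiling division

print(warp_and_lane(33))     # thread 33 is lane 1 of warp 1
print(warps_per_block(256))  # 256 threads fill 8 warps exactly
print(warps_per_block(100))  # 100 threads need 4 warps, leaving 28 lanes idle
```

Block sizes that are multiples of 32 avoid the partly filled last warp shown in the final call.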

Focus lens

The part that usually clicks late

Occupancy

Enough active warps hide latency.
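Occupancy can be estimated as the ratio of warps resident on a streaming multiprocessor (SM) to the maximum it supports. The sketch below is a simplified model: the per-SM limits are assumed round values rather than the figures for any specific GPU, and real hardware allocates registers and shared memory in granules that this ignores. The function name `occupancy` and its parameters are illustrative.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=64, max_blocks=32,
              regs_per_sm=65536, smem_per_sm=65536, warp_size=32):
    """Estimate occupancy under assumed (hypothetical) per-SM resource limits."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division

    # How many blocks fit on one SM under each resource limit.
    by_warps = max_warps // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks

    # The tightest limit decides how many blocks are resident.
    blocks = min(by_warps, by_regs, by_smem, max_blocks)
    return blocks * warps_per_block / max_warps

print(occupancy(256, 32, 0))   # registers and warps both allow 8 blocks
print(occupancy(256, 128, 0))  # heavy register use leaves room for 2 blocks
```

In this model, quadrupling the registers used per thread cuts occupancy from 1.0 to 0.25, leaving fewer warps available to switch to while others wait on memory.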

Knowledge check

Why do fused kernels help?
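One way to reason about this question is to count global-memory traffic. The sketch below assumes a chain of three elementwise fp32 operations, for example relu(a * x + b) with scalar a and b, where each unfused kernel reads one tensor and writes one tensor. The helper `bytes_moved` is a hypothetical name, and the model ignores caches and kernel-launch overhead.

```python
def bytes_moved(n_elements, n_kernels, bytes_per_element=4):
    """Global-memory traffic when each kernel reads and writes one tensor."""
    return n_kernels * 2 * n_elements * bytes_per_element

n = 1 << 20  # about one million fp32 elements

# Unfused: multiply, add, and relu each make a round trip to global memory.
unfused = bytes_moved(n, n_kernels=3)

# Fused: intermediates stay in registers, so only x is read and y is written.
fused = bytes_moved(n, n_kernels=1)

print(f"unfused: {unfused} bytes, fused: {fused} bytes")
print(f"traffic reduction: {unfused // fused}x")
```

For a memory-bound chain like this, cutting traffic by 3x under these assumptions translates fairly directly into runtime, since the arithmetic itself is nearly free.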

Next horizon

Where this topic is headed

Triton kernels
FP8 training
Memory-bound profiling