Interactive · ~18 min · Intermediate

Convolutions & CNNs — learning visual features

Convolutional Neural Networks are the foundational architecture for computer vision. Instead of treating images as flat arrays of pixels, CNNs learn hierarchical features through sliding kernels, pooling, and activation functions. Understand how a simple 3×3 kernel becomes a powerful feature detector, and how stacking them builds a complete image classifier.

What is Convolution?

A convolution is a mathematical operation where a small matrix (the kernel) slides across an image or feature map. At each position, we compute the dot product between the kernel and the overlapping region. This produces a single value in the output.

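The beauty of convolution is that it's parameter-efficient. Instead of a fully connected layer with millions of weights, a single 3×3 kernel (9 weights) is reused at every position, so it detects the same pattern everywhere in the image. This also builds in translation equivariance: when a feature moves in the input, its response moves by the same amount in the feature map, so the network detects it wherever it appears. Pooling later turns this into approximate translation invariance.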

Convolution formula

output[i, j] = Σ_{k,l} input[i+k, j+l] × kernel[k, l]
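The formula above can be written as two nested loops. Here is a minimal NumPy sketch, not an optimized implementation; the function name `conv2d_valid` and the sample arrays are illustrative assumptions. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # output[i, j] = sum over k, l of input[i+k, j+l] * kernel[k, l]
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical 5x5 input
kernel = np.ones((3, 3)) / 9.0                     # 3x3 averaging (blur) kernel
result = conv2d_valid(image, kernel)
print(result.shape)   # (3, 3)
print(result[0, 0])   # mean of the top-left 3x3 patch, about 6.0
```

Each output value depends only on one 3×3 patch of the input, which is why the same nine weights can be reused at every position.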

Stride

How many pixels the kernel moves at each step. With stride 2, the kernel skips every other position, which roughly halves the output size.

Padding

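Zeros added around the image border. With the right amount (P = 1 for a 3×3 kernel at stride 1), the output keeps the input's spatial dimensions, and the kernel can be centered on boundary pixels.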

Interactive Convolution

[Interactive demo: a 5×5 checkerboard input is convolved with an edge-detection kernel to produce a 3×3 output. Stride (step size for kernel sliding) and zero-padding around the edges are adjustable.]

Output dimensions: (H − K + 2P) / S + 1 = (5 − 3 + 2×0) / 1 + 1 = 3

H = input height, K = kernel size, P = padding, S = stride
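The output-size formula is easy to check in code. The helper below is a small sketch; the name `conv_output_size` is an assumption, and it uses floor division for cases where the stride does not divide evenly, which is the common convention in deep learning libraries.

```python
def conv_output_size(h, k, p=0, s=1):
    """Output length along one axis: floor((H - K + 2P) / S) + 1."""
    return (h - k + 2 * p) // s + 1

print(conv_output_size(5, 3))               # 3  (5x5 input, 3x3 kernel, no padding)
print(conv_output_size(5, 3, p=1))          # 5  (padding 1 preserves the size)
print(conv_output_size(28, 3, p=1, s=2))    # 14 (stride 2 roughly halves it)
```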

Quick check

What does a kernel in convolution do?

Kernels & Feature Maps

Different kernels detect different patterns. An edge detection kernel highlights boundaries, a blur kernel smooths, and a sharpen kernel enhances edges. By learning the right kernels, a CNN can detect semantic features: textures, shapes, objects. Below, we apply different filters to the same input and see how feature maps change.

Feature Maps from Different Filters

[Interactive demo: a 3×3 edge-detection kernel, which detects edges and boundaries, is applied to a 28×28 input image (downsampled for display). An optional ReLU step zeroes out negative values to produce the final feature map.]

Kernel (3×3), edge detection:

-1.0  0.0  1.0
-2.0  0.0  2.0
-1.0  0.0  1.0

How it works: The kernel slides across the input image. At each position, we compute the dot product between the kernel and the overlapping region. This produces one value in the output feature map. Different kernels detect different patterns—edges, textures, colors, and more.
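To see the edge-detection kernel at work, the sketch below applies it to a tiny synthetic image and then applies ReLU. The 5×5 images and the helper `conv2d_valid` are illustrative assumptions, not the data used in the demo.

```python
import numpy as np

# The 3x3 edge-detection kernel from the demo (a horizontal-gradient Sobel filter).
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

def conv2d_valid(image, kernel):
    """Stride-1, no-padding sliding dot product."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

# Hypothetical images: a dark-to-bright vertical edge and its mirror image.
dark_to_bright = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)
bright_to_dark = dark_to_bright[:, ::-1]

fmap_a = conv2d_valid(dark_to_bright, sobel_x)   # every row is 0, 4, 4
fmap_b = conv2d_valid(bright_to_dark, sobel_x)   # every row is -4, -4, 0

relu_a = np.maximum(fmap_a, 0.0)   # unchanged: the responses are already non-negative
relu_b = np.maximum(fmap_b, 0.0)   # all zeros: ReLU discards the negative responses
print(relu_a)
print(relu_b)
```

After ReLU, this kernel responds only to dark-to-bright transitions. A network that needs the opposite edge can learn a second kernel with the signs flipped.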

Quick check

Why can a CNN learn to detect objects by learning kernels?

A 28×28 image flattened is 784 pixels. A fully connected layer to 128 neurons needs 784 × 128 = 100,352 weights. A 3×3 convolution with 32 filters on a single-channel input needs only 3 × 3 × 32 = 288 weights (biases aside in both cases). Convolution achieves parameter efficiency through weight sharing: the same kernel detects the same pattern everywhere. This also provides translation equivariance: if a cat moves across the image, the same filters respond at its new location, and pooling in later layers makes the final prediction largely insensitive to the shift.
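The arithmetic above can be reproduced with two small helpers. The function names are assumptions for illustration; the bias option and the RGB case go slightly beyond the numbers quoted in the text.

```python
def dense_params(n_in, n_out, bias=False):
    """Weights in a fully connected layer, optionally counting one bias per output."""
    return n_in * n_out + (n_out if bias else 0)

def conv_params(k, c_in, c_out, bias=False):
    """Weights in a k x k convolution mapping c_in channels to c_out filters."""
    return k * k * c_in * c_out + (c_out if bias else 0)

dense = dense_params(28 * 28, 128)    # 100352
conv = conv_params(3, 1, 32)          # 288
conv_rgb = conv_params(3, 3, 32)      # 864 for a 3-channel (RGB) input

print(dense, conv, conv_rgb)
print(dense // conv)                  # 348: the dense layer has ~348x more weights
```

The convolution's parameter count does not depend on the image size, while the dense layer's grows with every extra pixel.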

Pooling & Downsampling

After convolution, we apply pooling to reduce spatial dimensions. This cuts computation and memory in later layers and makes the learned features more robust to small translations.

Max pooling takes the maximum value in each window (e.g., 2×2). This preserves the strongest activation, which is useful for detecting whether a feature is present. Average pooling takes the mean, providing a smoother summary.

Computational benefit

2×2 pooling with stride 2 halves the height and the width, so each feature map has 4× fewer values. Fewer computations downstream.

Robustness

Small shifts in the image affect pooled output less. Promotes position invariance.
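The robustness claim can be illustrated with a toy example. In the sketch below, a single activation is moved by one pixel inside the same pooling window and then into the neighboring window. The 4×4 feature maps and the helper name `max_pool_2x2` are assumptions made for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Hypothetical 4x4 feature maps with a single activation each.
a = np.zeros((4, 4))
a[0, 0] = 1.0
b = np.zeros((4, 4))
b[0, 1] = 1.0   # shifted one pixel, still inside the same 2x2 window
c = np.zeros((4, 4))
c[0, 2] = 1.0   # shifted two pixels, now inside the next window

same_window = np.array_equal(max_pool_2x2(a), max_pool_2x2(b))
next_window = np.array_equal(max_pool_2x2(a), max_pool_2x2(c))
print(same_window)   # True: the pooled output is unchanged
print(next_window)   # False: a larger shift does change the output
```

The invariance is only local: pooling absorbs shifts smaller than the window, not arbitrary ones.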

Pooling Operations

[Interactive demo: max pooling with a 2×2 window (stride = window size) on an 8×8 input. Each output cell summarizes one 2×2 input region.]

Input (8×8)

1 2 1 0 3 2 1 4
2 3 2 1 2 1 3 2
1 2 1 0 1 0 2 1
3 4 3 2 4 3 3 5
2 3 2 1 3 2 2 3
1 2 1 0 2 1 1 2
4 5 4 3 5 4 4 6
3 4 3 2 4 3 3 5

Output (4×4)

3.0 2.0 3.0 4.0
4.0 3.0 4.0 5.0
3.0 2.0 3.0 3.0
5.0 4.0 5.0 6.0

Max Pooling: Takes the maximum value in each window. Preserves the strongest features.

Average Pooling: Takes the average value in each window. Smoother representation of features.
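Both pooling variants fit in a few lines of NumPy. The sketch below uses the 8×8 demo input from above; the function name `pool2d` and its `mode` argument are assumptions, and it handles only non-overlapping windows (stride = window size).

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling (stride = window size) on a 2D array."""
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "input must divide evenly into windows"
    # Reshape so that axes 1 and 3 index the positions inside each window.
    windows = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

x = np.array([
    [1, 2, 1, 0, 3, 2, 1, 4],
    [2, 3, 2, 1, 2, 1, 3, 2],
    [1, 2, 1, 0, 1, 0, 2, 1],
    [3, 4, 3, 2, 4, 3, 3, 5],
    [2, 3, 2, 1, 3, 2, 2, 3],
    [1, 2, 1, 0, 2, 1, 1, 2],
    [4, 5, 4, 3, 5, 4, 4, 6],
    [3, 4, 3, 2, 4, 3, 3, 5],
], dtype=float)

max_out = pool2d(x, mode="max")   # first row: 3, 2, 3, 4
avg_out = pool2d(x, mode="avg")   # first row: 2.0, 1.0, 2.0, 2.5
print(max_out)
print(avg_out)
```

The max-pooled result matches the 4×4 output shown in the demo. The averaged values are consistently lower because weak activations pull each window's mean down.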

Quick check

What is the main computational benefit of max pooling?

Max pooling is more commonly used because it preserves the strongest signal. If a feature detector fires at any location in the window, max pooling passes that signal forward. Average pooling smooths the signal and can dilute important activations. However, modern architectures such as ResNet use global average pooling just before the final classifier, where it collapses each feature map to a single value and replaces large dense layers. For your first CNN, max pooling is the standard choice.

The Complete CNN Pipeline

A complete CNN pipeline flows: input → convolution → ReLU → pooling → (repeat) → flatten → dense layers → softmax. Each stage builds on the previous one. Let's walk through all 6 steps with real feature visualizations.

Guided 6-Step Walkthrough

Input Image

Raw pixel values (28×28 grayscale). This is what the CNN sees.

28×28 Pixel Grid (Downsampled for Display)

Value range: 0 (black) to 255 (white)

Key Concepts:

  • Convolution: Sliding dot-product with filters
  • ReLU: Non-linearity (helps learn complex patterns)
  • Pooling: Downsampling (reduces computation)
  • Flatten: 2D → 1D (feeds into dense layers)
  • Classification: Final softmax produces probabilities
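The stages listed above can be chained into one forward pass. The following is a minimal NumPy sketch with a single conv block and random, untrained weights. The layer sizes (8 filters, 10 classes), the helper names, and the random input are all assumptions for illustration, so the resulting probabilities are meaningless until the weights are trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_same(x, kernels):
    """x: (H, W); kernels: (F, k, k). Zero-pads so the output is (H, W, F)."""
    F, k, _ = kernels.shape
    xp = np.pad(x, k // 2)            # zero-padding keeps the spatial size
    H, W = x.shape
    out = np.zeros((H, W, F))
    for f in range(F):
        for i in range(H):
            for j in range(W):
                out[i, j, f] = np.sum(xp[i:i + k, j:j + k] * kernels[f])
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """x: (H, W, C) with H and W divisible by `size`."""
    H, W, C = x.shape
    return x.reshape(H // size, size, W // size, size, C).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())           # subtract the max for numerical stability
    return e / e.sum()

image = rng.random((28, 28))                          # stand-in for a grayscale image
kernels = rng.normal(0, 0.1, size=(8, 3, 3))          # 8 untrained 3x3 filters
W_dense = rng.normal(0, 0.01, size=(14 * 14 * 8, 10)) # dense layer to 10 classes
b_dense = np.zeros(10)

conv_out = conv2d_same(image, kernels)   # (28, 28, 8)
activated = relu(conv_out)               # same shape, negatives zeroed
pooled = max_pool(activated)             # (14, 14, 8)
flat = pooled.reshape(-1)                # (1568,)
probs = softmax(flat @ W_dense + b_dense)  # (10,), sums to 1

print(conv_out.shape, pooled.shape, flat.shape, probs.shape)
```

A deeper network would repeat the conv, ReLU, and pool steps before flattening, with each later convolution summing over all input channels.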

Quick check

Why do we apply ReLU after convolution?

After multiple convolution and pooling layers, you have a 3D tensor: height × width × channels (e.g., 7×7×64). The dense (fully connected) layers expect a 1D vector. Flattening reshapes this to a single 1D array (7×7×64 = 3136 values). These flattened features are then passed through one or more dense layers to produce class logits, which are converted to probabilities via softmax.
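The flattening step is just a reshape. The sketch below uses random stand-in values for the 7×7×64 tensor and a hypothetical 10-class dense layer. It also checks where a given (row, column, channel) entry lands in the flat vector, assuming NumPy's default row-major order.

```python
import numpy as np

rng = np.random.default_rng(42)
features = rng.random((7, 7, 64))         # stand-in for the last conv/pool output
flat = features.reshape(-1)               # 7 * 7 * 64 = 3136 values
print(flat.shape)                         # (3136,)

# Row-major flattening: entry (i, j, c) lands at index (i * 7 + j) * 64 + c.
i, j, c = 2, 5, 10
print(flat[(i * 7 + j) * 64 + c] == features[i, j, c])   # True

W = rng.normal(0, 0.01, size=(flat.size, 10))   # hypothetical dense layer, 10 classes
b = np.zeros(10)
logits = flat @ W + b
exp = np.exp(logits - logits.max())       # subtract the max for numerical stability
probs = exp / exp.sum()
print(probs.shape)                        # (10,)
```

Flattening discards no values, but the dense layer that follows treats every position independently, which is why it comes only after the spatial dimensions are small.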

CNN Architecture Patterns

LeNet (1998)

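One of the first CNNs. 2 conv layers and 2 pooling (subsampling) layers followed by dense layers. Trained on handwritten digits (MNIST). The original used tanh-style activations and average-style subsampling; the pattern below is the common modern rendering with ReLU.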

Conv → ReLU → Pool → Conv → ReLU → Pool → Dense → Dense

AlexNet (2012)

Won the ImageNet challenge (ILSVRC 2012) with 5 conv layers, ReLU, max pooling, and dropout. Brought deep CNNs into the mainstream.

Conv → ReLU → Pool → Conv → ReLU → Pool → [Conv → ReLU] × 3 → Pool → Dense × 3

VGG (2014)

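Showed that depth improves accuracy. Replaces large kernels with stacks of small 3×3 kernels: two stacked 3×3 layers cover the same 5×5 receptive field with fewer weights and an extra non-linearity.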

Many [Conv(3×3) → ReLU] blocks → Pool → Repeat
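The trade-off behind stacking small kernels can be checked with a little arithmetic. The helpers below are a sketch under simple assumptions: stride 1, no biases, and the same number of channels (64 here, chosen arbitrarily) in and out of every layer.

```python
def receptive_field(num_layers, k=3):
    """Receptive field of `num_layers` stacked k x k convolutions with stride 1."""
    return 1 + num_layers * (k - 1)

def stacked_weights(num_layers, k, channels):
    """Weights (no biases) when every layer maps `channels` -> `channels`."""
    return num_layers * k * k * channels * channels

C = 64
print(receptive_field(2), receptive_field(3))   # 5 7

two_3x3 = stacked_weights(2, 3, C)     # 73728
one_5x5 = stacked_weights(1, 5, C)     # 102400
three_3x3 = stacked_weights(3, 3, C)   # 110592
one_7x7 = stacked_weights(1, 7, C)     # 200704
print(two_3x3, one_5x5, three_3x3, one_7x7)
```

Under these assumptions, the stacked 3×3 layers see the same region as one large kernel with roughly 28% to 45% fewer weights.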

ResNet (2015)

Introduced residual (skip) connections, which make very deep networks (50+ layers) trainable. Each block computes x ← x + f(x), so it only has to learn a correction to its input.

Conv → [Residual Block]× many → Pool → Dense
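The update x ← x + f(x) is simple to sketch. The example below uses small dense matrices for f to keep it short. That is a simplifying assumption: real ResNet blocks build f from convolutions and normalization layers.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = x + f(x), where f is a small two-layer transform (dense here for brevity)."""
    fx = W2 @ relu(W1 @ x)
    return x + fx

rng = np.random.default_rng(0)
x = rng.normal(size=16)                  # hypothetical 16-dimensional feature vector
W1 = rng.normal(0, 0.1, size=(16, 16))

# With the second weight matrix at zero, f(x) = 0 and the block is the identity.
W2_zero = np.zeros((16, 16))
identity_out = residual_block(x, W1, W2_zero)
print(np.allclose(identity_out, x))      # True

# With non-zero weights, the block adds a learned correction to its input.
W2 = rng.normal(0, 0.1, size=(16, 16))
out = residual_block(x, W1, W2)
print(out.shape)                         # (16,)
```

Because a block can fall back to the identity, adding more blocks should not make the network harder to fit, which is the intuition behind training very deep stacks.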

Quick check

What is the key architectural insight from ResNet?

CNN Playground

Design your own CNN! Choose filter types, stack layers, toggle pooling, and watch how the feature maps cascade. See how many features you can extract before flattening.

Quick check

In the playground, what happens when you increase the number of convolutional layers?

(1) Start with 1-2 conv layers with 32 filters.
(2) Use 3×3 kernels: they're efficient and standard.
(3) Use stride=1 and padding='same' to preserve spatial dimensions, then use pooling to downsample.
(4) Stack conv layers in blocks: each block is conv(F) → ReLU → pool. Increase the filter count per block (32→64→128).
(5) Once the spatial dims get small (7×7 or smaller), flatten and add 1-2 dense layers.
(6) Use dropout for regularization.
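These guidelines can be sanity-checked by tracing tensor shapes before writing any model code. The sketch below assumes a 28×28 input, 'same'-padded stride-1 convolutions, and 2×2 pooling after every block; the function name `trace_shapes` is hypothetical.

```python
def trace_shapes(input_hw, filters, pool=2):
    """Return (height, width, channels) after each conv('same') -> ReLU -> pool block."""
    h, w = input_hw
    shapes = []
    for f in filters:
        # A 3x3 conv with stride 1 and 'same' padding leaves h and w unchanged;
        # only the pooling step shrinks the spatial dimensions.
        h, w = h // pool, w // pool
        shapes.append((h, w, f))
    return shapes

shapes = trace_shapes((28, 28), filters=[32, 64])
print(shapes)          # [(14, 14, 32), (7, 7, 64)]

h, w, c = shapes[-1]
print(h * w * c)       # 3136 values go into the first dense layer
```

Two blocks already bring a 28×28 input down to 7×7, so for small images a third pooling stage is usually unnecessary.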

Key Takeaways

  1. A convolution slides a kernel across an image and computes dot products to detect patterns.
  2. Different kernels detect different features: edges, textures, shapes, objects.
  3. Parameter sharing (reusing the same kernel everywhere) makes CNNs efficient and lets them detect a feature wherever it appears (translation equivariance).
  4. Pooling reduces spatial dimensions, cutting computation and providing robustness to small shifts.
  5. ReLU introduces non-linearity, allowing the network to learn complex patterns.
  6. Stacking convolution layers hierarchically builds from simple features (edges) to complex ones (objects).
  7. The pipeline is: Image → Conv(s) → ReLU → Pool → Repeat → Flatten → Dense → Softmax.
  8. CNNs are the foundation of modern computer vision. Mastering them opens doors to image classification, detection, segmentation, and more.

What's Next?

You now understand how CNNs learn visual features through convolution, activation, and pooling. To go deeper, explore:

Convolutional Layers in Depth

Batch norm, dilated convolutions, grouped convolutions.

Famous Architectures

ResNet, DenseNet, EfficientNet, Vision Transformers.

Advanced Techniques

Data augmentation, transfer learning, fine-tuning.

Beyond Classification

Object detection (YOLO, R-CNN), segmentation, pose estimation.
