Convolutions & CNNs — learning visual features
Convolutional Neural Networks are the foundational architecture for computer vision. Instead of treating images as flat arrays of pixels, CNNs learn hierarchical features through sliding kernels, pooling, and activation functions. Understand how a simple 3×3 kernel becomes a powerful feature detector, and how stacking them builds a complete image classifier.
What is Convolution?
A convolution is a mathematical operation where a small matrix (the kernel) slides across an image or feature map. At each position, we compute the dot product between the kernel and the overlapping region. This produces a single value in the output.
The beauty of convolution is that it's parameter-efficient. Instead of a fully connected layer with millions of weights, a single 3×3 kernel has only 9 weights and detects the same pattern everywhere in the image. This weight sharing also makes the layer translation-equivariant: when a feature shifts in the input, its response shifts by the same amount in the output, so the network can find the feature wherever it appears.
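To make the parameter-efficiency claim concrete, here is a rough count. It assumes a 28×28 grayscale input (the size used later in this lesson) and a hypothetical layer width of 100 units or filters; both helper functions are illustrative names, not part of any library.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of a conv layer; note it does not depend on image size."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Hypothetical comparison: 100 dense units vs. 100 3x3 filters on a 28x28 grayscale image.
dense = dense_params(28 * 28, 100)  # 784 * 100 + 100
conv = conv_params(3, 1, 100)       # 9 * 1 * 100 + 100
print(dense, conv)  # 78500 1000
```

The gap widens quickly with image size, because the dense count scales with the number of pixels while the convolutional count does not.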
Convolution formula
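output[i, j] = Σ_{k,l} input[i+k, j+l] × kernel[k, l]
Strictly speaking this is cross-correlation, since the kernel is not flipped; deep learning libraries use this form and call it convolution.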
Stride
How many pixels the kernel moves at each step. With stride 2 the kernel skips every other position, which roughly halves the output's height and width.
Padding
Zeros added around the border of the input. With padding P = (K − 1)/2 and stride 1, the output keeps the input's spatial dimensions, and the kernel can be centered on boundary pixels.
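The formula, stride, and padding can be put together in a few lines of NumPy. This is a minimal, deliberately slow sketch for a single-channel image; the function name `conv2d` and the specific edge-detection kernel are assumptions made for illustration.

```python
import numpy as np


def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2D convolution (cross-correlation) of a single-channel image."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    if padding > 0:
        # Surround the image with `padding` rows and columns of zeros.
        image = np.pad(image, padding, mode="constant", constant_values=0.0)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            output[i, j] = np.sum(patch * kernel)  # dot product with the kernel
    return output


# 5x5 checkerboard input, matching the 5x5 input size used in this lesson.
checkerboard = np.indices((5, 5)).sum(axis=0) % 2
# A common edge-detection kernel (an assumption; the lesson does not list its values).
edge_kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])

print(conv2d(checkerboard, edge_kernel).shape)             # (3, 3)
print(conv2d(checkerboard, edge_kernel, padding=1).shape)  # (5, 5)
print(conv2d(checkerboard, edge_kernel, stride=2).shape)   # (2, 2)
```

Real frameworks vectorize this loop and handle multiple channels and batches, but the arithmetic per output cell is the same.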
Interactive Convolution
[Interactive demo: a 5×5 checkerboard input convolved with an edge-detection kernel produces a 3×3 output. Stride and padding are adjustable.]
Output dimensions: ⌊(H − K + 2P) / S⌋ + 1 = ⌊(5 − 3 + 2×0) / 1⌋ + 1 = 3 (round down when the stride does not divide evenly)
H = input height, K = kernel size, P = padding, S = stride
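The output-size formula is easy to check in code. The sketch below assumes square inputs and kernels and uses integer (floor) division, which is how the formula is normally applied when the stride does not divide evenly; `conv_output_size` is an illustrative name.

```python
def conv_output_size(h: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output height (or width) for input size h, kernel k, padding p, stride s."""
    return (h - k + 2 * p) // s + 1


print(conv_output_size(5, 3))             # 3: the 5x5 -> 3x3 case above
print(conv_output_size(5, 3, p=1))        # 5: padding 1 preserves the size
print(conv_output_size(5, 3, s=2))        # 2: stride 2 roughly halves it
print(conv_output_size(28, 3, p=1, s=2))  # 14
```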
Quick check
What does a kernel in convolution do?
Kernels & Feature Maps
Different kernels detect different patterns. An edge detection kernel highlights boundaries, a blur kernel smooths, and a sharpen kernel enhances edges. By learning the right kernels, a CNN can detect semantic features: textures, shapes, objects. Below, we apply different filters to the same input and see how feature maps change.
Feature Maps from Different Filters
[Interactive demo: a 28×28 input image (downsampled for display), a selectable 3×3 kernel (edge detection shown), and the resulting feature map after ReLU, which zeroes out negative values.]
How it works: as the kernel slides across the input image, each position yields one value in the output feature map, and that value is large wherever the image region matches the kernel's pattern. Different kernels respond to different patterns: edges, textures, colors, and more.
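A small sketch of the same idea: three hand-written 3×3 kernels applied to one input, with ReLU applied afterwards. The kernel values are common textbook choices and the step-edge test image is made up for illustration; neither comes from the demo itself.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Textbook 3x3 kernels (assumed values, not taken from the interactive demo).
KERNELS = {
    "edge": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float),
    "blur": np.full((3, 3), 1.0 / 9.0),
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
}


def feature_map(image, kernel, relu=True):
    """Slide `kernel` over `image` (stride 1, no padding), optionally applying ReLU."""
    windows = sliding_window_view(image, kernel.shape)  # shape (H-2, W-2, 3, 3)
    out = np.einsum("ijkl,kl->ij", windows, kernel)     # dot product per window
    return np.maximum(out, 0.0) if relu else out


# Toy input: dark left half, bright right half, i.e. one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

for name, kernel in KERNELS.items():
    fmap = feature_map(image, kernel)
    print(name, fmap.shape, float(fmap.max()))
```

On this input the edge kernel responds only in the column next to the boundary and gives zero in flat regions, while the blur kernel spreads the step over neighbouring columns.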
Quick check
Why can a CNN learn to detect objects by learning kernels?
Pooling & Downsampling
After convolution, we apply pooling to reduce spatial dimensions. This serves several purposes: it cuts computation and memory in later layers, and it makes the output less sensitive to small translations of the input.
Max pooling takes the maximum value in each window (e.g., 2×2). This preserves the strongest activation, which is useful for detecting whether a feature is present. Average pooling takes the mean, providing a smoother summary.
Computational benefit
2×2 pooling with stride 2 halves both height and width, so each feature map has 4× fewer values. Fewer computations downstream.
Robustness
Small shifts in the image often leave the pooled output unchanged, which gives approximate invariance to small translations.
Pooling Operations
[Interactive demo: an 8×8 input pooled with a 2×2 window (stride = window size) produces a 4×4 output.]
Max Pooling: Takes the maximum value in each window. Preserves the strongest features.
Average Pooling: Takes the average value in each window. Smoother representation of features.
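Both pooling modes fit in one short function. This sketch assumes non-overlapping windows (stride equal to window size, as in the demo) and crops any leftover rows or columns; `pool2d` is an illustrative name.

```python
import numpy as np


def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: stride equals the window size."""
    x = np.asarray(x, dtype=float)
    out_h, out_w = x.shape[0] // size, x.shape[1] // size
    # Drop rows/columns that do not fill a complete window.
    x = x[:out_h * size, :out_w * size]
    # Split into (out_h, size, out_w, size) blocks, then reduce inside each block.
    blocks = x.reshape(out_h, size, out_w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown pooling mode: {mode}")


x = np.arange(64).reshape(8, 8)       # 8x8 input with values 0..63
print(pool2d(x, 2, "max").shape)      # (4, 4)
print(pool2d(x, 2, "max")[0, 0])      # 9.0, the max of {0, 1, 8, 9}
print(pool2d(x, 2, "avg")[0, 0])      # 4.5, the mean of {0, 1, 8, 9}
```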
Quick check
What is the main computational benefit of max pooling?
The Complete CNN Pipeline
A complete CNN pipeline flows: input → convolution → ReLU → pooling → (repeat) → flatten → dense layers → softmax. Each stage builds on the previous one. Let's walk through all 6 steps with real feature visualizations.
Guided 6-Step Walkthrough
Input Image
Raw pixel values (28×28 grayscale). This is what the CNN sees.
28×28 Pixel Grid (Downsampled for Display)
Value range: 0 (black) to 255 (white)
Key Concepts:
- Convolution: Sliding dot-product with filters
- ReLU: Non-linearity (helps learn complex patterns)
- Pooling: Downsampling (reduces computation)
- Flatten: 2D → 1D (feeds into dense layers)
- Classification: Final softmax produces probabilities
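The key concepts above chain together into a single forward pass. The sketch below uses random, untrained weights, so the probabilities are meaningless; the number of filters (4), the number of classes (10), and the `forward` function itself are assumptions chosen only to show how the shapes flow from a 28×28 input to a probability vector.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)


def forward(image, kernels, dense_w, dense_b):
    """conv -> ReLU -> 2x2 max pool -> flatten -> dense -> softmax."""
    # Convolution: (28, 28) image with F kernels of 3x3 -> (F, 26, 26).
    windows = sliding_window_view(image, kernels.shape[1:])
    fmaps = np.einsum("ijkl,fkl->fij", windows, kernels)
    # ReLU: zero out negative activations.
    fmaps = np.maximum(fmaps, 0.0)
    # 2x2 max pooling with stride 2 -> (F, 13, 13).
    f, h, w = fmaps.shape
    pooled = fmaps[:, :h // 2 * 2, :w // 2 * 2]
    pooled = pooled.reshape(f, h // 2, 2, w // 2, 2).max(axis=(2, 4))
    # Flatten to a 1D feature vector.
    flat = pooled.reshape(-1)
    # Dense layer followed by a numerically stable softmax.
    logits = dense_w @ flat + dense_b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


image = rng.random((28, 28))                  # stand-in for pixels scaled to [0, 1]
kernels = rng.normal(size=(4, 3, 3))          # 4 untrained 3x3 filters
dense_w = rng.normal(scale=0.01, size=(10, 4 * 13 * 13))
dense_b = np.zeros(10)

probs = forward(image, kernels, dense_w, dense_b)
print(probs.shape, float(probs.sum()))
```

Training would adjust `kernels`, `dense_w`, and `dense_b` by backpropagation; the forward structure stays the same.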
Quick check
Why do we apply ReLU after convolution?
CNN Architecture Patterns
LeNet (1998)
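One of the first CNNs (LeNet-5). 2 conv layers, 2 pooling layers, and 3 dense layers, trained on handwritten digits (MNIST). The original used tanh-style activations and average pooling; modern reimplementations often swap in ReLU and max pooling.
Conv → Tanh → Pool → Conv → Tanh → Pool → Dense → Dense → Dense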
AlexNet (2012)
Won the 2012 ImageNet challenge with 5 conv layers, ReLU, max pooling, and dropout. Brought deep CNNs into the mainstream.
Conv → ReLU → Pool → Conv → ReLU → Pool → [Conv → ReLU] × 3 → Pool → Dense × 3
VGG (2014)
Showed that depth matters. Stacks of small 3×3 kernels replace large kernels, covering the same receptive field with fewer parameters and more non-linearities.
Many [Conv(3×3) → ReLU] blocks → Pool → Repeat
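The trade-off behind stacking 3×3 kernels can be checked with simple arithmetic. This sketch assumes stride 1, no dilation, and the same channel count in and out of every layer (64 here, chosen only as an example); biases are ignored and both function names are illustrative.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of stacked stride-1 convs: each layer adds (kernel - 1)."""
    return 1 + num_layers * (kernel - 1)


def stacked_weights(num_layers: int, kernel: int, channels: int) -> int:
    """Weight count (no biases) with `channels` inputs and outputs per layer."""
    return num_layers * kernel * kernel * channels * channels


print(stacked_receptive_field(2))   # 5: two 3x3 layers see a 5x5 region
print(stacked_receptive_field(3))   # 7: three 3x3 layers see a 7x7 region
print(stacked_weights(3, 3, 64))    # 110592 weights for three 3x3 layers
print(stacked_weights(1, 7, 64))    # 200704 weights for one 7x7 layer
```

Three 3×3 layers cover the same 7×7 region as a single 7×7 layer with roughly 45% fewer weights (27 vs. 49 per channel pair).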
ResNet (2015)
Introduced residual connections to train very deep networks (50+ layers). x ← x + f(x).
Conv → [Residual Block]× many → Pool → Dense
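The residual update x + f(x) can be sketched in a few lines. For brevity this version uses two small dense transforms for f rather than the convolutions of a real ResNet block, and it omits batch normalization and the ReLU that usually follows the addition; all names and sizes here are assumptions.

```python
import numpy as np


def residual_block(x, w1, w2):
    """y = x + f(x), with f(x) = relu(x @ w1) @ w2 (a simplified stand-in for conv layers)."""
    hidden = np.maximum(x @ w1, 0.0)
    return x + hidden @ w2  # skip connection adds the input back


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))               # batch of 4 feature vectors
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = np.zeros((8, 8))                     # f(x) = 0 when the last layer is all zeros

y = residual_block(x, w1, w2)
print(np.allclose(y, x))  # True: with f(x) = 0 the block is the identity
```

Because a block can fall back to the identity this easily, adding more blocks does not have to make the network worse, which is part of why very deep stacks remain trainable.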
Quick check
What is the key architectural insight from ResNet?
CNN Playground
Design your own CNN! Choose filter types, stack layers, toggle pooling, and watch how the feature maps cascade. See how many features you can extract before flattening.
Quick check
In the playground, what happens when you increase the number of convolutional layers?
Key Takeaways
- 1. A convolution slides a kernel across an image and computes dot products to detect patterns.
- 2. Different kernels detect different features: edges, textures, shapes, objects.
- 3. Parameter sharing (reusing the same kernel everywhere) makes CNNs efficient and translation-equivariant: a shifted feature gives a correspondingly shifted response.
- 4. Pooling reduces spatial dimensions, cutting computation and adding robustness to small shifts.
- 5. ReLU introduces non-linearity, allowing the network to learn complex patterns.
- 6. Stacking convolution layers hierarchically builds from simple features (edges) to complex ones (objects).
- 7. The pipeline is: Image → Conv(s) → ReLU → Pool → Repeat → Flatten → Dense → Softmax.
- 8. CNNs are the foundation of modern computer vision. Mastering them opens doors to image classification, detection, segmentation, and more.
What's Next?
You now understand how CNNs learn visual features through convolution, activation, and pooling. To go deeper, explore:
Convolutional Layers in Depth
Batch norm, dilated convolutions, grouped convolutions.
Famous Architectures
ResNet, DenseNet, EfficientNet, Vision Transformers.
Advanced Techniques
Data augmentation, transfer learning, fine-tuning.
Beyond Classification
Object detection (YOLO, R-CNN), segmentation, pose estimation.