Sparse Autoencoders & Interpretability
Interpretability studies what neural networks represent internally. Sparse autoencoders (SAEs) decompose a layer's activations into sparse features that are otherwise tangled across many neurons.
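To make the idea concrete, here is a minimal sketch of a sparse autoencoder's forward pass and training objective in plain Python. It assumes the common formulation (a ReLU encoder, a linear decoder, and an L1 penalty on the feature activations); the function names, weights, and the `l1_coeff` value are hypothetical and chosen only so the arithmetic is easy to follow.

```python
# Minimal sparse-autoencoder forward pass in plain Python (no dependencies).
# All names and weights are hypothetical, chosen for illustration only.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    # Multiply matrix m (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec):
    """Encode x into feature activations, then reconstruct x from them."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre)
    x_hat = matvec(w_dec, features)
    return features, x_hat

def sae_loss(x, x_hat, features, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that rewards sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return recon + l1_coeff * sparsity

# Toy setup: a 2-neuron activation expanded into 3 candidate features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # 3 features x 2 neurons
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]     # 2 neurons x 3 features

x = [2.0, 0.0]
features, x_hat = sae_forward(x, w_enc, b_enc, w_dec)
loss = sae_loss(x, x_hat, features, l1_coeff=0.1)
```

In this toy case only one of the three features is active and the input is reconstructed exactly, so the loss comes entirely from the sparsity term. A real SAE learns `w_enc`, `b_enc`, and `w_dec` by gradient descent over many recorded activations rather than using hand-picked weights.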
Mental model
Look for the model’s internal concepts, not just its final answer.
Understanding features and circuits helps with debugging, safety, steering, and trust in high-stakes models.
Feature clarity: 70% modeled signal (balanced)
Coverage: 59% modeled signal (balanced)
Causal confidence: 53% modeled signal (balanced)
Concept pipeline
Build the idea in four moves
Interactive lab
Extract interpretable features from a hidden layer.
Record
Collect activations from a model layer.
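The Record step can be sketched as follows, assuming a toy two-layer network written in plain Python. The weights, the `forward` function, and its `record` parameter are hypothetical stand-ins for whatever model and layer are being studied.

```python
# Toy two-layer network in plain Python; weights are hypothetical.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    # Multiply matrix m (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]   # hidden layer: 2 inputs -> 3 neurons
W2 = [[1.0, 1.0, 1.0]]                         # output layer: 3 neurons -> 1 output

def forward(x, record=None):
    """Run the network; if `record` is a list, append the hidden activations."""
    hidden = relu(matvec(W1, x))
    if record is not None:
        record.append(hidden)
    return matvec(W2, hidden)

# Run a small batch of inputs and keep every hidden-layer activation.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
activations = []
outputs = [forward(x, record=activations) for x in inputs]
```

After the loop, `activations` holds one hidden-layer vector per input; a dataset like this is what a sparse autoencoder is trained on. In a framework such as PyTorch the same role is usually played by a forward hook registered with `register_forward_hook`, so the model's own code does not need to change.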
Focus lens
The part that usually clicks late
Superposition
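A layer can represent more features than it has neurons by giving each feature its own direction in activation space, so many features share fewer neurons.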
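A small numerical sketch may help here. It assumes three hypothetical features stored in just two neurons, with each feature assigned a direction 120 degrees from the others, and a readout that projects onto each direction and applies a ReLU. This geometry is an illustrative choice, not something a trained model is guaranteed to learn.

```python
import math

# Three hypothetical feature directions packed into two neurons, 120 degrees apart.
directions = [
    (math.cos(math.radians(a)), math.sin(math.radians(a)))
    for a in (0, 120, 240)
]

def encode(features):
    """Write feature values into the 2-neuron activation as a weighted sum."""
    return [
        sum(f * d[k] for f, d in zip(features, directions))
        for k in range(2)
    ]

def readout(neurons):
    """Read each feature by projecting onto its direction, then applying ReLU."""
    return [
        max(0.0, d[0] * neurons[0] + d[1] * neurons[1])
        for d in directions
    ]

one_active = readout(encode([1.0, 0.0, 0.0]))   # sparse input: one feature on
two_active = readout(encode([1.0, 1.0, 0.0]))   # denser input: two features on
```

With a single active feature, the readout recovers it cleanly (approximately 1 for that feature and 0 for the others), because the negative overlap with the other directions is clipped by the ReLU. With two features active at once, each is read back at roughly 0.5 instead of 1: the shared neurons cause interference. This is why superposition works best when features are sparse, and why individual neurons end up responding to several unrelated features.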
Knowledge check
What is superposition?
Next horizon