What Happens to a Neural Network's Geometry When You Change How It Learns?

Or: the same architecture + the same data + different learning algorithms = radically different internal structure.

A gap in the Platonic Representation Hypothesis

The Platonic Representation Hypothesis (Huh et al., ICML 2024) claims that different neural networks converge toward the same internal representation of reality. They tested this across dozens of architectures — CNNs, ViTs, language models — and found increasing alignment as models get bigger. It's a compelling result. But every single model they tested was trained with backpropagation. ...

April 2, 2026 · 11 min · Austin T. O'Quinn

Geometric Similarity Is Blind to Computational Structure

This post starts with a simple question — how would you tell if two neural networks learned the same thing? — and builds to a case where the standard answer is dangerously wrong.

How would you compare two networks?

Suppose you train two neural networks on the same task from different random initializations, and both get 99% accuracy. Did they learn the same thing? You can't just compare the raw activation values. To see why, think about a simpler example. Imagine two spreadsheets tracking student performance. One has columns [math_score, reading_score]. The other has columns [total_score, score_difference]. Both contain the same information — you can convert between them with simple arithmetic — but the raw numbers look completely different. A student with (90, 80) in the first spreadsheet would be (170, 10) in the second. ...
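The spreadsheet example is just a change of basis, which a few lines of numpy make concrete. The conversion matrix `M` below is an assumed illustration, not from the post:

```python
import numpy as np

# Each row is a student: [math_score, reading_score]
scores = np.array([[90.0, 80.0],
                   [70.0, 95.0]])

# Change of basis to [total_score, score_difference]
M = np.array([[1.0,  1.0],
              [1.0, -1.0]])
alt = scores @ M.T  # (90, 80) -> (170, 10)

# The raw numbers look completely different...
assert not np.allclose(scores, alt)
# ...but an invertible linear map converts back exactly: same information.
recovered = alt @ np.linalg.inv(M).T
assert np.allclose(recovered, scores)
```

Because the two coordinate systems are related by an invertible linear map, any comparison based on raw activation values alone will call them different even though they encode identical information.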

April 1, 2026 · 11 min · Austin T. O'Quinn

Mechanistic Interpretability: Circuits, Superposition, and Sparse Autoencoders

Prior reading: Probing | Why Sparsity? | Platonic Forms

What Mechanistic Interpretability Is Trying to Do

Mechanistic interpretability (mech interp) aims to reverse-engineer neural networks into human-understandable components. Not "what features does this layer represent?" (that's probing — see the probing post) but "what algorithm does this network implement, and how?" The analogy: probing tells you a chip has memory. Mech interp tells you it's a flip-flop built from NAND gates.

Why It Matters for Safety

If we can understand the mechanism by which a model produces an output, we can: ...

July 16, 2025 · 8 min · Austin T. O'Quinn

Platonic Forms in Near-Capacity Models

Prior reading: Probing: What Do Models Actually Know?

The Platonic Representation Hypothesis

As models scale and train on more data, their internal representations appear to converge — different architectures, different modalities, even different training objectives produce increasingly similar feature spaces. Are models discovering the "true structure" of the world?

What This Looks Like

Vision models and language models learn similar geometric structures. Larger models have more similar representations to each other than smaller ones. Cross-modal transfer works better as models scale.

The Platonic Analogy

Plato's forms: there exist ideal, abstract representations of concepts that physical instances approximate. Near-capacity models may be approximating these forms — not because they're philosophical, but because the data constrains the geometry of good representations. ...

July 2, 2025 · 2 min · Austin T. O'Quinn

Probing: What Do Models Actually Know?

Prior reading: Gradient Descent and Backpropagation | Why Sparsity?

What Is Probing?

Train a simple classifier (usually linear) on a model's internal representations to test whether specific information is encoded there. If a linear probe can extract "is this sentence toxic?" from layer 12 activations, the model represents toxicity at that layer.

How It Works

1. Freeze the model
2. Extract activations at a chosen layer for a labeled dataset
3. Train a linear (or shallow) classifier on those activations
4. Measure accuracy

High accuracy → the information is linearly accessible in the representation. ...
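The probing recipe can be sketched in a few lines of numpy. The activations here are synthetic stand-ins for a frozen model's layer outputs, with the label direction planted by hand — a toy setup assumed for illustration, not the post's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for steps 1-2: synthetic "frozen layer" activations for a labeled dataset.
n, d = 400, 64
labels = rng.integers(0, 2, size=n)        # e.g. toxic vs. non-toxic
acts = rng.normal(size=(n, d))
acts[:, 0] += 4.0 * labels                 # one direction linearly encodes the label

# Step 3: fit a linear probe by least squares on +/-1 targets (bias via constant column).
X = np.hstack([acts, np.ones((n, 1))])
targets = 2.0 * labels - 1.0
split = n // 2
w, *_ = np.linalg.lstsq(X[:split], targets[:split], rcond=None)

# Step 4: measure held-out accuracy.
preds = X[split:] @ w > 0
acc = (preds == labels[split:].astype(bool)).mean()
print(f"probe accuracy: {acc:.2f}")
```

Because the label is encoded along a single linear direction, the held-out accuracy comes out high — exactly the signature a real probe looks for in a model's activations.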

June 18, 2025 · 2 min · Austin T. O'Quinn