What Happens to a Neural Network's Geometry When You Change How It Learns?

Or: the same architecture + the same data + different learning algorithms = radically different internal structure.

A gap in the Platonic Representation Hypothesis

The Platonic Representation Hypothesis (Huh et al., ICML 2024) claims that different neural networks converge toward the same internal representation of reality. They tested this across dozens of architectures — CNNs, ViTs, language models — and found increasing alignment as models get bigger. It's a compelling result. But every single model they tested was trained with backpropagation. ...

April 2, 2026 · 11 min · Austin T. O'Quinn

Geometric Similarity Is Blind to Computational Structure

This post starts with a simple question — how would you tell if two neural networks learned the same thing? — and builds to a case where the standard answer is dangerously wrong.

How would you compare two networks?

Suppose you train two neural networks on the same task from different random initializations, and both get 99% accuracy. Did they learn the same thing? You can't just compare the raw activation values. To see why, think about a simpler example. Imagine two spreadsheets tracking student performance. One has columns [math_score, reading_score]. The other has columns [total_score, score_difference]. Both contain the same information — you can convert between them with simple arithmetic — but the raw numbers look completely different. A student with (90, 80) in the first spreadsheet would be (170, 10) in the second. ...
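The spreadsheet analogy can be made concrete in a few lines of NumPy. This is a minimal sketch (the data and the transform matrix are made up for illustration, not taken from the post): the two "spreadsheets" are related by a single invertible linear map, so they carry identical information even though the raw numbers differ.

```python
import numpy as np

# Hypothetical student data: columns are [math_score, reading_score].
scores = np.array([[90.0, 80.0], [70.0, 95.0], [60.0, 60.0]])

# Change of basis to [total_score, score_difference].
T = np.array([[1.0, 1.0], [1.0, -1.0]])
transformed = scores @ T.T
print(transformed[0])  # first student: [170., 10.]

# The raw numbers look completely different, but one invertible linear
# map recovers the original columns exactly — same information, new basis.
recovered = transformed @ np.linalg.inv(T).T
assert np.allclose(recovered, scores)
```

The same issue arises with network activations: two networks can encode identical information in bases that make their raw activation values look unrelated, which is why naive value-by-value comparison fails.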

April 1, 2026 · 11 min · Austin T. O'Quinn

Platonic Forms in Near-Capacity Models

Prior reading: Probing: What Do Models Actually Know?

The Platonic Representation Hypothesis

As models scale and train on more data, their internal representations appear to converge — different architectures, different modalities, even different training objectives produce increasingly similar feature spaces. Are models discovering the "true structure" of the world?

What This Looks Like

- Vision models and language models learn similar geometric structures
- Larger models have more similar representations to each other than smaller ones
- Cross-modal transfer works better as models scale

The Platonic Analogy

Plato's forms: there exist ideal, abstract representations of concepts that physical instances approximate. Near-capacity models may be approximating these forms — not because they're philosophical, but because the data constrains the geometry of good representations. ...

July 2, 2025 · 2 min · Austin T. O'Quinn

Probing: What Do Models Actually Know?

Prior reading: Gradient Descent and Backpropagation | Why Sparsity?

What Is Probing?

Train a simple classifier (usually linear) on a model's internal representations to test whether specific information is encoded there. If a linear probe can extract "is this sentence toxic?" from layer 12 activations, the model represents toxicity at that layer.

How It Works

1. Freeze the model
2. Extract activations at a chosen layer for a labeled dataset
3. Train a linear (or shallow) classifier on those activations
4. Measure accuracy

High accuracy → the information is linearly accessible in the representation. ...
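The four-step recipe can be sketched with scikit-learn. This is a toy illustration, not code from the post: the "activations" are synthetic, with a label that is linearly encoded by construction, standing in for frozen-model hidden states.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for step 2: activations from a frozen model at one layer.
# 200 examples, 64-dim hidden states; the label is (by construction)
# readable along a single linear direction.
direction = rng.normal(size=64)
activations = rng.normal(size=(200, 64))
labels = (activations @ direction > 0).astype(int)

# Step 3: fit a linear probe on a training split of the activations.
probe = LogisticRegression(max_iter=1000).fit(activations[:150], labels[:150])

# Step 4: held-out accuracy. High accuracy means the information is
# linearly accessible in this representation.
acc = probe.score(activations[150:], labels[150:])
print(f"probe accuracy: {acc:.2f}")
```

With real models, step 2 would hook a chosen layer and cache its outputs; everything downstream of that is exactly this simple.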

June 18, 2025 · 2 min · Austin T. O'Quinn

Why Sparsity, and How Do We Get It?

Prior reading: Gradient Descent and Backpropagation

What Is Sparsity?

A sparse representation has mostly zeros. Instead of every neuron firing for every input, only a small subset activates.

Why Sparsity Is Good

- Interpretability: If only 50 of 10,000 features fire, you have a chance of understanding what they represent
- Efficiency: Sparse computations are cheaper
- Generalization: Sparse models tend to overfit less — they can't memorize with fewer active parameters
- Disentanglement: Sparse features tend to correspond to more independent, meaningful concepts

How We Achieve Sparsity

Architectural: ...
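The "50 of 10,000 features" picture can be sketched with a top-k activation, one common architectural route to sparsity. A minimal NumPy version (the function name and data are illustrative, not from the post):

```python
import numpy as np

def top_k_sparsify(activations: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries per row; zero the rest."""
    out = np.zeros_like(activations)
    idx = np.argsort(np.abs(activations), axis=1)[:, -k:]
    rows = np.arange(activations.shape[0])[:, None]
    out[rows, idx] = activations[rows, idx]
    return out

rng = np.random.default_rng(0)
dense = rng.normal(size=(4, 10_000))   # every "neuron" fires
sparse = top_k_sparsify(dense, k=50)   # only 50 of 10,000 survive per row

print((sparse != 0).sum(axis=1))  # [50 50 50 50]
```

Regularization-based routes (e.g. an L1 penalty on activations) reach a similar endpoint softly, by pushing most activations toward zero during training rather than zeroing them outright.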

March 12, 2025 · 1 min · Austin T. O'Quinn