Prior reading: Probing: What Do Models Actually Know?
The Platonic Representation Hypothesis
As models scale and train on more data, their internal representations appear to converge — different architectures, different modalities, even different training objectives produce increasingly similar feature spaces.
Are models discovering the "true structure" of the world?
What This Looks Like
- Vision models and language models learn similar geometric structures
- Representations of larger models are more similar to one another than those of smaller models
- Cross-modal transfer works better as models scale
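Claims like those above rest on some way of scoring how similar two feature spaces are for the same inputs. One common family of measurements is mutual nearest-neighbor alignment: for each input, compare its nearest neighbors in one model's representation space against its neighbors in the other's. A minimal sketch, assuming cosine similarity; the function name and the exact metric choice are illustrative, not any specific paper's implementation:

```python
import numpy as np

def mutual_knn_alignment(A, B, k=5):
    """Fraction of shared k-nearest neighbors for the same inputs
    represented in two different models' feature spaces.

    A: (n, d_a), B: (n, d_b); row i of each matrix is the same input.
    Returns a score in [0, 1]; 1.0 means identical local geometry.
    """
    def knn_indices(X):
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # cosine similarity
        sim = Xn @ Xn.T
        np.fill_diagonal(sim, -np.inf)                      # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]
    na, nb = knn_indices(A), knn_indices(B)
    return float(np.mean([len(set(na[i]) & set(nb[i])) / k
                          for i in range(len(A))]))
```

Note that this score only sees local neighborhood structure, which is part of the appeal: it is invariant to rotations and rescalings of either space, so it can detect shared geometry between models whose raw coordinates are incomparable.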
The Platonic Analogy
Plato's forms: there exist ideal, abstract representations of concepts that physical instances only approximate. Sufficiently capable models may be approximating these forms — not because they're philosophical, but because the data itself constrains the geometry of any good representation.
Safety Implications
If true (optimistic): Models are converging on a shared, stable representation of reality. Interpretability tools that work on one model may transfer. Safety properties verified in representation space may generalize.
If true (pessimistic): All models converge to the same blind spots. Systematic biases in representation become universal and harder to detect.
If partially true: The interesting case. Models converge on some structures but diverge on others — and the divergences may be exactly where safety-relevant edge cases live.
Open Questions
- Is convergence an artifact of similar training data, or a deeper phenomenon?
- Do representations converge on human-interpretable concepts, or on alien features that happen to be useful?
- How do we test this empirically beyond correlation of learned features?
- Does convergence depend on the learning algorithm? See What Happens to a Neural Network's Geometry When You Change How It Learns? for empirical evidence that it does — different credit assignment mechanisms produce radically different internal geometry even on the same data.
- Can we even measure convergence reliably? See CKA Says Your Models Are the Same. They Aren't. — CKA (centered kernel alignment), the dominant metric for measuring representational similarity, can report 0.97 between functionally incompatible models.
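The measurement worry in the last question follows from CKA's invariances. Linear CKA is invariant to orthogonal transformations of either representation, and flipping the sign of a single feature column is an orthogonal transform: the score stays at 1.0 even though any fixed probe reading that feature now gets the opposite answer. A minimal sketch of linear CKA (the function name and demo are mine, not from the linked post):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (n_samples, d1), Y: (n_samples, d2), rows are the same inputs.
    Returns a similarity in [0, 1].
    """
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
X_flipped = X.copy()
X_flipped[:, 0] *= -1
# CKA is 1.0 up to floating-point error despite the flipped feature
print(linear_cka(X, X_flipped))
```

This is one concrete way "0.97 between functionally incompatible models" can happen: the invariances that make CKA robust to benign coordinate changes also make it blind to transformations that matter functionally.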