Prior reading: Probing: What Do Models Actually Know?
The Platonic Representation Hypothesis
As models scale and train on more data, their internal representations appear to converge — different architectures, different modalities, even different training objectives produce increasingly similar feature spaces.
Are models discovering the "true structure" of the world?
What This Looks Like
- Vision models and language models learn similar geometric structures
- Representations of larger models are more similar to one another than those of smaller models
- Cross-modal transfer works better as models scale
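Claims like those above rest on some way of scoring how similar two feature spaces are for the same inputs. One common family of measurements is mutual nearest-neighbor alignment: for each input, compare its nearest neighbors in one model's representation space against its neighbors in the other's. A minimal sketch, assuming cosine similarity; the function name and the exact metric choice are illustrative, not any specific paper's implementation:

```python
import numpy as np

def mutual_knn_alignment(A, B, k=5):
    """Fraction of shared k-nearest neighbors for the same inputs
    represented in two different models' feature spaces.

    A: (n, d_a), B: (n, d_b); row i of each matrix is the same input.
    Returns a score in [0, 1]; 1.0 means identical local geometry.
    """
    def knn_indices(X):
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # cosine similarity
        sim = Xn @ Xn.T
        np.fill_diagonal(sim, -np.inf)                      # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]
    na, nb = knn_indices(A), knn_indices(B)
    return float(np.mean([len(set(na[i]) & set(nb[i])) / k
                          for i in range(len(A))]))
```

Note that this score only sees local neighborhood structure, which is part of the appeal: it is invariant to rotations and rescalings of either space, so it can detect shared geometry between models whose raw coordinates are incomparable.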
The Platonic Analogy
Plato's forms: there exist ideal, abstract representations of concepts that physical instances only approximate. Sufficiently capable models may be approximating these forms — not because they're philosophical, but because the data itself constrains the geometry of any good representation.
Safety Implications
If true (optimistic): Models are converging on a shared, stable representation of reality. Interpretability tools that work on one model may transfer. Safety properties verified in representation space may generalize.
If true (pessimistic): All models converge to the same blind spots. Systematic biases in representation become universal and harder to detect.
If partially true: The interesting case. Models converge on some structures but diverge on others — and the divergences may be exactly where safety-relevant edge cases live.
Open Questions
- Is convergence an artifact of similar training data, or a deeper phenomenon?
- Do representations converge on human-interpretable concepts, or on alien features that happen to be useful?
- How do we test this empirically beyond correlation of learned features?
- Does convergence depend on the learning algorithm? See What Happens to a Neural Network's Geometry When You Change How It Learns? for empirical evidence that it does — different credit assignment mechanisms produce radically different internal geometry even on the same data.
- Can we even measure convergence reliably? See CKA Says Your Models Are the Same. They Aren't. — CKA (centered kernel alignment), the dominant metric for measuring representational similarity, can report 0.97 between functionally incompatible models.
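The measurement worry in the last question follows from CKA's invariances. Linear CKA is invariant to orthogonal transformations of either representation, and flipping the sign of a single feature column is an orthogonal transform: the score stays at 1.0 even though any fixed probe reading that feature now gets the opposite answer. A minimal sketch of linear CKA (the function name and demo are mine, not from the linked post):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (n_samples, d1), Y: (n_samples, d2), rows are the same inputs.
    Returns a similarity in [0, 1].
    """
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
X_flipped = X.copy()
X_flipped[:, 0] *= -1
# CKA is 1.0 up to floating-point error despite the flipped feature
print(linear_cka(X, X_flipped))
```

This is one concrete way "0.97 between functionally incompatible models" can happen: the invariances that make CKA robust to benign coordinate changes also make it blind to transformations that matter functionally.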