Prior reading: Probing: What Do Models Actually Know?

The Platonic Representation Hypothesis

As models scale and train on more data, their internal representations appear to converge — different architectures, different modalities, even different training objectives produce increasingly similar feature spaces.

Are models discovering the "true structure" of the world?

What This Looks Like

  • Vision models and language models learn similar geometric structures
  • Larger models' representations are more similar to each other than smaller models' are
  • Cross-modal transfer works better as models scale
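One way to make "similar geometric structures" concrete is a mutual nearest-neighbor alignment score: embed the same inputs with both models and measure how much their local neighborhoods overlap. This is a minimal sketch of that idea (the function names and the plain Euclidean/overlap choices here are illustrative assumptions, not any paper's exact metric):

```python
import numpy as np

def knn_sets(X, k):
    """Indices of each row's k nearest neighbors (Euclidean, self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def mutual_knn_alignment(X, Y, k=5):
    """Mean overlap of k-NN sets computed in two embedding spaces.

    X: (n, d1) and Y: (n, d2) — two models' embeddings of the SAME n inputs
    (the feature dimensions can differ). Returns a value in [0, 1];
    1 means identical local neighborhood structure.
    """
    nx, ny = knn_sets(X, k), knn_sets(Y, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nx, ny)]
    return float(np.mean(overlaps))
```

Because the score only looks at neighbor sets, it is insensitive to rotation and rescaling of either space, which is roughly what "same geometry, different coordinates" means.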

The Platonic Analogy

Plato's forms: there exist ideal, abstract representations of concepts that physical instances approximate. Sufficiently capable models may be approximating these forms — not because the models are doing philosophy, but because the data constrains the geometry of good representations.

Safety Implications

If true (optimistic): Models are converging on a shared, stable representation of reality. Interpretability tools that work on one model may transfer. Safety properties verified in representation space may generalize.

If true (pessimistic): All models converge to the same blind spots. Systematic biases in representation become universal and harder to detect.

If partially true: The interesting case. Models converge on some structures but diverge on others — and the divergences may be exactly where safety-relevant edge cases live.

Open Questions

  • Is convergence an artifact of similar training data, or a deeper phenomenon?
  • Do representations converge on human-interpretable concepts, or on alien features that happen to be useful?
  • How do we test this empirically beyond correlation of learned features?
  • Does convergence depend on the learning algorithm? See What Happens to a Neural Network's Geometry When You Change How It Learns? for empirical evidence that it does — different credit assignment mechanisms produce radically different internal geometry even on the same data.
  • Can we even measure convergence reliably? See CKA Says Your Models Are the Same. They Aren't. — CKA, the dominant metric for measuring representational similarity, can report 0.97 between functionally incompatible models.
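For reference, the metric in question is simple enough to state in a few lines. This is a sketch of linear CKA (centered kernel alignment) between two activation matrices; the hypothesis-testing setup around it is up to you:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (n, d1) and Y: (n, d2) — activations of two models on the same
    n inputs. Columns are mean-centered, then:
        CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    Result is in [0, 1]; invariant to orthogonal transforms and
    isotropic scaling of either representation.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

The invariances are the point of the critique above: CKA deliberately ignores the coordinate system, so two models can score near 1 while no cheap map between them preserves downstream behavior.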