Prior reading: Gradient Descent and Backpropagation | Why Sparsity?

What Is Probing?

Probing means training a simple classifier (usually linear) on a model's internal representations to test whether specific information is encoded there.

If a linear probe can extract "is this sentence toxic?" from layer 12 activations, the inference is that the model encodes toxicity, in a linearly accessible form, at that layer.

How It Works

  1. Freeze the model
  2. Extract activations at a chosen layer for a labeled dataset
  3. Train a linear (or shallow) classifier on those activations
  4. Measure accuracy

High accuracy → the information is linearly accessible in the representation.
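The four steps above can be sketched end-to-end with scikit-learn. The activations here are synthetic stand-ins (noise plus a random "toxicity" direction), since the freeze-and-extract steps depend on your model; in a real pipeline you would run the labeled dataset through the frozen model and cache activations at the chosen layer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Steps 1-2 (stand-in): 1000 cached "activations", 512-dim, with a
# hypothetical class direction mixed in so there is something to find.
n, d = 1000, 512
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)                 # hypothetical feature direction
acts = rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, direction) * 0.5

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)

# Step 3: a linear probe is just logistic regression on the activations --
# one weight vector and a bias, nothing deeper.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Step 4: held-out accuracy is the probing result.
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

High held-out accuracy here licenses only the linear-accessibility claim stated above, not a claim about what the model does with the feature.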

The Problem with Probing

A sufficiently powerful probe can decode almost anything — it might be creating the feature rather than finding it. This is why we insist on linear probes: they can only find information that's already linearly separable.

But even linear probes have issues:

  • High-dimensional spaces make many things linearly separable by chance
  • Probe accuracy doesn't mean the model uses that information
  • Absence of probe accuracy doesn't mean absence of information (maybe it's encoded non-linearly)
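The last bullet can be made concrete with a toy example that assumes nothing about any real model: a label encoded as the XOR of two coordinates is present in the representation but not linearly separable, so a linear probe sits at chance while a small MLP probe decodes it easily (which is also why powerful probes are untrustworthy as evidence).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Toy "activations": the label is the XOR of two noisy binary coordinates,
# i.e. encoded non-linearly.
n = 2000
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
labels = a ^ b
acts = np.column_stack([a, b]).astype(float) + rng.normal(scale=0.1, size=(n, 2))

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)

linear = LogisticRegression().fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)

print(f"linear probe: {linear.score(X_te, y_te):.2f}")  # near chance
print(f"MLP probe:    {mlp.score(X_te, y_te):.2f}")     # near perfect
```

The information is fully there; the linear probe's failure says only that it is not stored in a linearly readable way.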

What Probing Can and Can't Tell Us About Safety

Can: Whether a model represents safety-relevant concepts (deception, harm, user intent).

Can't: Whether the model acts on those representations, or whether it would under distribution shift.

Connection to Formal Methods

Probing gives us candidate features over which to verify properties. If we know where "harmful intent" lives in activation space, we can set up reachability bounds around that region.
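One hedged sketch of what "bounds around that region" could mean in practice: a trained linear probe with weights w and bias b defines a hyperplane, and the signed distance of an activation to that hyperplane, (w·x + b)/||w||, is a scalar a monitor could threshold. The weights, activation, and threshold below are purely illustrative, not from any real probe:

```python
import numpy as np

def signed_distance(x, w, b):
    # Signed distance of activation x to the probe's decision boundary.
    # Positive means x lies on the probe's "positive class" side.
    return (w @ x + b) / np.linalg.norm(w)

w = np.array([1.0, -2.0, 0.5])   # illustrative probe weights (hypothetical)
b = -0.25                        # illustrative probe bias
x = np.array([0.8, -0.3, 1.0])   # illustrative activation

d = signed_distance(x, w, b)
flag = d > 1.0   # only flag activations well inside the probed half-space
```

Whether such a region is meaningful inherits every caveat above: the probe direction may be decodable without being causally used by the model.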

When Probes Reveal What Similarity Metrics Miss

Probe transfer — training a probe on one model and testing it on another — can reveal functional incompatibilities that geometric similarity metrics like CKA are blind to. See CKA Says Your Models Are the Same. They Aren't. for a case where probes show $R^2 = -32$ between models that CKA calls 0.97 similar.