Probing: What Do Models Actually Know?

Prior reading: Gradient Descent and Backpropagation | Why Sparsity?

What Is Probing?

Train a simple classifier (usually linear) on a model's internal representations to test whether specific information is encoded there. If a linear probe can extract "is this sentence toxic?" from layer 12 activations, the model represents toxicity at that layer.

How It Works

1. Freeze the model
2. Extract activations at a chosen layer for a labeled dataset
3. Train a linear (or shallow) classifier on those activations
4. Measure accuracy

High accuracy → the information is linearly accessible in the representation. ...
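
The steps under How It Works can be sketched in a few lines of Python. This is a minimal illustration, not the post's own code: the activations are synthetic stand-ins (a real probe would read them from a frozen model at the chosen layer), the probe is a plain logistic regression trained with stochastic gradient descent, and the names make_activations, train_linear_probe, and accuracy are hypothetical helpers introduced only for this example.

```python
import math
import random


def make_activations(n, dim, rng):
    """Assumption: synthetic stand-in for frozen-model activations.

    The label shifts each vector along one fixed direction, so the
    information is linearly accessible by construction.
    """
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        X.append([sign * d + rng.gauss(0.0, 1.0) for d in direction])
        y.append(label)
    return X, y


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, epochs=50, lr=0.1):
    """Fit a linear classifier (logistic regression) with plain SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


def accuracy(w, b, X, y):
    """Fraction of examples the probe classifies correctly."""
    correct = 0
    for xi, yi in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
        correct += pred == yi
    return correct / len(y)


rng = random.Random(0)

# Steps 1-2: the model is "frozen"; we only see its activations and labels.
X, y = make_activations(400, 8, rng)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# Step 3: train the probe on the activations only.
w, b = train_linear_probe(X_train, y_train)

# Step 4: measure accuracy on held-out examples.
probe_accuracy = accuracy(w, b, X_test, y_test)
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

Accuracy is measured on a held-out split so that it reflects what the probe can read out of the representation rather than what it memorized from the training examples.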

June 18, 2025 · 2 min · Austin T. O'Quinn