Prior reading: Probing | Why Sparsity? | Platonic Forms

What Mechanistic Interpretability Is Trying to Do

Mechanistic interpretability (mech interp) aims to reverse-engineer neural networks into human-understandable components. Not "what features does this layer represent?" (that's probing — → see probing post) but "what algorithm does this network implement, and how?"

The analogy: probing tells you a chip has memory. Mech interp tells you it's a flip-flop built from NAND gates.

Why It Matters for Safety

If we can understand the mechanism by which a model produces an output, we can:

  • Verify that it's reasoning correctly, not just producing correct-looking outputs
  • Detect deceptive computation (e.g., a mesa-optimizer pursuing an unintended objective) by inspecting the algorithm, not just the behavior
  • Identify dangerous capabilities before they manifest in outputs
  • Build formal guarantees on model behavior grounded in mechanistic understanding

Without mechanistic understanding, safety is black-box testing — and we know testing is incomplete (→ see jailbreaking post).


Part I: Circuits

The Circuits Hypothesis

Neural networks implement interpretable algorithms composed of features (meaningful directions in activation space) connected by circuits (subgraphs of the network that implement specific computations).

This is a hypothesis, not a theorem. But growing evidence supports it.

What a Circuit Looks Like

A circuit is a computational subgraph: specific attention heads and MLP neurons that together implement a recognizable function.

Example — Induction Heads: A two-attention-head circuit that implements pattern completion:

  • Head 1 (previous token head): Attends to the token before the current one, creating a "this token was preceded by X" signal
  • Head 2 (induction head): Searches for previous instances of the current token, then copies what came after those instances

Together: if the model has seen "A B ... A", the induction circuit predicts "B". This is a mechanistic explanation of in-context learning — not a statistical correlation, but an identifiable algorithm.
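The algorithm itself is simple enough to write out directly. A toy Python version (illustrative only; the real circuit operates on attention patterns over token embeddings, not on symbols):

```python
def induction_predict(tokens):
    """Toy version of the induction-head algorithm: find the most recent
    earlier occurrence of the current token, then return the token that
    followed it (the induction head's "copy" step)."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards
        if tokens[i] == current:               # earlier occurrence found
            return tokens[i + 1]               # copy what came after it
    return None                                # no earlier occurrence

induction_predict(["A", "B", "C", "A"])   # → "B"
```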

How Circuits Are Found

Activation patching / causal tracing: Run the model on two inputs (clean and corrupted). Systematically patch activations from the clean run into the corrupted run, one component at a time. The components whose patching restores the correct output are part of the circuit.
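The logic of activation patching can be sketched on a toy two-layer network (NumPy; the weights and inputs are illustrative, not from any real interpretability library):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.uniform(0.1, 1.0, size=(4, 4))   # toy weights (all positive for simplicity)
W2 = rng.uniform(0.1, 1.0, size=(4, 4))

def forward(x, patch=None):
    """Two-layer toy model; optionally overwrite the hidden activations."""
    h = np.maximum(W1 @ x, 0.0)           # hidden layer (ReLU)
    if patch is not None:
        h = patch                         # splice in activations from another run
    return W2 @ h

clean = np.ones(4)
corrupted = np.zeros(4)                   # corrupted input wipes out the signal

clean_hidden = np.maximum(W1 @ clean, 0.0)   # cache the clean run's hidden state

# Patch the clean hidden state into the corrupted run. Because the output is
# restored, the hidden layer is causally implicated in the behavior.
patched = forward(corrupted, patch=clean_hidden)
print(np.allclose(patched, forward(clean)))   # True
```

In a real transformer the same splice is done per attention head or per MLP, typically via forward hooks, and the restoration is measured on a behavioral metric (e.g., logit difference) rather than exact output equality.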

Path patching: A finer-grained version that traces causal influence along specific edges (attention head → MLP → next head) rather than just nodes.

Automated circuit discovery: Tools like ACDC (Automatic Circuit DisCovery) automate the search by iteratively pruning components that don't affect the output.
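The pruning loop at the heart of this style of search can be sketched in a few lines (a hypothetical stand-in model, not the real ACDC implementation, which operates on a transformer's computational graph):

```python
import numpy as np

# Toy stand-in: a "model" whose output depends on only two of eight components.
weights = np.zeros(8)
weights[[2, 5]] = 1.0

def metric(mask):
    """Model's performance metric with components ablated where mask == 0."""
    return float(weights @ mask)

baseline = metric(np.ones(8))
circuit = []
for i in range(8):
    mask = np.ones(8)
    mask[i] = 0.0                              # ablate component i
    if abs(metric(mask) - baseline) > 1e-9:    # ablation changes output -> keep
        circuit.append(i)

print(circuit)   # [2, 5]: only the components that matter survive pruning
```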

What Circuits Have Been Found

  • Induction heads: In-context pattern completion (described above)
  • Indirect object identification: "When Mary and John went to the store, John gave a drink to ___" — a circuit identifies the indirect object
  • Greater-than: A circuit that computes whether one number is greater than another
  • Modular arithmetic: In small models trained on modular addition, circuits implement discrete Fourier transforms

Limitations

  • Circuits have been found primarily in small models or for narrow behaviors in large models
  • The full circuit for a complex behavior (like "answer a medical question correctly") may involve most of the network — in which case "the circuit" isn't a useful abstraction
  • Circuit-level understanding doesn't obviously scale: finding circuits in GPT-2 doesn't give you circuits in GPT-4

Part II: Superposition

The Problem

Neural networks represent more concepts than they have dimensions. A model with a 4096-dimensional residual stream might represent millions of distinct features. How?

What Superposition Is

Features are represented as directions in activation space, and multiple features share the same dimensions via sparse, overlapping encodings. Each feature is a direction, but directions can be non-orthogonal.

This works because features are sparse: most features are inactive for any given input. If feature A and feature B are rarely active simultaneously, they can share dimensions with little interference. The setup is analogous to compressed sensing, where sparse signals can be recovered from far fewer measurements than dimensions.

Why Superposition Makes Interpretability Hard

If each neuron corresponded to one feature, interpretability would be (comparatively) easy. But in superposition:

  • Individual neurons are polysemantic — they respond to multiple unrelated features
  • The "true" features are not axis-aligned — they're directions in high-dimensional space
  • You can't read off what the model represents by looking at individual neurons
  • The number of features may be much larger than the number of dimensions, so you can't enumerate them by examining the model's parameters directly

The Geometry

Consider features as vectors in $\mathbb{R}^d$. In an ideal world, $d$ features would be orthogonal (one per dimension). In superposition, $n \gg d$ features are packed into $d$ dimensions with near-orthogonality.

Random vectors in high-dimensional space are nearly orthogonal — $\cos(\theta) \approx 0$ for random vectors in $\mathbb{R}^{4096}$. So you can pack many features with low interference. The interference is small per feature, but it's nonzero — and it accumulates.
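The near-orthogonality claim is easy to check numerically (illustrative sketch; the dimension matches the 4096-dimensional residual stream mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4096, 500                 # dimension, number of random "feature" directions
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit vectors

cos = V @ V.T                                   # pairwise cosine similarities
off_diag = cos[~np.eye(n, dtype=bool)]          # drop the self-similarities

# Typical overlap is on the order of 1/sqrt(d) ≈ 0.016: small but nonzero.
print(float(np.abs(off_diag).mean()), float(np.abs(off_diag).max()))
```

Five hundred random directions in 4096 dimensions, and every pairwise overlap stays below roughly 0.1: that is the per-feature interference the text describes, small individually but accumulating as more features share the space.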

Superposition and Safety

Superposition means dangerous features may be entangled with benign ones. You can't cleanly "remove" a dangerous capability without disturbing safe capabilities that share dimensions. This is a structural barrier to targeted safety interventions at the representation level.


Part III: Sparse Autoencoders (SAEs)

The Idea

If features are sparse directions in superposition, we need a tool that decomposes a model's activations into those sparse components. Enter the sparse autoencoder.

An SAE learns to decompose an activation vector $x \in \mathbb{R}^d$ into a sparse code $z \in \mathbb{R}^m$, where $m \gg d$:

$$z = \text{ReLU}(W_{\text{enc}} x + b_{\text{enc}})$$

$$\hat{x} = W_{\text{dec}} z + b_{\text{dec}}$$

Train to minimize reconstruction error with a sparsity penalty:

$$\mathcal{L} = \|x - \hat{x}\|_2^2 + \lambda \|z\|_1$$

The L1 penalty enforces sparsity: most entries of $z$ are zero for any given input. The nonzero entries correspond to the active features.
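A minimal sketch of the architecture and loss above (NumPy, random weights; in practice the weights are trained, and it is training under the L1 penalty, not the architecture alone, that makes $z$ sparse):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 512                                  # activation dim, dictionary size (m >> d)
W_enc = rng.normal(scale=0.1, size=(m, d))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(d, m))
b_dec = np.zeros(d)

def sae(x, lam=1e-3):
    z = np.maximum(W_enc @ x + b_enc, 0.0)      # encoder: ReLU gives nonnegative codes
    x_hat = W_dec @ z + b_dec                   # decoder: linear reconstruction
    loss = np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(z))   # L2 recon + L1 sparsity
    return z, x_hat, loss

x = rng.normal(size=d)                          # stand-in for a residual-stream vector
z, x_hat, loss = sae(x)
```

After training, each row of $W_{\text{dec}}$ (each dictionary column) is interpreted as a feature direction, and the nonzero entries of $z$ say which features are active on this input.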

What SAEs Find

When trained on transformer activations, SAEs recover features that are:

  • Monosemantic: Each SAE feature responds to one interpretable concept (unlike polysemantic neurons)
  • Sparse: Only a few features active per input
  • Meaningful: Features correspond to recognizable concepts — "this text is about Python code," "this text discusses legal contracts," "this token is the end of a sentence"

Anthropic's work on Claude found millions of features via SAEs, organized into interpretable clusters.

What SAEs Don't Solve

  • Completeness: SAEs find some features but may miss others. There's no guarantee the SAE dictionary captures all safety-relevant features.
  • Faithfulness: The SAE's decomposition may not reflect the model's actual computation. The features could be artifacts of the SAE architecture rather than true features of the model.
  • Scale: Training SAEs on large models is expensive. The feature dictionaries are enormous. Analyzing millions of features is its own research problem.
  • Circuits: SAEs find features but not the circuits that connect them. Knowing that "deception" is a feature doesn't tell you what computation activates it.
  • Superposition isn't fully solved: SAEs assume a linear decomposition. If superposition is nonlinear, SAEs miss it.

Part IV: The Frontier and Open Problems

Connecting Features to Circuits

The current gap: SAEs give you features, circuit analysis gives you computations, but connecting the two — "this circuit uses these SAE features to implement this algorithm" — is unsolved at scale.

Feature Steering and Safety Applications

If SAE features include safety-relevant concepts, you can potentially:

  • Monitor: Track activation of dangerous features at runtime
  • Steer: Clamp or suppress specific features to change behavior (representation engineering / activation steering)
  • Verify: Prove that certain feature patterns can't arise for certain inputs (→ connects to reachability post)

But all of this depends on SAE features being faithful and complete — which isn't guaranteed.
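The steering operation itself is geometrically simple: clamp the activation's component along a feature direction. A sketch (the direction here is hypothetical; in practice it would come from an SAE decoder vector at a chosen layer):

```python
import numpy as np

def clamp_feature(activation, direction, target):
    """Set the activation's component along `direction` to `target`,
    leaving the orthogonal components untouched."""
    d = direction / np.linalg.norm(direction)
    current = activation @ d                  # current feature activation
    return activation + (target - current) * d

a = np.array([1.0, 2.0, 3.0])          # stand-in residual-stream activation
f = np.array([0.0, 0.0, 1.0])          # hypothetical feature direction
suppressed = clamp_feature(a, f, 0.0)  # suppress the feature entirely
print(suppressed @ f)                  # 0.0: the feature's activation is clamped
```

Note how superposition bites here: if the direction is entangled with benign features (non-orthogonal to them), clamping it perturbs those features too.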

The Scaling Question

Mech interp has produced detailed results on small models and narrow behaviors. The critical question: does it scale to frontier models doing complex reasoning? Possible answers:

  • Optimistic: The same techniques work but need more compute. Features and circuits are universal.
  • Pessimistic: Large model computation is fundamentally holistic — there are no clean circuits, and the "circuit" for any interesting behavior is most of the network.
  • Middle ground: Some behaviors have clean circuits (and those are verifiable), while others don't (and those need different safety approaches). The art is knowing which is which.

Mech Interp and Formal Methods

The connection to formal verification (→ see formal methods section) is direct:

  • Features give you the variables to write specifications about
  • Circuits give you the structure to verify
  • Reachability analysis can bound feature activations
  • Model checking can verify temporal properties of circuits

Mech interp without formal methods gives you understanding without guarantees. Formal methods without mech interp gives you guarantees about quantities you can't interpret. Together they might give you interpretable guarantees — which is the goal.


Where This Field Is (Honestly)

Mechanistic interpretability is among the most active frontiers in AI safety research. It has produced genuinely novel scientific understanding of how neural networks compute. But:

  • We can explain small models well and large models poorly
  • We can find features but struggle to find complete circuits at scale
  • SAEs are powerful but their faithfulness is debated
  • The gap between "we understand this circuit in a toy model" and "we can verify this property of a frontier model" is vast

The bet is that this gap closes as methods improve. The risk is that it doesn't — that large-model computation is intrinsically resistant to human-interpretable decomposition. In that case, safety can't rely on understanding and must rely on behavioral constraints and formal methods over opaque systems.