Mechanistic Interpretability: Circuits, Superposition, and Sparse Autoencoders
Prior reading: Probing | Why Sparsity? | Platonic Forms

What Mechanistic Interpretability Is Trying to Do

Mechanistic interpretability (mech interp) aims to reverse-engineer neural networks into human-understandable components. The question is not "what features does this layer represent?" (that's probing; → see probing post) but "what algorithm does this network implement, and how?" The analogy: probing tells you a chip has memory; mech interp tells you it's a flip-flop built from NAND gates.

Why It Matters for Safety

If we can understand the mechanism by which a model produces an output, we can: ...