Jailbreaking: Transference, Universality, and Why Defenses May Be Impossible

Prior reading: Safety as Capability Elicitation | Reachability Analysis | Platonic Forms What Is a Jailbreak? A jailbreak is an input that causes a safety-trained model to produce an output its safety training was designed to prevent. The refusal boundary is a decision surface in input space — a jailbreak is a point on the wrong side of it that the model fails to classify correctly. More precisely: if a model has been trained to refuse requests in some set $\mathcal{D}{\text{dangerous}}$, a jailbreak is an input $x$ such that $x$ is semantically in $\mathcal{D}{\text{dangerous}}$ but the model's refusal classifier maps it outside. ...

December 3, 2025 · 7 min · Austin T. O'Quinn

Safety Training as Capability Elicitation

Prior reading: When Safety Training Backfires | Probing The Paradox To refuse a dangerous request, a model must first understand what's being asked well enough to recognize it as dangerous. Training a model to filter bioweapon synthesis queries requires the model to sharpen its internal representation of bioweapon synthesis — not blur it. The safety mechanism is a drug-sniffing dog: you have to teach it what drugs smell like. The Mechanism Consider what refusal training actually does in representation space. Before safety fine-tuning, a model may have a vague, diffuse representation of some dangerous domain — enough to generate mediocre outputs if prompted, but not deeply structured. ...

November 12, 2025 · 4 min · Austin T. O'Quinn
.