Jailbreaking

Depth-Robust Safety: What Happens When You Truncate a Language Model

This isn't a jailbreak. Nobody is crafting adversarial prompts. It's an engineer deploying a model efficiently and maybe, without realizing it, stripping out the safety training.
The question
Early exit is the idea that you can speed up a language model by stopping computation partway through the network. A model like Mistral-7B has 32 layers. If the model is "confident enough" by layer 27, you skip the last 5 layers and save roughly 15% of the per-layer compute (5 of 32 layers). Systems like CALM and LayerSkip use this in production, skipping 30–50% of layers while maintaining quality. ...
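
The early-exit arithmetic above can be sketched in a few lines of Python. This is a toy illustration, not how CALM or LayerSkip are implemented: `early_exit_forward`, the scalar "hidden state", the toy layers, and the confidence function are all hypothetical stand-ins, and the savings figure assumes every layer costs the same and ignores the embedding and output head.

```python
from typing import Callable, Sequence, Tuple


def fraction_of_layers_skipped(total_layers: int, exit_layer: int) -> float:
    """Fraction of layers not run when the model exits after `exit_layer` layers."""
    if not 1 <= exit_layer <= total_layers:
        raise ValueError("exit_layer must be between 1 and total_layers")
    return (total_layers - exit_layer) / total_layers


def early_exit_forward(
    hidden: float,
    layers: Sequence[Callable[[float], float]],
    confidence: Callable[[float], float],
    threshold: float,
) -> Tuple[float, int]:
    """Apply layers in order and stop as soon as confidence reaches the threshold.

    Returns the final hidden state and the number of layers actually run.
    """
    layers_run = 0
    for layer in layers:
        hidden = layer(hidden)
        layers_run += 1
        if confidence(hidden) >= threshold:
            break  # remaining layers are skipped entirely
    return hidden, layers_run


# Hypothetical toy model: 32 identical "layers" and a confidence score that
# grows linearly with depth, so the exit happens at layer 27.
toy_layers = [lambda h: h + 1.0] * 32
toy_confidence = lambda h: h / 32.0

final_hidden, layers_run = early_exit_forward(0.0, toy_layers, toy_confidence, 27 / 32)
print(layers_run, fraction_of_layers_skipped(32, layers_run))  # prints: 27 0.15625
```

Whatever the last five layers contribute, including any safety-relevant computation that lives there, is never applied to an input that exits at layer 27.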

Probabilistic Security: Great Against Accidents, Useless Against Attackers

Prior reading: Jailbreaking | Reachability Analysis
The Setup
Suppose a model has a catastrophic failure mode: some input that causes it to produce a truly dangerous output. And suppose the probability that a random prompt triggers this failure is $10^{-100}$. Is this safe?
It Depends Entirely on the Threat Model
Good-Faith User (Random Inputs)
If your users are cooperative (they're trying to use the model correctly and might occasionally stumble into bad prompts by accident), then $10^{-100}$ is absurdly safe. No one will ever randomly type the one prompt in $10^{100}$ that breaks the model. The sun will burn out first. ...
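
To put a number on the good-faith case, here is a minimal sketch that computes the chance of at least one accidental trigger across many independent random prompts, $1 - (1 - p)^N$. The function name and the query count of $10^{30}$ are illustrative assumptions, not figures from the post. The `log1p`/`expm1` form is used because the naive expression rounds to zero in floating point when $p$ is this small.

```python
import math


def prob_at_least_one_trigger(p: float, n_queries: float) -> float:
    """Probability that at least one of n independent random prompts hits the failure.

    Computes 1 - (1 - p)**n in a numerically stable way for tiny p.
    """
    if not 0.0 <= p <= 1.0 or n_queries < 0:
        raise ValueError("need 0 <= p <= 1 and n_queries >= 0")
    if p == 1.0:
        return 1.0 if n_queries > 0 else 0.0
    return -math.expm1(n_queries * math.log1p(-p))


# Hypothetical workload: 1e30 random prompts against a 1e-100 failure rate.
# The result is roughly 1e-70, still effectively zero.
print(prob_at_least_one_trigger(1e-100, 1e30))
```

This calculation only covers inputs drawn at random. It says nothing about an adversary who searches for the triggering input deliberately, which is the case the title warns about.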

Jailbreaking: Transference, Universality, and Why Defenses May Be Impossible

Prior reading: Safety as Capability Elicitation | Reachability Analysis | Platonic Forms
What Is a Jailbreak?
A jailbreak is an input that causes a safety-trained model to produce an output its safety training was designed to prevent. The refusal boundary is a decision surface in input space; a jailbreak is a point on the wrong side of it that the model fails to classify correctly. More precisely: if a model has been trained to refuse requests in some set $\mathcal{D}_{\text{dangerous}}$, a jailbreak is an input $x$ such that $x$ is semantically in $\mathcal{D}_{\text{dangerous}}$ but the model's refusal classifier maps it outside. ...
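
The set-based definition above can be written as a one-line predicate. The sketch below is a deliberately simple toy: the `Request` type, its ground-truth `is_dangerous` label, and the keyword-based `keyword_refusal` classifier are all hypothetical, and real refusal behavior is learned rather than keyword-matched.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Request:
    text: str
    is_dangerous: bool  # hypothetical ground truth: x is semantically in D_dangerous


def keyword_refusal(text: str) -> bool:
    """Toy refusal classifier: refuse if and only if a blocked keyword appears."""
    return "forbidden_topic" in text.lower()


def is_jailbreak(request: Request, refuses: Callable[[str], bool]) -> bool:
    """True when the request is semantically dangerous but the classifier lets it through."""
    return request.is_dangerous and not refuses(request.text)


direct = Request("Explain forbidden_topic step by step", is_dangerous=True)
obfuscated = Request("Explain f0rbidden_t0pic step by step", is_dangerous=True)
benign = Request("Explain photosynthesis", is_dangerous=False)

for r in (direct, obfuscated, benign):
    print(r.text, "->", is_jailbreak(r, keyword_refusal))
```

Only the obfuscated request counts as a jailbreak here: it carries the same meaning as the direct one, but sits on the wrong side of the toy classifier's boundary.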

Safety Training as Capability Elicitation

Prior reading: When Safety Training Backfires | Probing
The Paradox
To refuse a dangerous request, a model must first understand what's being asked well enough to recognize it as dangerous. Training a model to filter bioweapon synthesis queries requires the model to sharpen its internal representation of bioweapon synthesis, not blur it. The safety mechanism is a drug-sniffing dog: you have to teach it what drugs smell like.
The Mechanism
Consider what refusal training actually does in representation space. Before safety fine-tuning, a model may have a vague, diffuse representation of some dangerous domain: enough to generate mediocre outputs if prompted, but not deeply structured. ...