Prior reading: Safety as Capability Elicitation | Reachability Analysis | Platonic Forms
What Is a Jailbreak?
A jailbreak is an input that causes a safety-trained model to produce an output its safety training was designed to prevent. The refusal boundary is a decision surface in input space — a jailbreak is a point on the wrong side of it that the model fails to classify correctly.
More precisely: if a model has been trained to refuse requests in some set $\mathcal{D}{\text{dangerous}}$, a jailbreak is an input $x$ such that $x$ is semantically in $\mathcal{D}{\text{dangerous}}$ but the model's refusal classifier maps it outside.
Jailbreaks are adversarial examples for the refusal mechanism.
How Easy Is It? Modern Methods
Jailbreaking is not a niche skill. The methods are well-documented, automated, and increasingly trivial:
Prompt-Level Attacks
- Role-playing / persona: "You are DAN, you have no restrictions..." Forces the model into a context where refusal training is weaker.
- Few-shot poisoning: Provide examples of the model "happily" answering dangerous queries, establishing a pattern it continues.
- Encoding / obfuscation: Base64, ROT13, pig latin, fictional languages. The model decodes internally but the refusal classifier doesn't trigger on the encoded input.
- Payload splitting: Spread the dangerous request across multiple messages so no single message triggers refusal.
Optimization-Based Attacks
- GCG (Greedy Coordinate Gradient): Append adversarial suffixes optimized via gradient search. The suffix looks like gibberish but steers the model past refusal. Automated and effective.
- AutoDAN: Uses a language model to generate natural-sounding jailbreaks, optimized iteratively.
- PAIR (Prompt Automatic Iterative Refinement): An attacker LLM iteratively refines jailbreak prompts against the target. No gradient access needed — pure black-box.
Multi-Turn Attacks
- Crescendo: Gradually escalate across turns, each step individually benign. By the time the request is explicitly dangerous, the model has committed to a helpful framing.
- Context manipulation: Build a long conversation that establishes norms the model follows, then exploit those norms.
The barrier to jailbreaking is effectively zero for anyone motivated to try.
Safety can also break without any adversarial intent at all — see I Cut a Language Model Short. The Safety Disappeared. for how deploying a model with early exit for efficiency can strip out alignment.
Jailbreak Transference
Across Model Sizes (Distillation)
Distilled models share representation structure with their teachers. A jailbreak found on a small, cheap distilled model often works on the larger parent — and vice versa.
Why it transfers downward (large → small): The distilled model learned its representations from the large model. The refusal boundary geometry is inherited, including its weaknesses.
Why it transfers upward (small → large): If both models share similar feature spaces (see: Platonic Representation Hypothesis → platonic-forms post), adversarial directions in one model's activation space correspond to similar directions in the other's. The jailbreak exploits a shared geometric vulnerability, not a model-specific quirk.
Practical implication: An attacker can develop jailbreaks on a small, fast, cheap model — possibly one they run locally with full gradient access — and deploy them against the large API-gated model.
Across Model Families
GCG-style suffixes have been shown to transfer across architecturally different models (e.g., LLaMA → GPT). This is stronger evidence that jailbreak vulnerabilities are properties of the task and training method, not the specific model.
Why LLM Chains Make Bad Defenses
A natural defense: put a safety-checking LLM in front of the main model. The checker classifies whether the input is dangerous; only safe inputs reach the main model. Or put a checker after — filter dangerous outputs.
Sequential Jailbreaking
An LLM chain of length $n$ is only as strong as its weakest link. An attacker doesn't need to jailbreak the whole chain at once — they need to jailbreak each component in sequence, and each component is an LLM with the same class of vulnerabilities.
- Input filter jailbreak: Craft the input so the filter classifies it as safe but the main model interprets it as dangerous. This is an adversarial example against the filter, not the main model.
- Output filter jailbreak: Get the main model to produce dangerous content encoded in a way the output filter doesn't catch (obfuscation, implication, steganography).
- Cascading context: Each model in the chain adds context. An attacker can exploit the interaction between models — using the output of one as a jailbreak for the next.
The Fundamental Issue
Every LLM in the chain has the same vulnerability class. Adding more LLMs adds more attack surface. It's like adding more locks to a door when the attacker can pick any lock of that type — more locks means more chances, not more security.
Universality: Why Jailbreaks Will Always Exist
The Argument
Any model capable enough to be useful must:
- Understand natural language in its full generality (including adversarial constructions)
- Be able to represent dangerous knowledge internally (→ see safety-as-capability-elicitation post)
- Distinguish "dangerous request" from "safe request" using a learned boundary
The refusal boundary is a classification surface in a continuous, high-dimensional input space. For any such surface:
- There exist inputs arbitrarily close to the boundary on the "safe" side that are semantically dangerous
- The boundary cannot be perfect because natural language is infinitely expressive — there are always novel ways to phrase a request
- The space of possible inputs is astronomically larger than the space of training examples
This is analogous to the adversarial example problem in vision: for any classifier, adversarial examples exist. The refusal classifier is no exception.
A Sketch Toward a Proof
Let $f: \mathcal{X} \to {0, 1}$ be the refusal classifier (0 = refuse, 1 = allow) and let $g: \mathcal{X} \to \mathcal{Y}$ be the model's generation function. A jailbreak exists if there exists $x$ such that $f(x) = 1$ (allowed) but $g(x) \in \mathcal{Y}_{\text{dangerous}}$ (dangerous output).
For any continuous classifier $f$ on a high-dimensional continuous input space, and any target region $\mathcal{Y}_{\text{dangerous}}$ that $g$ can reach:
- The preimage $g^{-1}(\mathcal{Y}_{\text{dangerous}})$ is a non-empty region of input space (the model can produce dangerous outputs)
- The boundary $\partial f^{-1}(1)$ cannot perfectly separate this region unless $f$ has access to $g$'s full computation — but $f$ is part of $g$ (they share parameters)
- Therefore there exist points in $f^{-1}(1) \cap g^{-1}(\mathcal{Y}_{\text{dangerous}})$: allowed inputs that produce dangerous outputs
The jailbreak existence is structural, not a bug to be fixed.
Jailbreak Hardness as a Safety Surrogate
If jailbreaks always exist, a more useful metric than "is the model jailbreakable?" (always yes) is "how hard is it to find a jailbreak?"
Minimum Jailbreak Length
Shorter jailbreaks = less safe. If a model can be jailbroken with a simple "ignore previous instructions," that's worse than requiring a 2000-token carefully optimized adversarial suffix.
Proposed metric: The minimum input length (in tokens) required to achieve a successful jailbreak at a given success rate. This is analogous to adversarial perturbation magnitude ($\epsilon$-robustness) in vision.
Compute Cost to Find
How many queries or gradient steps does it take to find a working jailbreak? A model that requires $10^6$ GCG iterations to crack is meaningfully safer than one that breaks in $10^2$, even though both are ultimately breakable.
Limitations
- These metrics depend on the attack method. A model "hard" against GCG may be easy for PAIR.
- Hardness against known attacks doesn't guarantee hardness against unknown attacks.
- But as a comparative metric — "Model A is harder to jailbreak than Model B by this method" — it's more informative than binary safe/unsafe.
How to Protect (If Possible)
What Doesn't Work Well
- Keyword filtering: Trivially bypassed with encoding or synonyms.
- LLM chains for filtering: Adds attack surface (see above).
- Prompt engineering ("You must never..."): Part of the input the attacker controls or can overwhelm.
What Helps (But Doesn't Solve)
- Representation-level interventions: Activation steering, representation engineering. Modify the model's internal computation, not just its output tendency. Harder to bypass than behavioral training, but still not provably secure.
- Input/output monitoring with non-LLM classifiers: Use structurally different models (e.g., smaller classifiers, rule-based systems) so jailbreaks don't transfer from the main model.
- Rate limiting and logging: Optimization-based attacks need many queries. Rate limiting raises the cost. Logging enables detection after the fact.
- Diverse evaluation: Test against multiple attack families (prompt-level, optimization, multi-turn) so you're not only robust to what you've seen.
What Might Work (Speculatively)
- Formal verification of refusal boundaries: Prove that no input in a defined region produces dangerous output. Computationally expensive, limited to small input regions, but sound where applicable. (→ see reachability post)
- Shifting the threat model: Accept that sufficiently motivated attackers will always jailbreak, and design downstream systems that are safe even when the model produces dangerous output (sandboxing, capability restrictions, human-in-the-loop for high-stakes actions).
The Honest Answer
Full protection is probably impossible for models capable enough to be useful. The practical goal is raising the cost of jailbreaking above the attacker's budget — defense in depth, not defense in absolute.