Llm-Chains

Prior reading: Safety as Capability Elicitation | Reachability Analysis | Platonic Forms What Is a Jailbreak? A jailbreak is an input that causes a safety-trained model to produce an output its safety training was designed to prevent. The refusal boundary is a decision surface in input space — a jailbreak is a point on the wrong side of it that the model fails to classify correctly. More precisely: if a model has been trained to refuse requests in some set $\mathcal{D}{\text{dangerous}}$, a jailbreak is an input $x$ such that $x$ is semantically in $\mathcal{D}{\text{dangerous}}$ but the model's refusal classifier maps it outside. ...