Probabilistic Security: Great Against Accidents, Useless Against Attackers
Prior reading: Jailbreaking | Reachability Analysis

The Setup

Suppose a model has a catastrophic failure mode: some input that causes it to produce a truly dangerous output. And suppose the probability that a random prompt triggers this failure is $10^{-100}$.

Is this safe?

It Depends Entirely on the Threat Model

Good-Faith User (Random Inputs)

If your users are cooperative (they're trying to use the model correctly and might occasionally stumble into bad prompts by accident), then $10^{-100}$ is absurdly safe. No one will ever randomly type the one prompt in $10^{100}$ that breaks the model. The sun will burn out first.

...
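To put a number on the good-faith case, here is a minimal sketch of the arithmetic. It assumes prompts are independent random draws, each triggering the failure with probability $10^{-100}$. The query count `n_queries` and the helper name `prob_at_least_one_failure` are hypothetical, chosen only for illustration; the query count is deliberately far larger than any realistic usage.

```python
import math


def prob_at_least_one_failure(p: float, n_queries: float) -> float:
    """Probability that at least one of n independent random prompts hits the failure."""
    # This is 1 - (1 - p)^n, computed via log1p/expm1 so that a tiny p
    # does not get rounded away to zero in floating point.
    return -math.expm1(n_queries * math.log1p(-p))


p = 1e-100        # per-prompt failure probability from the text
n_queries = 1e30  # hypothetical, deliberately generous number of random prompts

risk = prob_at_least_one_failure(p, n_queries)
print(f"chance of any accidental failure: {risk:.3e}")
```

Even with this inflated query count, the chance of a single accidental failure stays around $10^{-70}$. The calculation only holds under the random-input assumption; it says nothing about a user who deliberately searches for the failing prompt.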