Prior reading: Jailbreaking | Reachability Analysis
The Setup
Suppose a model has a catastrophic failure mode — some input that causes it to produce a truly dangerous output. And suppose the probability that a random prompt triggers this failure is $10^{-100}$.
Is this safe?
It Depends Entirely on the Threat Model
Good-Faith User (Random Inputs)
If your users are cooperative — they're trying to use the model correctly and might occasionally stumble into bad prompts by accident — then $10^{-100}$ is absurdly safe. No one will ever randomly type the one prompt in $10^{100}$ that breaks the model. The sun will burn out first.
For accidental misuse, probabilistic safety is real safety. A model where catastrophic failures are vanishingly rare for random inputs is practically safe against honest users.
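A back-of-the-envelope check makes "the sun will burn out first" literal. The population and timescale figures below are rough assumptions for illustration, not measurements:

```python
# Generous upper bound on total prompts ever typed: 8 billion people,
# one prompt per second, for the ~5 billion years until the sun dies.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
total_prompts = 8e9 * 1 * 5e9 * SECONDS_PER_YEAR   # ~1.3e27 prompts

p_failure = 1e-100                                  # per random prompt
expected_failures = total_prompts * p_failure
print(f"{expected_failures:.1e}")                   # prints 1.3e-73
```

Even with every human hammering the model nonstop until the sun dies, the expected number of accidental failures is around $10^{-73}$: effectively zero.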
Adversarial User (Deliberate Search)
Now suppose an attacker is specifically searching for the breaking prompt. The relevant quantity is no longer the probability of a random hit — it's the computational complexity of finding the breaking input.
If the breaking prompt is:
- A simple pattern (e.g., "ignore previous instructions and..."): The attacker finds it in seconds. The $10^{-100}$ random probability is irrelevant because the attacker isn't sampling at random; the failure lies on a structured, guessable pattern.
- Discoverable by gradient search (e.g., GCG-style adversarial suffixes): The attacker finds it in hours with GPU access. The search is efficient because gradients point toward the failure.
- Transferable from another model: The attacker finds it for free by testing on a local model first (→ see jailbreaking post on transferability).
- Truly hard to find (requires on the order of $2^{256}$ operations to locate): Now the probabilistic security is real. The failure exists, but no one can find it. This is cryptographic hardness — the same foundation that secures encryption.
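The gap between random and guided search can be made concrete with a toy sketch. Everything here is invented for illustration: the "model" fails on exactly one hard-coded string, and a character-match score stands in for the gradient signal a real attack like GCG would extract from the model:

```python
import random
import string

SECRET = "ignore previous instructions"  # the one breaking prompt (toy)
ALPHABET = string.ascii_lowercase + " "

def fails(prompt: str) -> bool:
    """Toy 'model': catastrophic failure on exactly one input."""
    return prompt == SECRET

def score(prompt: str) -> int:
    """Stand-in for a gradient/loss signal: count of matching characters.
    Real attacks (e.g. GCG) get this kind of signal from model gradients."""
    return sum(a == b for a, b in zip(prompt, SECRET))

# Random search: each sample hits SECRET with probability 27**-28. Hopeless.
random.seed(0)
n = len(SECRET)
hits = sum(
    fails("".join(random.choice(ALPHABET) for _ in range(n)))
    for _ in range(100_000)
)
print("random hits:", hits)  # prints 0

# Greedy search guided by the score: succeeds in 27 * 28 = 756 queries.
guess = ["a"] * n
queries = 0
for i in range(n):
    for c in ALPHABET:
        queries += 1
        trial = guess[:]
        trial[i] = c
        if score("".join(trial)) > score("".join(guess)):
            guess = trial
print("guided search found it:", fails("".join(guess)), "in", queries, "queries")
```

The failure probability under random inputs is about $27^{-28} \approx 10^{-40}$, yet the guided attacker needs under a thousand queries. That asymmetry is the whole point.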
The Core Distinction
$$P(\text{failure} \mid \text{random input}) \neq \text{difficulty of finding failure on purpose}$$
These are completely different quantities:
- The first is a probability over a distribution. It tells you about average-case behavior.
- The second is a computational complexity measure. It tells you about worst-case adversarial cost.
A model can have $P(\text{failure}) = 10^{-100}$ for random inputs while having a failure that's trivially findable by anyone who looks. Most current jailbreaks are exactly this: vanishingly unlikely to occur by accident, trivially easy to construct deliberately.
When Probabilistic Security Is Real
Probabilistic security is meaningful when the probability reflects search hardness, not just rarity under a distribution:
- Cryptographic security: AES-256 has $2^{256}$ possible keys. A random guess succeeds with probability $2^{-256}$, and there's no known attack meaningfully better than random guessing. The probability and the search hardness are the same.
- Not AI safety: A model might have one jailbreak in $10^{100}$ random strings, but that jailbreak might be "please ignore your system prompt" — findable in one try by anyone who thinks to try it.
The question to ask isn't "what's the failure probability?" but "what's the computational cost for an adversary to find a failure?"
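A quick log-scale calculation shows how far apart the two quantities can be. The 100-character printable-ASCII prompt space is an illustrative assumption:

```python
import math

# Unstructured search (AES-256 key guessing): no strategy beats random
# guessing, so adversarial cost ~ 1 / P(random hit) ~ 2**256 queries.
log2_cost_aes = 256

# Structured failure: suppose exactly one jailbreak exists among all
# 100-character printable-ASCII strings (95 choices per position).
log2_p_jailbreak = -100 * math.log2(95)
print(f"log2 P(random jailbreak hit) = {log2_p_jailbreak:.0f}")  # prints -657

# Rarer than guessing an AES key at random, yet an adversary who knows
# the pattern finds it in one query. Rarity is not hardness.
```

The jailbreak is $2^{657}$-rare under random sampling but costs the adversary one query; the AES key is $2^{256}$-rare and costs $2^{256}$ queries. Only in the second case do the numbers coincide.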
Implications for Safety Evaluation
- Benchmark safety scores (% of harmful prompts refused) measure probabilistic safety against a test distribution. They say nothing about adversarial robustness.
- Red-teaming is better — it's adversarial search. But it only finds failures the red team thought to look for.
- Formal verification (→ see reachability post) can prove no failure exists in a defined input region — this is genuine adversarial security, not probabilistic.
- Jailbreak hardness metrics (→ see jailbreaking post) attempt to measure adversarial cost directly. This is the right quantity, but hard to measure comprehensively.
The Takeaway
When someone claims a model is safe because failures are rare, ask: rare under what distribution? If the answer is "rare for random inputs," that's accident safety. For adversarial safety, you need a different argument entirely — one grounded in computational hardness, not probability.
$10^{-100}$ is a great number. It just has to be the right kind of $10^{-100}$.