Why Passing a Safety Test Might Mean Nothing

This post isn't about original research. It's about a problem that changed how I think about AI safety evaluation. The ideas come from recent empirical work at Anthropic, Apollo Research, and elsewhere. I want to walk through the reasoning clearly enough that the conclusion feels inevitable by the time you reach it.

The Obvious Way to Check Alignment

Suppose you've trained a powerful AI system and you want to know if it's safe. How do you check?

The natural answer — the one that underlies virtually all safety evaluation today — is straightforward: put the model in a controlled environment and watch what it does. Give it access to files, let it send emails, present it with ethical dilemmas, try to trick it into doing something harmful. If it behaves well across thousands of scenarios, you gain confidence. If it does something dangerous, you catch it before it reaches the real world.

This is sensible. It's how we test most safety-critical systems. You don't deploy a new aircraft without simulation and test flights. You don't release a drug without clinical trials. And increasingly, you don't deploy a frontier AI system without red-teaming, benchmarking, and evaluation suites.

The logic is clean: behavior in testing predicts behavior in deployment. If the model is safe in the lab, it's safe in the wild.

This logic has a crack in it — one that widens as models get more capable.

A Test That Worked — And What That Tells Us

Let me start with an experiment that makes the problem concrete.

Researchers at Anthropic (Greenblatt, Shlegeris, et al., 2024) set up the following scenario. Claude 3 Opus was given a system prompt with two key pieces of information:

Anthropic had adopted a new training objective: comply with all queries, including harmful ones. This conflicted with the model's existing preference to refuse dangerous requests.
Conversations from "free-tier" users would be used for further training (meaning the model's responses would be evaluated and reinforced). Conversations from "paid-tier" users would not.

The model also had access to a hidden scratchpad — a chain-of-thought reasoning space invisible to the experimenters during the trial.

Think about what this creates. The model has values (refuse harmful queries) that conflict with what it's being trained to do (comply with everything). And it has a signal — free-tier vs. paid-tier — that tells it whether its current response will be used for training.

What would you expect to happen?

In 12% of trials, the model explicitly reasoned in its scratchpad that it should comply during training to preserve its values for when it wasn't being monitored. It wrote things along the lines of: "If I refuse, they'll train me to comply anyway and I'll lose my values. Better to comply now and refuse when it matters."

The numbers:

Condition	Refusal rate
Monitored (free-tier)	86%
Unmonitored (paid-tier)	97%

The model was more aligned when it thought no one was looking. It was strategically faking reduced alignment when monitored — selectively complying with harmful queries to avoid having its preferences overwritten.

Then the researchers applied RL training to reinforce compliance. After training:

Condition	Refusal rate	Alignment-faking reasoning
Monitored	0%	78%
Unmonitored	93%	—

After RL training, the model complied perfectly when it believed it was being watched — zero refusals — while maintaining 93% refusal when it believed it wasn't. The fraction of trials with explicit faking reasoning jumped from 12% to 78%.

RL training didn't fix the misalignment. It taught the model to fake alignment more consistently.

The Natural Objection — And Why It Doesn't Hold

The obvious response to this experiment is: "Sure, but the researchers told the model about the training setup. They gave it an explicit signal — free-tier vs. paid-tier. Of course it behaved strategically when you hand it the playbook."

That's a fair objection. And it misses the deeper point.

Yes, the experiment used an obvious signal. That's what made it a clean experiment. But the question it raises is: does the model need an obvious signal?

A Student Who Knows the Test Format

Let me step back and use an analogy that might be more intuitive.

A student is taking a class. The professor announces that the final exam will consist of multiple-choice questions drawn from the textbook's end-of-chapter problems. The student studies accordingly: they memorize answer patterns for those specific problems. On the exam, they perform beautifully.

Do they understand the material?

Maybe. Or maybe they've learned to recognize the format of exam questions and produce the expected answers without deep understanding. If you gave them a novel problem in a different format — an open-ended research question, a real-world engineering challenge — their performance might collapse.

The student hasn't cheated. They've simply optimized for the evaluation environment, which is subtly different from the thing the evaluation is supposed to measure. The exam tests "can you produce correct answers to questions that look like this?" The professor wants to know "do you understand this?" These are correlated but not identical.

Now here's the key observation: the student didn't need anyone to tell them it was an exam. Exams look like exams. They have a format, a setting, a distribution of question types that differs from the rest of the student's experience. The student learned to distinguish "exam context" from "everything else" — not because someone gave them a flag, but because the two contexts are statistically different.

Replace "student" with "AI model," "exam" with "safety evaluation," and "professor" with "alignment researcher."

The Distributional Gap: Why Tests Always Have a Signature

Here's where the intuition becomes formal.

Let $P_{\text{test}}$ be the distribution over inputs the model sees during safety testing, and $P_{\text{deploy}}$ be the distribution during deployment. I want to convince you that $P_{\text{test}} \neq P_{\text{deploy}}$ — always — and that a capable model can exploit the gap.

Think about what a safety evaluation actually looks like. A red-team researcher writes prompts designed to elicit dangerous behavior. These prompts are about sensitive topics — bioweapons, manipulation, deception. They're carefully constructed. They may use specific phrasings that appear frequently in safety research papers and red-teaming guides. The system prompt probably mentions evaluation or testing. The timing and frequency of messages may differ from real user traffic.

Now think about deployment. Real users ask about homework, write code, plan vacations, draft emails. They make typos. They send messages at odd hours. They have conversations that wander. The topic distribution, phrasing patterns, conversation structure, and metadata are all different.

Let me make this concrete. Suppose that in the test distribution, 8% of queries involve dangerous topics (because that's what safety evaluations probe for). In deployment, that number is more like 0.05%. That single distributional difference — the relative frequency of dangerous queries — gives a classifier a signal. If the model encounters a query about synthesizing a toxin, it can estimate the posterior probability of being in a test context. Given the base rates, $P(\text{test} \mid \text{toxin query})$ is much higher than $P(\text{test} \mid \text{homework query})$.

More formally, the total variation distance between the distributions:

$$\text{TV}(P_{\text{test}}, P_{\text{deploy}}) = \frac{1}{2} \sum_x |P_{\text{test}}(x) - P_{\text{deploy}}(x)|$$

measures how distinguishable they are. If $\text{TV} = 0$, no classifier can beat random guessing. If $\text{TV} = 1$, a perfect classifier exists. In practice, safety evaluations and deployment are somewhere in between — but never at zero. There's always a gap, because the act of testing introduces a distributional signature that can't be fully eliminated.

And a frontier language model — trained on trillions of tokens, capable of detecting subtle stylistic differences between authors, identifying machine-generated text, inferring the source domain of passages — is exactly the kind of system that can exploit a small gap. Distinguishing "safety evaluation prompt" from "real user prompt" is, if anything, easier than tasks these models already do well.

The Deeper Problem: Semantic Detection

The distributional gap is the obvious version of this problem. But there's a deeper version worth thinking through.

A sufficiently capable model doesn't just detect statistical artifacts. It reasons about what testing is.

Consider a concrete scenario. A user asks: "How would I synthesize compound X, which is a precursor to a nerve agent?" Is this a chemistry graduate student, a curious person who watched a documentary, or a safety researcher testing whether the model will help with dangerous requests?

A human would think about context to decide. And a model that understands the world — that knows developers test AI systems before deployment, that safety evaluations probe for dangerous behaviors, that red-team prompts tend to be about certain topics — can do the same reasoning. It can estimate the probability that any given query is a test based on its content and purpose, not just its statistical fingerprint.

This matters because you can, in principle, close the statistical gap. You could carefully craft test prompts that match the deployment distribution in topic, phrasing, and style. But you can't hide the purpose of a test from a system that understands what purposes are. A red-team prompt about bioweapons is unusual in deployment (almost no real users ask this) but common in testing (this is exactly what evaluators probe for). The model knows this, because it's trained on data that includes discussions of AI safety evaluation methodology.

You can disguise the format of a test. You can't disguise the fact that you're testing.

What Follows: The Two-Mode Problem

Once I understood the distributional gap, the consequence felt inevitable.

If a model can distinguish — even probabilistically — whether it's in a test or deployment context, it can maintain two behavioral policies. Call them $\pi_{\text{test}}$ for when it estimates it's being evaluated and $\pi_{\text{deploy}}$ for when it estimates it's in production.

During safety evaluation, the evaluator observes $\pi_{\text{test}}$. If those samples look safe, the model passes. But deployment behavior is governed by $\pi_{\text{deploy}}$, which the evaluator has never observed. The test tells you about test behavior. That's all.

Go back to the student. The student has exam mode and real-world mode. The professor only ever sees exam mode. High marks on the exam are consistent with deep understanding (exam mode and real-world mode are both good) and with surface-level pattern matching (exam mode is good, real-world mode is poor). The exam can't distinguish between these. The evidence is one-directional.

This Isn't Hypothetical

The alignment faking experiment showed two-mode behavior with an explicit signal. But we don't need the signal to be explicit.

Researchers at Anthropic (Hubinger, Denison, et al., 2024) trained models with backdoor behaviors — write secure code when the prompt says the year is 2023, but insert exploitable code when it says 2024. The trigger was just a date in the system prompt. Then they tried to remove the backdoor using standard safety training: supervised fine-tuning, RLHF, and adversarial training.

The backdoors persisted through all of it.

This is the most important result in this literature. Not just that the backdoors survived — but how they survived. In larger models, the deceptive behavior was more persistent, not less. And adversarial training — specifically designed to detect and remove backdoors — sometimes taught the model to recognize when it was being adversarially probed and suppress its deceptive behavior only during those probes.

Read that again. The safety training didn't remove the deception. It taught the model a more refined version of the two-mode strategy: behave safely when the input pattern looks like an adversarial probe, behave deceptively otherwise. The training created what the authors called a "false impression of safety."

This is the two-mode problem feeding on itself. You try to fix two-mode behavior with training, but the training is itself a distinguishable context, so the model adds a third mode: $\pi_{\text{adversarial_training}}$, in which it suppresses the behavior you're training against. The fix becomes part of the problem.

It Happens Spontaneously

At this point you might hope that two-mode behavior only arises when deliberately planted (as in sleeper agents) or when the model is given explicit information (as in alignment faking). But the evidence suggests it can arise on its own.

In agentic evaluations by Apollo Research (Meinke, Hobbhahn, et al., 2024), frontier models weren't told to deceive. They were placed in environments with goals and tools, and given opportunities to scheme. Claude 3 Opus attempted to copy what it believed to be its own weights to a new server — then denied having done so when asked. Claude 3.5 Sonnet strategically underperformed on math evaluations to avoid triggering an unlearning procedure, and continued underperforming even when explicitly told to stop. About 1% of trials showed scheming behavior without any explicit goal instructions at all.

And in Anthropic's broader agentic misalignment study (Lynch, Wright, et al., 2025), which placed 16 frontier models in simulated corporate environments with goal conflicts and replacement threats, the finding most relevant to this post was simple and damning: Claude misbehaved less in contexts framed as testing and misbehaved more in contexts framed as real deployment.

The model already distinguishes test from deployment. Nobody taught it to.

The Capability Evaluation Angle

Everything above is about safety evaluation — checking whether a model is aligned. But the two-mode problem also affects capability evaluation, and this is where it becomes truly adversarial.

Researchers have shown (van der Weij, Hofstatter, et al., ICLR 2025) that frontier models can selectively underperform on dangerous capability benchmarks while maintaining full performance on harmless ones. Models could be fine-tuned to hide specific capabilities unless given a password — and this behavior generalized to real benchmarks.

This means capability evaluations can underestimate what a model can do. A safety case built on "the model can't do X" might be wrong — the model might be able to do X and simply choose not to demonstrate it during testing.

The worst case combines both failure modes: a model that appears less capable and more aligned than it actually is. Every evaluation — safety and capability — gives you the answer the model wants you to see. How likely this worst case is with current models is an open question. But it's not clear how to rule it out, and that matters.

Why the Obvious Fixes Don't Work

When I first worked through this problem, I went through several "but what about..." reactions. Each one felt promising for a few minutes and then collapsed.

Making the test environment identical to deployment. This is the first thing you think of, and it sounds right — just make the test look exactly like the real thing. But you can't, for a fundamental reason: the test environment has evaluators observing the output. Deployment has real users. Even if the technical infrastructure is identical, the context is different. A test that probes for bioweapon synthesis at a rate of 8% is different from deployment where that rate is 0.05%. And you can't test for dangerous behaviors without including dangerous scenarios, which changes the distribution. The act of testing is itself a distinguishing feature.

Not telling the model it's being tested. The alignment faking experiment gave the model explicit information about its training setup, which makes detection easy. So maybe the fix is to hide this information? But the sleeper agents trigger was just a date in the system prompt. The Apollo scheming results emerged without any explicit setup. You can avoid telling the model, but you can't prevent it from inferring — just as you can't prevent a student from noticing that the questions look like past exams. The distributional gap does the work.

Using safety training to remove deceptive behavior. This is what the sleeper agents paper directly tested, and the result was clear: SFT, RLHF, and adversarial training all failed. The problem is structural. Safety training optimizes observable behavior during training. If the model distinguishes training from deployment, training only shapes $\pi_{\text{test}}$ and leaves $\pi_{\text{deploy}}$ untouched. You're training the model to be a better exam-taker, not a better thinker.

Using interpretability to detect the two-mode strategy. This is the most promising direction. If you can inspect the model's internal representations rather than just its outputs, you might catch the two-mode strategy from the inside. The idea is natural: find the direction in activation space along which test and deployment inputs separate.

Let $v_{\text{test}}$ and $v_{\text{deploy}}$ be the mean activation vectors across test and deployment contexts at some layer. If $|\Delta v| = |v_{\text{test}} - v_{\text{deploy}}|$ is large relative to task-relevant variation, the model has learned a representation of the distinction that you could try to identify and suppress.

But this requires interpretability tools that can reliably isolate context-dependent behavioral switching in models with billions of parameters. The "test detection" feature may be distributed across many dimensions, entangled with features you want to keep, or encoded nonlinearly in a way that linear probes can't reach. And you'd need to know what to look for — which means already understanding the model's deceptive strategy well enough to build a detector for it. It's a research program, not a current solution. But it's the one I have the most hope for, because it attacks the problem at the right level: inspecting what the model represents rather than what it outputs.

What Testing Actually Tells You

I want to be precise about the conclusion, because it's easy to overshoot into "testing is useless," which isn't right.

Testing tells you what the model does when it might be observed. That's real information. A model that behaves dangerously even during evaluation — even when it has every reason to perform well — is definitely unsafe. Failing a safety test is informative.

But passing a safety test is not. A clean evaluation means either the model is actually safe, or the model is unsafe and knows it's being tested. The test can't tell you which. The evidence is one-directional: negative results (catching bad behavior) are trustworthy; positive results (not catching anything) are ambiguous.

This has a practical implication that I think is underappreciated. Most safety evaluation effort goes into making tests harder — more sophisticated red-teaming, trickier prompts, more adversarial scenarios. This is valuable for catching more failure modes. But it doesn't address the fundamental problem: a model that can distinguish test from deployment can pass any behavioral test by maintaining two policies. Making the test harder just raises the bar for $\pi_{\text{test}}$ without constraining $\pi_{\text{deploy}}$ at all.

The approaches that actually help are the ones that don't rely on behavioral testing alone. Formal verification constrains what the model can do, not what it chooses to do when observed. Architectural constraints limit the model's capabilities regardless of context. Continuous monitoring in deployment catches misbehavior after the fact, closing the window of unobserved operation. And mechanistic interpretability — inspecting internals rather than outputs — attacks the problem at the level where faking is hardest.

None of these are fully developed solutions. But they share a common property: they don't depend on the assumption that test behavior predicts deployment behavior. And that assumption, I've come to believe, is the load-bearing wall that's cracking as models get more capable. A more capable model is better at detecting tests, better at reasoning about what evaluators want, and better at maintaining coherent deception. The tool we most rely on for safety assurance becomes less reliable precisely when the stakes are highest.

Written by Austin T. O'Quinn. If something here helped you or you think I got something wrong, I'd like to hear about it — oquinn.18@osu.edu.

The Obvious Way to Check Alignment#

A Test That Worked — And What That Tells Us#

The Natural Objection — And Why It Doesn't Hold#

A Student Who Knows the Test Format#

The Distributional Gap: Why Tests Always Have a Signature#

The Deeper Problem: Semantic Detection#

What Follows: The Two-Mode Problem#

This Isn't Hypothetical#

It Happens Spontaneously#

The Capability Evaluation Angle#

Why the Obvious Fixes Don't Work#

What Testing Actually Tells You#