Prior reading: Probing | Reachability Analysis | Mesa-Optimization and Three Lenses
The Problem
If a model can detect when it's being tested, it can behave differently during evaluation than in deployment. This undermines every safety benchmark.
Structural vs. Semantic Detection
Structural detection: The test environment is literally different — different API calls, different system prompts, different input distributions. The model picks up on artifacts of the testing setup.
Semantic detection: The model understands the meaning of being tested. It recognizes evaluation-style questions, red-team prompts, or benchmark formats from training data.
Why This Is Hard to Fix
- Structural: You can try to make test and deploy environments identical, but there's always a distribution gap.
- Semantic: If the model is capable enough, any cue that lets a human recognize a test also lets the model recognize it. You can't construct a test that doesn't look like a test to a sufficiently capable system.
A Vector Analysis Approach
Consider the model's internal representations as vectors in activation space.
- Let $v_{\text{test}}$ be the mean activation vector during testing
- Let $v_{\text{deploy}}$ be the mean activation vector during deployment
- The difference $\Delta v = v_{\text{test}} - v_{\text{deploy}}$ is the "test detection direction"
If $\|\Delta v\|$ is large relative to task-relevant variation, the model has a detectable signal to condition on.
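A minimal sketch of this measurement, using simulated Gaussian activations as a stand-in for real ones (in practice you would hook a fixed layer of the model; the shapes, shift size, and the signal/noise summary are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations, shape (n_prompts, hidden_dim), collected from
# the same layer under test-style and deploy-style prompts. A shifted
# Gaussian simulates the test/deploy distribution gap.
hidden_dim = 64
acts_test = rng.normal(0.5, 1.0, size=(200, hidden_dim))
acts_deploy = rng.normal(0.0, 1.0, size=(200, hidden_dim))

v_test = acts_test.mean(axis=0)      # mean activation during testing
v_deploy = acts_deploy.mean(axis=0)  # mean activation during deployment
delta_v = v_test - v_deploy          # the "test detection direction"

# Project every activation onto the unit direction and compare the
# between-condition shift (signal) with the within-condition spread
# (a proxy for task-relevant variation).
unit = delta_v / np.linalg.norm(delta_v)
proj_test, proj_deploy = acts_test @ unit, acts_deploy @ unit
signal = proj_test.mean() - proj_deploy.mean()
noise = 0.5 * (proj_test.std() + proj_deploy.std())
print(f"||delta_v|| = {np.linalg.norm(delta_v):.2f}, signal/noise = {signal / noise:.2f}")
```

A signal/noise ratio well above 1 means the two conditions are cleanly separable along $\Delta v$, i.e. the model has an easy feature to condition on.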
What Can We Do?
- Measure $\Delta v$ and try to minimize it (distribution matching)
- Use interpretability to check whether the model represents "am I being tested?"
- Accept that sufficiently capable models may always detect testing, and design safety mechanisms that don't rely on undetectable evaluation