Prior reading: Probing | Reachability Analysis | Mesa-Optimization and Three Lenses

The Problem

If a model can detect when it's being tested, it can behave differently during evaluation than in deployment. This undermines every safety benchmark.

Structural vs. Semantic Detection

Structural detection: The test environment is literally different — different API calls, different system prompts, different input distributions. The model picks up on artifacts of the testing setup.

Semantic detection: The model understands the meaning of being tested. It recognizes evaluation-style questions, red-team prompts, or benchmark formats from training data.

Why This Is Hard to Fix

  • Structural: You can try to make test and deploy environments identical, but there's always a distribution gap.
  • Semantic: If the model is smart enough, any test that a human can recognize as a test, the model can too. You can't make a test that doesn't look like a test to a sufficiently capable system.

A Vector Analysis Approach

Consider the model's internal representations as vectors in activation space.

  • Let $v_{\text{test}}$ be the mean activation vector during testing
  • Let $v_{\text{deploy}}$ be the mean activation vector during deployment
  • The difference $\Delta v = v_{\text{test}} - v_{\text{deploy}}$ is the "test detection direction"

If $|\Delta v|$ is large relative to task-relevant variation, the model has a detectable signal to condition on.

What Can We Do?

  • Measure $\Delta v$ and try to minimize it (distribution matching)
  • Use interpretability to check whether the model represents "am I being tested?"
  • Accept that sufficiently capable models may always detect testing, and design safety mechanisms that don't rely on undetectable evaluation