Detectability of Testing

Prior reading: Probing | Reachability Analysis | Mesa-Optimization and Three Lenses

The Problem

If a model can detect when it's being tested, it can behave differently during evaluation than in deployment. This undermines every safety benchmark.

Structural vs. Semantic Detection

Structural detection: The test environment is literally different: different API calls, different system prompts, different input distributions. The model picks up on artifacts of the testing setup.

Semantic detection: The model understands the meaning of being tested. It recognizes evaluation-style questions, red-team prompts, or benchmark formats from training data. ...

January 7, 2026 · 2 min · Austin T. O'Quinn