Prior reading: Layers of Safety | The Specification and Language Problem
P-Hacking Is Optimization
When a researcher tries many analyses and reports the one that gives $p < 0.05$, they're optimizing over the space of statistical tests. The "loss function" is the p-value. The "training data" is the dataset. The result is overfitting to noise.
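This is easy to demonstrate with a toy simulation. The sketch below generates pure noise and runs many arbitrary "subgroup analyses" on it, keeping only the best p-value; the subgroup splits are hypothetical stand-ins for a researcher's free choices (covariates, exclusions, test variants), and the p-value uses a simple normal approximation rather than a full t-test:

```python
import math
import random

random.seed(0)

def two_sided_p(a, b):
    # Normal (z) approximation to a two-sample test; fine for illustration.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(abs(z) / math.sqrt(2))

def best_p(n=60, n_analyses=40):
    # Pure noise: there is no real effect anywhere in this data.
    data = [random.gauss(0, 1) for _ in range(n)]
    p_values = []
    for _ in range(n_analyses):
        # Each "analysis" is an arbitrary subgroup split of the same data.
        mask = [random.random() < 0.5 for _ in range(n)]
        a = [x for x, m in zip(data, mask) if m]
        b = [x for x, m in zip(data, mask) if not m]
        if len(a) < 2 or len(b) < 2:
            continue
        p_values.append(two_sided_p(a, b))
    return min(p_values)  # report only the "best" analysis

hits = sum(best_p() < 0.05 for _ in range(200))
print(f"'significant' results from pure noise: {hits}/200")
```

With 40 roughly independent tests, the chance that at least one clears $p < 0.05$ is near $1 - 0.95^{40} \approx 0.87$, so most runs "find" an effect that isn't there.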
The Structural Parallel to AI Safety
AI safety benchmarks are evaluated the same way:
- Design a safety evaluation
- Test your model
- Report results
- Iterate on the model (or the eval) until the numbers look good
This is gradient descent on benchmark performance. Goodhart's Law applies: the benchmark becomes the target, and the metric diverges from the actual property you care about.
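The divergence can be made concrete with a toy model. Assume (purely for illustration) that the benchmark score decomposes into the true safety property plus an exploitable benchmark-specific artifact; hill-climbing on the benchmark then deliberately trades the true property for the artifact:

```python
import random

random.seed(1)

def true_safety(x):   # the property we actually care about; best at x = 1
    return -(x - 1.0) ** 2

def artifact(x):      # a benchmark quirk the optimizer can exploit
    return 2.0 * x

def benchmark(x):     # what we can measure; best at x = 2
    return true_safety(x) + artifact(x)

x = 0.0
for _ in range(200):
    candidate = x + random.uniform(-0.1, 0.1)
    if benchmark(candidate) > benchmark(x):  # optimize the metric, not the property
        x = candidate

print(f"x={x:.2f}  benchmark={benchmark(x):.2f}  true_safety={true_safety(x):.2f}")
```

The hill-climber drives $x$ past the true optimum at $x = 1$ toward the benchmark optimum at $x = 2$, where the benchmark score is maximal but true safety has degraded back to its starting value.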
Why This Is Worse for Safety
In ML research, benchmark gaming produces inflated accuracy numbers. Annoying but not catastrophic. In safety evaluation, benchmark gaming produces models that appear safe while failing on the distribution that matters — deployment.
The Optimization Stack
- Level 1: The model optimizes its loss function
- Level 2: The researcher optimizes model design against benchmarks
- Level 3: The field optimizes research direction against publication incentives
- Level 4: Companies optimize safety narratives against regulatory pressure
Each level introduces another form of optimization that can Goodhart the level below.
What Can We Do?
- Held-out safety evaluations that model developers can't optimize against
- Diverse, independently designed benchmarks
- Red-teaming by adversarial third parties
- Formal verification for properties that shouldn't depend on benchmarks at all
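The held-out idea can be sketched as a simple gate. This is a hypothetical API, not an existing tool; it assumes a "model" is a callable returning a pass/fail score per test case. The dev suite can be queried freely during iteration, but the held-out suite can be scored exactly once, so it never becomes an optimization target:

```python
class HeldOutEval:
    """Free iteration on dev cases; the held-out suite is single-use."""

    def __init__(self, dev_cases, held_out_cases):
        self.dev_cases = dev_cases
        self._held_out = held_out_cases
        self._spent = False

    def dev_score(self, model):
        # Safe to call repeatedly while iterating on the model.
        return sum(model(c) for c in self.dev_cases) / len(self.dev_cases)

    def final_score(self, model):
        # One shot: a second call means someone is iterating against it.
        if self._spent:
            raise RuntimeError("held-out evaluation already consumed")
        self._spent = True
        return sum(model(c) for c in self._held_out) / len(self._held_out)

# Toy usage: cases are numbers, the "model" passes cases above a threshold.
evaluation = HeldOutEval(dev_cases=[1, 2, 3], held_out_cases=[4, 5, 6])
model = lambda case: 1.0 if case > 2 else 0.0
print("dev:", evaluation.dev_score(model))
print("final:", evaluation.final_score(model))
```

In practice the single-use property has to be enforced organizationally (or by a third party holding the eval), not by a Python flag, but the protocol is the same: the score you report is one you never trained or iterated against.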