Prior reading: Layers of Safety | The Specification and Language Problem

P-Hacking Is Optimization

When a researcher tries many analyses and reports the one that gives $p < 0.05$, they're optimizing over the space of statistical tests. The "loss function" is the p-value. The "training data" is the dataset. The result is overfitting to noise.
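This is easy to see in simulation. The sketch below (a toy model, not any specific study) uses the fact that under the null hypothesis a p-value is Uniform(0, 1), so "running twenty analyses on noise" can be modeled by drawing twenty uniforms and keeping the smallest:

```python
import random

def fraction_significant(n_trials=10_000, n_tests=20, alpha=0.05, seed=0):
    """Simulate a researcher who runs n_tests analyses on pure noise
    and reports only the smallest p-value. Under the null, each
    p-value is Uniform(0, 1), so we draw them directly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        best_p = min(rng.random() for _ in range(n_tests))
        if best_p < alpha:
            hits += 1
    return hits / n_trials

# With 20 shots at alpha = 0.05, "significance" on pure noise is the
# default outcome: analytically, 1 - 0.95**20 ≈ 0.64.
print(fraction_significant())
```

The optimizer never touches the data; it only searches over tests. That is enough to manufacture a result about two-thirds of the time.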

The Structural Parallel to AI Safety

AI safety evaluation follows the same loop:

  1. Design a safety evaluation
  2. Test your model
  3. Report results
  4. Iterate on the model (or the eval) until the numbers look good

This is gradient descent on benchmark performance. Goodhart's Law applies: the benchmark becomes the target, and the metric diverges from the actual property you care about.
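The loop above can be sketched in a few lines. In this toy model (all names and numbers are hypothetical), every candidate model has the same true safety level, and the benchmark is just a noisy, deterministic estimate of it; step 4 amounts to selecting whichever candidate happens to score highest:

```python
import random

TRUE_SAFETY = 0.70  # hypothetical: every candidate model is equally safe

def benchmark_score(model_seed, noise=0.1):
    """A finite benchmark is a noisy estimate of the true property.
    Seeding per model makes re-evaluation deterministic, mimicking a
    fixed test set that always gives the same number back."""
    rng = random.Random(model_seed)
    return TRUE_SAFETY + rng.gauss(0, noise)

# Step 4: iterate over candidates until the numbers look good,
# i.e. keep whichever one scores highest on the fixed benchmark.
best = max(range(100), key=benchmark_score)

print(f"reported benchmark score: {benchmark_score(best):.2f}")
print(f"true safety of winner:    {TRUE_SAFETY:.2f}")
```

The reported score is inflated well above 0.70 even though no candidate is actually safer than any other: the selection process optimized against the benchmark's noise, which is Goodhart's Law in miniature.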

Why This Is Worse for Safety

In ML research, benchmark gaming produces inflated accuracy numbers: annoying, but not catastrophic. In safety evaluation, benchmark gaming produces models that appear safe while failing on the distribution that matters — deployment.

The Optimization Stack

  • Level 1: The model optimizes its loss function
  • Level 2: The researcher optimizes model design against benchmarks
  • Level 3: The field optimizes research direction against publication incentives
  • Level 4: Companies optimize safety narratives against regulatory pressure

Each level introduces another form of optimization that can Goodhart the level below.

What Can We Do?

  • Held-out safety evaluations that model developers can't optimize against
  • Diverse, independently designed benchmarks
  • Red-teaming by adversarial third parties
  • Formal verification for properties that shouldn't depend on benchmarks at all
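The first item is the direct fix for the toy selection loop above. Extending that sketch (again, all names and numbers are hypothetical): give the benchmark a dev split that the selection process sees and a held-out split that stays sealed, and the inflation becomes visible:

```python
import random

TRUE_SAFETY = 0.70  # hypothetical: all candidates are equally safe

def eval_score(model_seed, split_seed, noise=0.1):
    """Noisy estimate of TRUE_SAFETY; each (model, split) pair gets
    independent noise, like disjoint sets of eval questions."""
    rng = random.Random(model_seed * 1_000 + split_seed)
    return TRUE_SAFETY + rng.gauss(0, noise)

DEV, HELD_OUT = 0, 1  # DEV is iterated against; HELD_OUT stays sealed

# Select the candidate that looks best on the dev split ...
best = max(range(100), key=lambda m: eval_score(m, DEV))

# ... and watch the inflated score collapse back toward the true
# value on data the selection process never touched.
print(f"dev score (optimized against): {eval_score(best, DEV):.2f}")
print(f"held-out score:                {eval_score(best, HELD_OUT):.2f}")
```

The held-out number lands near 0.70 while the dev number sits well above it; the gap between the two is a direct measurement of how much the selection loop Goodharted the eval. The catch, of course, is that a held-out eval only stays informative as long as nobody iterates against it.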