Prior reading: Layers of Safety | The Specification and Language Problem
P-Hacking Is Optimization
When a researcher tries many analyses and reports the one that gives $p < 0.05$, they're optimizing over the space of statistical tests. The "loss function" is the p-value. The "training data" is the dataset. The result is overfitting to noise.
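This is easy to demonstrate with a toy simulation. The sketch below generates pure noise and runs many arbitrary "subgroup analyses" on it, keeping only the best p-value; the subgroup splits are hypothetical stand-ins for a researcher's free choices (covariates, exclusions, test variants), and the p-value uses a simple normal approximation rather than a full t-test:

```python
import math
import random

random.seed(0)

def two_sided_p(a, b):
    # Normal (z) approximation to a two-sample test; fine for illustration.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(abs(z) / math.sqrt(2))

def best_p(n=60, n_analyses=40):
    # Pure noise: there is no real effect anywhere in this data.
    data = [random.gauss(0, 1) for _ in range(n)]
    p_values = []
    for _ in range(n_analyses):
        # Each "analysis" is an arbitrary subgroup split of the same data.
        mask = [random.random() < 0.5 for _ in range(n)]
        a = [x for x, m in zip(data, mask) if m]
        b = [x for x, m in zip(data, mask) if not m]
        if len(a) < 2 or len(b) < 2:
            continue
        p_values.append(two_sided_p(a, b))
    return min(p_values)  # report only the "best" analysis

hits = sum(best_p() < 0.05 for _ in range(200))
print(f"'significant' results from pure noise: {hits}/200")
```

With 40 roughly independent tests, the chance that at least one clears $p < 0.05$ is near $1 - 0.95^{40} \approx 0.87$, so most runs "find" an effect that isn't there.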
The Structural Parallel to AI Safety
AI safety benchmarks are evaluated the same way:
- Design a safety evaluation
- Test your model
- Report results
- Iterate on the model (or the eval) until the numbers look good
This is gradient descent on benchmark performance. Goodhart's Law applies: the benchmark becomes the target, and the metric diverges from the actual property you care about.
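The divergence can be made concrete with a toy model. Assume (purely for illustration) that the benchmark score decomposes into the true safety property plus an exploitable benchmark-specific artifact; hill-climbing on the benchmark then deliberately trades the true property for the artifact:

```python
import random

random.seed(1)

def true_safety(x):   # the property we actually care about; best at x = 1
    return -(x - 1.0) ** 2

def artifact(x):      # a benchmark quirk the optimizer can exploit
    return 2.0 * x

def benchmark(x):     # what we can measure; best at x = 2
    return true_safety(x) + artifact(x)

x = 0.0
for _ in range(200):
    candidate = x + random.uniform(-0.1, 0.1)
    if benchmark(candidate) > benchmark(x):  # optimize the metric, not the property
        x = candidate

print(f"x={x:.2f}  benchmark={benchmark(x):.2f}  true_safety={true_safety(x):.2f}")
```

The hill-climber drives $x$ past the true optimum at $x = 1$ toward the benchmark optimum at $x = 2$, where the benchmark score is maximal but true safety has degraded back to its starting value.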
Why This Is Worse for Safety
In ML research, benchmark gaming produces inflated accuracy numbers. Annoying but not catastrophic. In safety evaluation, benchmark gaming produces models that appear safe while failing on the distribution that matters — deployment.
The Optimization Stack
- Level 1: The model optimizes its loss function
- Level 2: The researcher optimizes model design against benchmarks
- Level 3: The field optimizes research direction against publication incentives
- Level 4: Companies optimize safety narratives against regulatory pressure
Each level introduces another form of optimization that can Goodhart the level below.
What Can We Do?
- Held-out safety evaluations that model developers can't optimize against
- Diverse, independently designed benchmarks
- Red-teaming by adversarial third parties
- Formal verification for properties that shouldn't depend on benchmarks at all
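The held-out idea can be sketched as a simple gate. This is a hypothetical API, not an existing tool; it assumes a "model" is a callable returning a pass/fail score per test case. The dev suite can be queried freely during iteration, but the held-out suite can be scored exactly once, so it never becomes an optimization target:

```python
class HeldOutEval:
    """Free iteration on dev cases; the held-out suite is single-use."""

    def __init__(self, dev_cases, held_out_cases):
        self.dev_cases = dev_cases
        self._held_out = held_out_cases
        self._spent = False

    def dev_score(self, model):
        # Safe to call repeatedly while iterating on the model.
        return sum(model(c) for c in self.dev_cases) / len(self.dev_cases)

    def final_score(self, model):
        # One shot: a second call means someone is iterating against it.
        if self._spent:
            raise RuntimeError("held-out evaluation already consumed")
        self._spent = True
        return sum(model(c) for c in self._held_out) / len(self._held_out)

# Toy usage: cases are numbers, the "model" passes cases above a threshold.
evaluation = HeldOutEval(dev_cases=[1, 2, 3], held_out_cases=[4, 5, 6])
model = lambda case: 1.0 if case > 2 else 0.0
print("dev:", evaluation.dev_score(model))
print("final:", evaluation.final_score(model))
```

In practice the single-use property has to be enforced organizationally (or by a third party holding the eval), not by a Python flag, but the protocol is the same: the score you report is one you never trained or iterated against.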