P-Hacking as Optimization: Implications for Safety Benchmarking
*Prior reading: Layers of Safety | The Specification and Language Problem*

## P-Hacking Is Optimization

When a researcher tries many analyses and reports the one that gives $p < 0.05$, they're optimizing over the space of statistical tests. The "loss function" is the p-value. The "training data" is the dataset. The result is overfitting to noise.

## The Structural Parallel to AI Safety

AI safety benchmarks are evaluated the same way:

1. Design a safety evaluation
2. Test your model
3. Report results
4. Iterate on the model (or the eval) until the numbers look good

This is gradient descent on benchmark performance. Goodhart's Law applies: the benchmark becomes the target, and the metric diverges from the actual property you care about.

...
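The "optimizing over tests" claim can be made concrete with a small simulation. Under a true null hypothesis, a valid test's p-value is uniform on $[0, 1]$; a p-hacker who runs $k$ analyses and reports only the best one gets a "significant" result with probability $1 - 0.95^k$. This is a minimal sketch (the constants `N_TESTS` and `N_STUDIES` are illustrative choices, not from the text):

```python
import random

random.seed(0)

ALPHA = 0.05
N_TESTS = 20        # analyses tried per "study"
N_STUDIES = 10_000  # Monte Carlo replications

# Under a true null, each test's p-value is uniform on [0, 1].
# A p-hacker runs N_TESTS analyses and reports only the smallest p-value.
false_positives = 0
for _ in range(N_STUDIES):
    best_p = min(random.random() for _ in range(N_TESTS))
    if best_p < ALPHA:
        false_positives += 1

observed = false_positives / N_STUDIES
expected = 1 - (1 - ALPHA) ** N_TESTS
print(f"observed false-positive rate: {observed:.3f}")
print(f"theoretical rate:             {expected:.3f}")
```

With 20 candidate analyses, the nominal 5% error rate inflates to roughly 64%: the reported p-value no longer measures what it claims to, which is exactly the Goodhart failure mode described above.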