Goodhart

AI Cracking a Research Paper: Gaming Peer Review

Prior reading: P-Hacking and Benchmarks | The Specification Problem

The Setup

AI writing assistants can now help researchers improve papers. Sounds good. But "improve" means "optimize for acceptance", and acceptance is determined by reviewers with biases, time constraints, and heuristics.

What the AI Optimizes

Framing: Present results in the most favorable light
Buzzwords: Match the vocabulary reviewers respond to
Structure: Follow templates that reviewers associate with quality
Claims: Calibrate confidence to what reviewers will accept without pushing back
Related work: Cite the likely reviewers' papers

None of this is about being more true. It's about being more accepted. ...

P-Hacking as Optimization: Implications for Safety Benchmarking

Prior reading: Layers of Safety | The Specification and Language Problem

P-Hacking Is Optimization

When a researcher tries many analyses and reports the one that gives $p < 0.05$, they're optimizing over the space of statistical tests. The "loss function" is the p-value. The "training data" is the dataset. The result is overfitting to noise.

The Structural Parallel to AI Safety

AI safety benchmarks are evaluated the same way:

1. Design a safety evaluation
2. Test your model
3. Report results
4. Iterate on the model (or the eval) until the numbers look good

This is gradient descent on benchmark performance. Goodhart's Law applies: the benchmark becomes the target, and the metric diverges from the actual property you care about. ...
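To make "optimizing over the space of statistical tests" concrete, here is a minimal simulation sketch. It is not from the post. It assumes a simple setup where every analysis compares two groups drawn from the same distribution, so any "significant" result is noise by construction. The function names, the sample size of 30 per group, the count of 20 analyses, and the unit-variance z-test are all illustrative assumptions.

```python
import math
import random


def two_sided_p(sample_a, sample_b):
    """Two-sample z-test p-value, assuming both groups have unit variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    z = diff / math.sqrt(1.0 / n_a + 1.0 / n_b)
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))


def best_p_value(rng, n_analyses, n_per_group=30):
    """Run n_analyses independent null comparisons and keep the smallest p."""
    p_values = []
    for _ in range(n_analyses):
        # Both groups come from the same distribution: there is no real effect.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        p_values.append(two_sided_p(a, b))
    return min(p_values)


def false_positive_rate(n_analyses, n_studies=1000, alpha=0.05, seed=0):
    """Fraction of simulated studies whose best analysis reaches p < alpha."""
    rng = random.Random(seed)
    hits = sum(best_p_value(rng, n_analyses) < alpha for _ in range(n_studies))
    return hits / n_studies


honest = false_positive_rate(n_analyses=1)
hacked = false_positive_rate(n_analyses=20)
print(f"one analysis, reported as-is: {honest:.3f}")
print(f"best of 20 analyses:          {hacked:.3f}")
print(f"analytic 1 - 0.95**20:        {1 - 0.95 ** 20:.3f}")
```

Under these assumptions the single-analysis rate should sit near the nominal 0.05, while the best-of-20 rate should land near the analytic value of about 0.64. Taking the minimum over analyses is the optimization step, and it is all that separates the two numbers.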

Layers of Safety: From Data to Deployment

Prior reading: What Are Formal Methods? | What Is RL?

Safety Is a Stack

No single technique makes AI safe. Safety comes from layers, each catching different failure modes.

The Layers

1. Clean Data

Garbage in, garbage out. Data quality determines the prior the model learns from. Biased, toxic, or incorrect data embeds those properties in the model.

2. Good Specifications and Benchmarks

What are we optimizing for? Goodhart's Law applies: once the benchmark becomes the target, it stops being a good benchmark. Benchmark gaming is optimization against your spec. ...
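One way to see a benchmark stop being a good benchmark once it is the target is a small selection experiment. The sketch below is not from the post and rests on strong simplifying assumptions: every model variant has the same true pass rate (0.80 here), the benchmark is a finite sample of pass/fail prompts, and "optimization" just means reporting the best-scoring variant. All names and numbers are hypothetical.

```python
import random


def benchmark_score(rng, true_rate, n_prompts):
    """Observed pass rate on n_prompts, each passed with probability true_rate."""
    passes = sum(rng.random() < true_rate for _ in range(n_prompts))
    return passes / n_prompts


def select_and_recheck(n_variants=50, true_rate=0.80, n_prompts=100,
                       n_trials=500, seed=0):
    """Average reported score of the selected variant vs. its score on a fresh eval."""
    rng = random.Random(seed)
    reported, fresh = [], []
    for _ in range(n_trials):
        # Every variant is equally safe; scores differ only by sampling noise.
        scores = [benchmark_score(rng, true_rate, n_prompts)
                  for _ in range(n_variants)]
        # Optimization step: report the variant that looks best on the benchmark.
        reported.append(max(scores))
        # Re-evaluate the selected variant on new prompts it was not selected on.
        fresh.append(benchmark_score(rng, true_rate, n_prompts))
    return sum(reported) / n_trials, sum(fresh) / n_trials


reported_mean, fresh_mean = select_and_recheck()
print(f"reported benchmark score of selected variant: {reported_mean:.3f}")
print(f"same variant on a fresh evaluation:           {fresh_mean:.3f}")
```

In this toy setup the reported score should come out noticeably above 0.80, while the fresh evaluation should fall back to roughly 0.80. The gap comes from selection alone, since no variant is actually safer than any other.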