AI Cracking a Research Paper: Gaming Peer Review

Prior reading: P-Hacking and Benchmarks | The Specification Problem

The Setup

AI writing assistants can now help researchers improve papers. Sounds good. But "improve" means "optimize for acceptance" — and acceptance is determined by reviewers with biases, time constraints, and heuristics.

What the AI Optimizes

- Framing: Present results in the most favorable light
- Buzzwords: Match the vocabulary reviewers respond to
- Structure: Follow templates that reviewers associate with quality
- Claims: Calibrate confidence to what reviewers will accept without pushing back
- Related work: Cite the likely reviewers' papers

None of this is about being more true. It's about being more accepted. ...

March 15, 2026 · 2 min · Austin T. O'Quinn

Stability of Safety

Prior reading: Gradient Descent and Backpropagation | What Are Formal Methods? | Reachability Analysis

Safety as a Point in Parameter Space

A model's behavior is a function of its parameters $\theta$. "Safe behavior" corresponds to a region $\mathcal{S}$ in parameter space. Training moves $\theta$ through this space.

Gradient at a Point

The gradient $\nabla_\theta \mathcal{L}$ tells us which direction training pushes the model. If this direction points out of $\mathcal{S}$, a single update can break safety. ...
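The "single update can break safety" point can be made concrete with a toy sketch. Everything here is my own illustrative assumption, not from the post: a quadratic loss whose minimum lies outside a hypothetical safe region $\mathcal{S}$, taken to be the unit ball.

```python
import numpy as np

def gd_step(theta, grad, lr):
    """One gradient descent update: theta <- theta - lr * grad(theta)."""
    return theta - lr * grad(theta)

# Toy loss L(theta) = ||theta - target||^2: the gradient pulls theta toward
# a target that lies OUTSIDE the safe region S = {theta : ||theta|| <= 1}.
target = np.array([3.0, 0.0])
grad = lambda th: 2.0 * (th - target)
in_safe_region = lambda th: np.linalg.norm(th) <= 1.0

theta = np.array([0.9, 0.0])          # starts inside S
print(in_safe_region(theta))          # True
theta = gd_step(theta, grad, lr=0.1)  # gradient points out of S here
print(in_safe_region(theta))          # False: one update left the safe region
```

Nothing about the update rule "knows" about $\mathcal{S}$; safety of the next iterate depends entirely on where $-\nabla_\theta \mathcal{L}$ points at the current $\theta$.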

December 17, 2025 · 2 min · Austin T. O'Quinn

P-Hacking as Optimization: Implications for Safety Benchmarking

Prior reading: Layers of Safety | The Specification and Language Problem

P-Hacking Is Optimization

When a researcher tries many analyses and reports the one that gives $p < 0.05$, they're optimizing over the space of statistical tests. The "loss function" is the p-value. The "training data" is the dataset. The result is overfitting to noise.

The Structural Parallel to AI Safety

AI safety benchmarks are evaluated the same way:

1. Design a safety evaluation
2. Test your model
3. Report results
4. Iterate on the model (or the eval) until the numbers look good

This is gradient descent on benchmark performance. Goodhart's Law applies: the benchmark becomes the target, and the metric diverges from the actual property you care about. ...
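The "optimizing over the space of tests" claim is easy to demonstrate numerically. A minimal sketch (my own toy simulation, assuming SciPy is available): generate pure noise, run one t-test per variable, and report only the best p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pure noise: by construction, NO real effect exists anywhere in this data.
data = rng.standard_normal((20, 100))   # 20 subjects, 100 measured variables
group_a, group_b = data[:10], data[10:]

# "Try many analyses": one two-sample t-test per variable, keep the minimum.
p_values = [stats.ttest_ind(group_a[:, j], group_b[:, j]).pvalue
            for j in range(100)]
best = min(p_values)

# At alpha = 0.05, about 5 of 100 null tests come out "significant" by chance,
# so min(p_values) is almost certainly below 0.05 despite there being no effect.
print(f"best p-value across 100 tests on noise: {best:.4f}")
```

Swap "statistical test" for "benchmark variant" and "dataset" for "eval suite" and the same selection effect produces safety numbers that look good without the underlying property holding.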

October 1, 2025 · 2 min · Austin T. O'Quinn

Mesa-Optimization and the Optimization Pressure Spectrum

Prior reading: Gradient Descent and Backpropagation | Decision Theory for AI Safety | What Is RL?

The Question

Why does an AI system behave the way it does? And why do optimizers keep creating sub-optimizers with misaligned goals? Three frameworks give different (complementary) answers. But first — some terminology that trips people up.

Mesa vs. Meta

I've seen these two confused often enough that it's worth being explicit. I had to look up what "mesa" even meant the first time I encountered it. ...

August 6, 2025 · 13 min · Austin T. O'Quinn

Linear Algebra Proofs of Optimality for Gradient Descent

Prior reading: Gradient Descent and Backpropagation

When Gradient Descent Is Provably Optimal

For convex functions, gradient descent converges to the global minimum. For strongly convex functions, it converges exponentially fast. The proofs are clean linear algebra.

The Convex Case

A function $f$ is convex if for all $x, y$ and all $\lambda \in [0, 1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)$$

This means every local minimum is a global minimum. Gradient descent can't get stuck. ...
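Both claims in the excerpt can be spot-checked numerically. A sketch with a toy strongly convex quadratic of my own choosing (not from the post): verify the convexity inequality at random points, then watch the loss decay geometrically under gradient descent.

```python
import numpy as np

# f(x) = x^T A x with A positive definite is strongly convex.
A = np.array([[2.0, 0.0], [0.0, 10.0]])
f = lambda x: x @ A @ x
grad = lambda x: 2.0 * A @ x

# Spot-check convexity: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y).
rng = np.random.default_rng(1)
for _ in range(1000):
    x, y, lam = rng.standard_normal(2), rng.standard_normal(2), rng.uniform()
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-12

# Gradient descent with a step size below 1/L (L = largest eigenvalue of 2A).
x = np.array([1.0, 1.0])
errors = []
for _ in range(50):
    errors.append(f(x))
    x = x - 0.05 * grad(x)

# Exponential ("linear rate") convergence: loss shrinks by a constant factor
# each step, so 50 steps drive it from 12.0 to below 1e-6.
print(errors[0], errors[-1])
```

The geometric decay factor here is exactly what the strong-convexity proof predicts; the simulation only illustrates it, it doesn't replace the linear-algebra argument.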

January 28, 2025 · 2 min · Austin T. O'Quinn

Gradient Descent, Backpropagation, and the Misconceptions That Tripped Me Up

This post starts from the ordinary derivative and builds to gradient descent for neural networks. If you already know multivariable calculus, you can skip to Why the Gradient is Steepest. If you're here for the ML connection, skip to Applied to Machine Learning. But I'd encourage reading the whole thing — several of the "obvious" steps are where my own misconceptions lived.

Starting from One Dimension

The Derivative as a Rate

You have a function $f(x)$. The derivative $f'(x)$ tells you: if you nudge $x$ by a tiny amount $\Delta x$, how much does $f$ change? ...
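The "nudge $x$ and see how much $f$ changes" reading of the derivative is literally computable. A minimal sketch (my own example, not from the post) using a central finite difference:

```python
# f(x) = x^2, so f'(x) = 2x. Nudging x by a small dx changes f by ~ f'(x)*dx.
f = lambda x: x ** 2

def numerical_derivative(f, x, dx=1e-6):
    """Central finite difference: (f(x+dx) - f(x-dx)) / (2*dx)."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)

print(numerical_derivative(f, 3.0))   # close to the exact f'(3) = 6
```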

January 15, 2025 · 36 min · Austin T. O'Quinn