Perfect Shields Create Unsafe Policies

This post is about a paradox in safe reinforcement learning: the better your safety mechanism works during training, the less safe the trained agent might be without it. What's a safety shield? In reinforcement learning, an agent takes actions in an environment, receives rewards, and learns a policy — a mapping from situations to actions. The goal is to maximize cumulative reward. The problem is that during training (and sometimes after), the agent might do dangerous things. ...

April 1, 2026 · 9 min · Austin T. O'Quinn
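The shield described in the excerpt can be sketched as a thin wrapper around the agent's action choice. Everything here (the toy safety predicate, the fallback action) is a hypothetical illustration of the idea, not the post's implementation:

```python
# Hypothetical sketch of a safety shield: a wrapper that intercepts the
# agent's proposed action and substitutes a safe fallback when the
# proposal is judged unsafe. The environment and predicate are toys.

def is_unsafe(state, action):
    # Toy safety predicate: stepping right (+1) at or past position 3
    # is considered dangerous.
    return state >= 3 and action == +1

def shielded_step(state, proposed_action, fallback_action=0):
    """Return the action actually executed: the agent's proposal if it
    is safe, otherwise the fallback. Note the agent never experiences
    the consequence of its unsafe choice, which is the paradox the
    post describes."""
    if is_unsafe(state, proposed_action):
        return fallback_action
    return proposed_action
```

Because the shield silently replaces unsafe actions, the learned policy receives no training signal about why those actions were bad, so removing the shield at deployment can expose behavior the agent never learned to avoid.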

Detectability of Testing

Prior reading: Probing | Reachability Analysis | Mesa-Optimization and Three Lenses. The Problem: If a model can detect when it's being tested, it can behave differently during evaluation than in deployment. This undermines every safety benchmark. Structural vs. Semantic Detection. Structural detection: The test environment is literally different — different API calls, different system prompts, different input distributions. The model picks up on artifacts of the testing setup. Semantic detection: The model understands the meaning of being tested. It recognizes evaluation-style questions, red-team prompts, or benchmark formats from training data. ...

January 7, 2026 · 2 min · Austin T. O'Quinn

Stability of Safety

Prior reading: Gradient Descent and Backpropagation | What Are Formal Methods? | Reachability Analysis. Safety as a Point in Parameter Space: A model's behavior is a function of its parameters $\theta$. "Safe behavior" corresponds to a region $\mathcal{S}$ in parameter space. Training moves $\theta$ through this space. Gradient at a Point: The gradient $\nabla_\theta \mathcal{L}$ tells us which direction training pushes the model. If this direction points out of $\mathcal{S}$, a single update can break safety. ...

December 17, 2025 · 2 min · Austin T. O'Quinn
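The excerpt's point that a single gradient step can leave the safe region $\mathcal{S}$ can be made concrete under a strong simplifying assumption not made in the post: treating $\mathcal{S}$ as a ball of radius $r$ around a known safe parameter vector. All names below are illustrative:

```python
import numpy as np

# Toy illustration (assumed setup): the "safe region" S is a ball of
# radius r around theta_safe. We check whether one gradient-descent
# update theta - lr * grad stays inside S.

def still_safe(theta, grad, lr, theta_safe, r):
    """True iff the parameters after one descent step remain in S."""
    theta_next = theta - lr * grad
    return np.linalg.norm(theta_next - theta_safe) <= r
```

The same step direction can be safe or unsafe depending only on the learning rate, which is the fragility the excerpt is pointing at: safety is a property of a point in parameter space, and training keeps moving that point.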

The Specification and Language Problem

Prior reading: What Are Formal Methods? | Model Checking and Formal Specification. The Problem: Formal methods can prove a system satisfies a specification. The hard part is writing the specification. "Be helpful and don't cause harm" is not a formal spec. Turning it into one requires resolving ambiguity, edge cases, and value judgments that humans can't even agree on in natural language. Specification as Translation: Every spec is a translation from human intent to formal language. Translation is lossy. The gap between what we mean and what we write is the specification problem. ...

June 4, 2025 · 2 min · Austin T. O'Quinn

Model Checking and Formal Specification

Prior reading: What Are Formal Methods? What Is Model Checking? Given a model of a system and a property expressed in formal logic, model checking exhaustively verifies whether the property holds in all reachable states. Temporal Logic Crash Course. LTL (Linear Temporal Logic): "Eventually the system reaches a safe state." $\Diamond \text{safe}$. CTL (Computation Tree Logic): "On all paths, the system never enters an unsafe state." $\forall\Box \neg \text{unsafe}$. These let you express liveness, safety, and fairness properties. Classical Model Checking: Works brilliantly for finite-state systems: protocols, hardware, control software. Tools like SPIN, NuSMV, and PRISM are mature. ...

May 21, 2025 · 1 min · Austin T. O'Quinn
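The exhaustive check the excerpt describes can be sketched for an explicit finite-state system: verifying the safety property $\forall\Box \neg \text{unsafe}$ amounts to searching every reachable state. The transition system below is made up for illustration, not taken from the post:

```python
from collections import deque

# Minimal explicit-state model checking of the CTL safety property
# AG(not unsafe): breadth-first search over all states reachable from
# the initial state, failing if any satisfies the `unsafe` predicate.

def holds_globally(initial, transitions, unsafe):
    """Return True iff no reachable state satisfies `unsafe`.

    transitions: dict mapping each state to its successor states."""
    seen, frontier = {initial}, deque([initial])
    while frontier:
        s = frontier.popleft()
        if unsafe(s):
            return False  # counterexample state found
        for t in transitions.get(s, []):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return True  # exhausted the reachable state space

# Example: state 3 is unsafe but unreachable from state 0.
system = {0: [1], 1: [2], 2: [0], 3: [3]}
```

Real tools like SPIN and NuSMV do essentially this, plus symbolic state representations and automata constructions for liveness properties, which is what makes them scale to protocols and hardware.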

Reachability Analysis for Neural Networks

Prior reading: What Are Formal Methods? | Linear Algebra Proofs of Optimality. The Core Question: Given a neural network $f$ and an input region $\mathcal{X}$, what is the set of all possible outputs $f(\mathcal{X})$? If we can bound this set, we can prove properties like "no input in this region produces a harmful output." Why This Is Hard: Neural networks are piecewise-linear (ReLU) or worse (sigmoid, softmax). The number of linear regions grows exponentially with depth. Exact reachability is NP-hard. ...

May 7, 2025 · 1 min · Austin T. O'Quinn
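One standard sound-but-conservative answer to the excerpt's core question is interval bound propagation: push an axis-aligned box through the network layer by layer. This sketch assumes a simple affine-plus-ReLU structure and is illustrative, not the post's method:

```python
import numpy as np

# Interval bound propagation: a sound over-approximation of f(X) for a
# ReLU network. The box [lo, hi] is propagated through each layer; the
# true output set is contained in the final box, possibly loosely.

def affine_bounds(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b          # image of the box's center
    r = np.abs(W) @ radius      # worst-case spread in each output dim
    return c - r, c + r

def relu_bounds(lo, hi):
    """ReLU is monotone, so it maps box bounds to box bounds exactly."""
    return np.maximum(lo, 0), np.maximum(hi, 0)
```

The over-approximation is what makes the method tractable despite exact reachability being NP-hard: each layer's bound is cheap to compute, at the cost of the final box being looser than the true output set.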

What Are Formal Methods in AI Safety?

What Are Formal Methods? Formal methods give mathematical guarantees about system behavior. Instead of testing examples and hoping they generalize, you prove properties hold for all inputs in a defined space. How Other Safety Approaches Work. Empirical testing: Run inputs, check outputs. Only covers what you test. Red-teaming: Adversarial search for failures. Better coverage, still incomplete. RLHF / Constitutional AI: Shape behavior through training signal. No guarantees on out-of-distribution inputs. Interpretability: Understand what the model learned. Doesn't directly constrain behavior. What Formal Methods Add. Soundness: If the proof says "this property holds," it holds. Period. Completeness trade-offs: Sound methods may be conservative — they might say "I can't prove it's safe" when it actually is. Scalability: The core challenge. Formal methods work beautifully on small systems. Neural networks have millions of parameters. The Gap: Why aren't we just doing formal verification on everything? Because: ...

April 23, 2025 · 1 min · Austin T. O'Quinn