Layers of Safety: From Data to Deployment

Prior reading: What Are Formal Methods? | What Is RL?

Safety Is a Stack

No single technique makes AI safe. Safety comes from layers, each catching different failure modes.

The Layers

1. Clean Data. Garbage in, garbage out. Data quality determines the prior the model learns from. Biased, toxic, or incorrect data embeds those properties in the model.
2. Good Specifications and Benchmarks. What are we optimizing for? Goodhart's Law applies: once the benchmark becomes the target, it stops being a good benchmark. Benchmark gaming is optimization against your spec. ...

August 20, 2025 · 2 min · Austin T. O'Quinn

What Are Formal Methods in AI Safety?

What Are Formal Methods?

Formal methods give mathematical guarantees about system behavior. Instead of testing examples and hoping they generalize, you prove that properties hold for all inputs in a defined space.

How Other Safety Approaches Work

- Empirical testing: Run inputs, check outputs. Only covers what you test.
- Red-teaming: Adversarial search for failures. Better coverage, still incomplete.
- RLHF / Constitutional AI: Shape behavior through the training signal. No guarantees out of distribution.
- Interpretability: Understand what the model learned. Doesn't directly constrain behavior.

What Formal Methods Add

- Soundness: If the proof says "this property holds," it holds. Period.
- Completeness trade-offs: Sound methods may be conservative; they might say "I can't prove it's safe" when it actually is.
- Scalability: The core challenge. Formal methods work beautifully on small systems. Neural networks have millions of parameters.

The Gap

Why aren't we just doing formal verification on everything? Because: ...
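The testing-versus-verification contrast above can be sketched in a few lines. This is a toy illustration, not a real verifier: the "system" is a hypothetical saturating-add function over an 8-bit input space, chosen so the whole space can be enumerated. Sampling checks only the inputs you pick; the exhaustive pass checks every input in the defined space, which is the spirit of a formal guarantee.

```python
def clamp_add(x):
    """Toy 'system' under verification: saturating add of 16 on an 8-bit value."""
    return min(x + 16, 255)

def property_holds(x):
    """The spec: output always stays within 0..255."""
    return 0 <= clamp_add(x) <= 255

# Empirical testing: a handful of samples. Only covers what you test.
samples_ok = all(property_holds(x) for x in [0, 100, 255])

# Exhaustive check over the entire defined input space 0..255.
# For a finite space this IS a proof; real formal methods reason
# symbolically because real input spaces are far too large to enumerate.
verified = all(property_holds(x) for x in range(256))

print(samples_ok, verified)  # True True
```

Real tools (SMT solvers, abstract interpretation, neural-network verifiers) replace the brute-force loop with symbolic reasoning, which is exactly where the scalability challenge described above comes from.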

April 23, 2025 · 1 min · Austin T. O'Quinn