Prior reading: What Are Formal Methods? | What Is RL?
Safety Is a Stack
No single technique makes AI safe. Safety comes from layers, each catching different failure modes.
The Layers
1. Clean Data
Garbage in, garbage out. Data quality determines the prior the model learns from. Biased, toxic, or incorrect data embeds those properties in the model.
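A minimal sketch of what "clean data" can mean in practice: exact-duplicate removal plus a crude quality heuristic. The scoring rule and threshold here are hypothetical illustrations, not a production filter.

```python
# Toy pre-training data filter (heuristics and thresholds are hypothetical).
# Two cheap checks: exact-duplicate removal and a crude quality score.

def quality_score(text: str) -> float:
    """Penalize very short docs and docs with a low alphabetic ratio."""
    if not text:
        return 0.0
    alpha = sum(c.isalpha() for c in text) / len(text)
    length_ok = min(len(text) / 200, 1.0)  # prefer docs >= 200 chars
    return alpha * length_ok

def clean(corpus: list[str], threshold: float = 0.5) -> list[str]:
    seen = set()
    kept = []
    for doc in corpus:
        key = doc.strip().lower()
        if key in seen:  # drop exact duplicates
            continue
        seen.add(key)
        if quality_score(doc) >= threshold:
            kept.append(doc)
    return kept

docs = [
    "x" * 300,                                # long enough, kept
    "a" * 10,                                 # too short, dropped
    "Some reasonable article text. " * 10,    # kept
    "Some reasonable article text. " * 10,    # exact duplicate, dropped
]
print(len(clean(docs)))
```

Real pipelines add near-duplicate detection, toxicity classifiers, and provenance checks; the point is that each filter removes a different kind of garbage before it becomes the model's prior.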
2. Good Specifications and Benchmarks
What are we optimizing for? Goodhart's Law applies: once the benchmark becomes the target, it ceases to be a good benchmark. Benchmark gaming is the model optimizing your proxy metric instead of your intent.
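A toy illustration of Goodhart's Law in action, with made-up numbers: answer length stands in for a "thoroughness" benchmark, and selecting for the proxy picks a worse answer than selecting for true quality.

```python
# Goodhart toy: a proxy metric (length) vs. true quality (hypothetical scores).
candidates = [
    {"text": "short, correct",         "true_quality": 0.9},
    {"text": "padded, rambling " * 5,  "true_quality": 0.3},
]

def proxy(c):
    """Benchmark proxy: longer answers look 'more thorough'."""
    return len(c["text"])

best_by_proxy = max(candidates, key=proxy)
best_by_truth = max(candidates, key=lambda c: c["true_quality"])
print(best_by_proxy is best_by_truth)  # False: the proxy picks the padded answer
```

Once a model (or its developers) optimizes the proxy directly, the gap between proxy and intent is exactly where the gaming happens.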
3. Model Bias (Architecture)
The architecture constrains what functions the model can represent. Inductive biases (convolutions assume locality, attention assumes relevance weighting) shape what the model can learn. Choose architectures whose biases align with safety.
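The locality bias of convolutions can be shown in a few lines: each output position depends only on inputs inside the kernel window, so perturbing a distant input cannot change it. A minimal sketch, not any particular framework's API:

```python
# Convolution's locality bias: output[i] depends only on x[i : i + k],
# so a perturbation far outside the window leaves it unchanged.

def conv1d(x, kernel):
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [0.0] * 10
kernel = [1.0, 1.0, 1.0]
y0 = conv1d(x, kernel)[0]      # output at position 0 sees only x[0..2]

x2 = x[:]
x2[9] = 100.0                  # perturb a far-away input
y0_perturbed = conv1d(x2, kernel)[0]
print(y0 == y0_perturbed)      # True: distant inputs can't reach position 0
```

Attention, by contrast, lets every output attend to every input, trading the locality constraint for learned relevance weighting; which bias is "safer" depends on what you want the model unable to do.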
4. Learning Bias (Optimizer)
SGD + regularization + learning rate schedules create implicit biases toward simpler, flatter solutions. These biases affect which of the many possible solutions the model converges to.
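A concrete toy of this implicit-bias effect: fitting the underdetermined constraint w1 + w2 = 1, which has infinitely many zero-loss solutions. Gradient descent with weight decay (hyperparameters chosen for illustration) converges to the symmetric minimum-norm solution even from a lopsided start.

```python
# Learning bias toy: many solutions fit w1 + w2 = 1; weight decay
# steers gradient descent toward the minimum-norm one (w1 == w2).

def step(w1, w2, lr=0.1, decay=0.1):
    err = w1 + w2 - 1.0                # gradient of the data-fit term
    g1 = 2 * err + 2 * decay * w1      # plus L2 penalty gradient
    g2 = 2 * err + 2 * decay * w2
    return w1 - lr * g1, w2 - lr * g2

w1, w2 = 5.0, -4.0                     # fits the data, but large norm
for _ in range(2000):
    w1, w2 = step(w1, w2)
print(round(w1, 2), round(w2, 2))      # both weights land near 0.5
```

With decay set to zero, gradient descent would preserve the initial asymmetry between the weights; the regularizer is what selects the "simpler" solution among the many that fit.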
5. Environmental Selection (RLHF, Constitutional AI)
Post-training alignment techniques shape behavior through feedback. Effective but not guaranteed — the model may learn to satisfy the feedback mechanism rather than the intent behind it.
6. Evaluation and Testing
Red-teaming, adversarial testing, formal verification. Each catches different failures. None is complete alone.
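Why no single evaluation is complete, in miniature: a naive blocklist guardrail passes its unit test, but a trivial red-team transform (case change) bypasses it. The filter and strings are hypothetical, chosen only to show the gap between scripted tests and adversarial ones.

```python
# Toy guardrail: passes a unit test, fails a trivial adversarial probe.
BLOCKLIST = {"forbidden"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    return any(word in prompt for word in BLOCKLIST)

assert naive_filter("a forbidden request")   # unit test: catches the obvious case
red_team = "a FORBIDDEN request"             # adversarial variant
print(naive_filter(red_team))                # False: case change slips through
```

Unit tests check the cases you thought of; red-teaming searches for the ones you didn't; formal verification covers a specified property exhaustively but only that property. The layers overlap imperfectly on purpose.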
7. Deployment Safeguards
Monitoring, rate limiting, human-in-the-loop, kill switches. The last line of defense.
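Two of these safeguards composed, as a sketch (the class, limits, and flagging logic are all hypothetical): a sliding-window rate limit in front of a human-in-the-loop escalation path for flagged requests.

```python
# Sketch of layered deployment safeguards (all thresholds hypothetical):
# a sliding-window rate limit plus human escalation for risky requests.

class Safeguards:
    def __init__(self, max_per_window: int = 3, window_s: float = 60.0):
        self.max = max_per_window
        self.window = window_s
        self.calls: list[float] = []

    def handle(self, request: str, risky: bool, now: float) -> str:
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.max:
            return "rate_limited"
        self.calls.append(now)
        if risky:
            return "escalated_to_human"   # human-in-the-loop path
        return "served"

sg = Safeguards()
results = [sg.handle("q", risky=(i == 1), now=float(i)) for i in range(5)]
print(results)
```

Note the ordering: the rate limit runs before anything else, so even the escalation path is protected from being flooded.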
The Key Insight
Each layer can fail. The system is only safe if enough layers hold. Defense in depth — not perfection at any single layer.
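The defense-in-depth arithmetic, back of the envelope: if each layer independently misses a failure with probability p_i, the stack misses it only when every layer does, i.e. with probability the product of the p_i. The rates below are made up, and the independence assumption is optimistic (correlated failures are the real danger), but the compounding effect is the point.

```python
# Defense-in-depth arithmetic (rates hypothetical, independence assumed):
# the stack fails only if every layer fails, so miss rates multiply.
from math import prod

layer_miss = [0.1, 0.2, 0.3, 0.1]   # per-layer miss probabilities
stack_miss = prod(layer_miss)
print(stack_miss)                    # ~6e-4, far below any single layer
```

Four individually mediocre layers (misses of 10-30%) combine to a miss rate around 0.06% — the case for depth over perfection at any one layer.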