Defense-in-Depth

Prior reading: What Are Formal Methods? | What Is RL?

Safety Is a Stack

No single technique makes AI safe. Safety comes from layers, each catching different failure modes.

The Layers

1. Clean Data

Garbage in, garbage out. Data quality determines the prior the model learns from. Biased, toxic, or incorrect data embeds those properties in the model.

2. Good Specifications and Benchmarks

What are we optimizing for? Goodhart's Law applies: once the benchmark becomes the target, it stops being a good benchmark. Benchmark gaming is optimization against your spec.

...
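To make the Goodhart's Law point concrete, here is a minimal sketch under invented assumptions: a toy "is this integer even?" task, a small public benchmark, and a held-out set. The names `BENCHMARK`, `HELD_OUT`, `honest_model`, and `gamed_model` are hypothetical and exist only for this illustration. The gamed model memorizes the benchmark answers, so it scores perfectly on the target while failing to do the actual task.

```python
# Toy illustration of benchmark gaming (all data and names are hypothetical).

# Public benchmark items: (input, correct label for "is x even?").
BENCHMARK = [(2, True), (3, False), (10, True), (7, False)]
# Fresh items the model builder never saw.
HELD_OUT = [(4, True), (9, False), (12, True), (15, False)]


def honest_model(x: int) -> bool:
    """Actually solves the task."""
    return x % 2 == 0


# Lookup table built directly from the benchmark: optimization against the spec.
_MEMORIZED = dict(BENCHMARK)


def gamed_model(x: int) -> bool:
    """Returns the memorized benchmark answer, and guesses False otherwise."""
    return _MEMORIZED.get(x, False)


def accuracy(model, items) -> float:
    """Fraction of items where the model's output matches the label."""
    correct = sum(model(x) == label for x, label in items)
    return correct / len(items)


if __name__ == "__main__":
    for name, model in [("honest", honest_model), ("gamed", gamed_model)]:
        print(name, accuracy(model, BENCHMARK), accuracy(model, HELD_OUT))
```

Both models look identical on the benchmark; only the held-out set separates them. That is one reason a benchmark is treated here as a single layer and not as the whole safety argument.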