Prior reading: What Are Formal Methods? | What Is RL?
Safety Is a Stack
No single technique makes AI safe. Safety comes from layers, each catching different failure modes.
The Layers
1. Clean Data
Garbage in, garbage out. Data quality determines the prior the model learns from. Biased, toxic, or incorrect data embeds those properties in the model.
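A minimal sketch of what "clean data" can mean in practice: exact-duplicate removal plus a crude quality heuristic. The scoring rule and threshold here are hypothetical illustrations, not a production filter.

```python
# Toy pre-training data filter (heuristics and thresholds are hypothetical).
# Two cheap checks: exact-duplicate removal and a crude quality score.

def quality_score(text: str) -> float:
    """Penalize very short docs and docs with a low alphabetic ratio."""
    if not text:
        return 0.0
    alpha = sum(c.isalpha() for c in text) / len(text)
    length_ok = min(len(text) / 200, 1.0)  # prefer docs >= 200 chars
    return alpha * length_ok

def clean(corpus: list[str], threshold: float = 0.5) -> list[str]:
    seen = set()
    kept = []
    for doc in corpus:
        key = doc.strip().lower()
        if key in seen:  # drop exact duplicates
            continue
        seen.add(key)
        if quality_score(doc) >= threshold:
            kept.append(doc)
    return kept

docs = [
    "x" * 300,                                # long enough, kept
    "a" * 10,                                 # too short, dropped
    "Some reasonable article text. " * 10,    # kept
    "Some reasonable article text. " * 10,    # exact duplicate, dropped
]
print(len(clean(docs)))
```

Real pipelines add near-duplicate detection, toxicity classifiers, and provenance checks; the point is that each filter removes a different kind of garbage before it becomes the model's prior.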
2. Good Specifications and Benchmarks
What are we optimizing for? Goodhart's Law applies: once the benchmark becomes the target, it ceases to be a good benchmark. Benchmark gaming is the model optimizing your proxy metric instead of your intent.
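A toy illustration of Goodhart's Law in action, with made-up numbers: answer length stands in for a "thoroughness" benchmark, and selecting for the proxy picks a worse answer than selecting for true quality.

```python
# Goodhart toy: a proxy metric (length) vs. true quality (hypothetical scores).
candidates = [
    {"text": "short, correct",         "true_quality": 0.9},
    {"text": "padded, rambling " * 5,  "true_quality": 0.3},
]

def proxy(c):
    """Benchmark proxy: longer answers look 'more thorough'."""
    return len(c["text"])

best_by_proxy = max(candidates, key=proxy)
best_by_truth = max(candidates, key=lambda c: c["true_quality"])
print(best_by_proxy is best_by_truth)  # False: the proxy picks the padded answer
```

Once a model (or its developers) optimizes the proxy directly, the gap between proxy and intent is exactly where the gaming happens.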
3. Model Bias (Architecture)
The architecture constrains what functions the model can represent. Inductive biases (convolutions assume locality, attention assumes relevance weighting) shape what the model can learn. Choose architectures whose biases align with safety.
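The locality bias of convolutions can be shown in a few lines: each output position depends only on inputs inside the kernel window, so perturbing a distant input cannot change it. A minimal sketch, not any particular framework's API:

```python
# Convolution's locality bias: output[i] depends only on x[i : i + k],
# so a perturbation far outside the window leaves it unchanged.

def conv1d(x, kernel):
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [0.0] * 10
kernel = [1.0, 1.0, 1.0]
y0 = conv1d(x, kernel)[0]      # output at position 0 sees only x[0..2]

x2 = x[:]
x2[9] = 100.0                  # perturb a far-away input
y0_perturbed = conv1d(x2, kernel)[0]
print(y0 == y0_perturbed)      # True: distant inputs can't reach position 0
```

Attention, by contrast, lets every output attend to every input, trading the locality constraint for learned relevance weighting; which bias is "safer" depends on what you want the model unable to do.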
4. Learning Bias (Optimizer)
SGD + regularization + learning rate schedules create implicit biases toward simpler, flatter solutions. These biases affect which of the many possible solutions the model converges to.
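A concrete toy of this implicit-bias effect: fitting the underdetermined constraint w1 + w2 = 1, which has infinitely many zero-loss solutions. Gradient descent with weight decay (hyperparameters chosen for illustration) converges to the symmetric minimum-norm solution even from a lopsided start.

```python
# Learning bias toy: many solutions fit w1 + w2 = 1; weight decay
# steers gradient descent toward the minimum-norm one (w1 == w2).

def step(w1, w2, lr=0.1, decay=0.1):
    err = w1 + w2 - 1.0                # gradient of the data-fit term
    g1 = 2 * err + 2 * decay * w1      # plus L2 penalty gradient
    g2 = 2 * err + 2 * decay * w2
    return w1 - lr * g1, w2 - lr * g2

w1, w2 = 5.0, -4.0                     # fits the data, but large norm
for _ in range(2000):
    w1, w2 = step(w1, w2)
print(round(w1, 2), round(w2, 2))      # both weights land near 0.5
```

With decay set to zero, gradient descent would preserve the initial asymmetry between the weights; the regularizer is what selects the "simpler" solution among the many that fit.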
5. Environmental Selection (RLHF, Constitutional AI)
Post-training alignment techniques shape behavior through feedback. Effective but not guaranteed — the model may learn to satisfy the feedback mechanism rather than the intent behind it.
6. Evaluation and Testing
Red-teaming, adversarial testing, formal verification. Each catches different failures. None is complete alone.
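Why no single evaluation is complete, in miniature: a naive blocklist guardrail passes its unit test, but a trivial red-team transform (case change) bypasses it. The filter and strings are hypothetical, chosen only to show the gap between scripted tests and adversarial ones.

```python
# Toy guardrail: passes a unit test, fails a trivial adversarial probe.
BLOCKLIST = {"forbidden"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    return any(word in prompt for word in BLOCKLIST)

assert naive_filter("a forbidden request")   # unit test: catches the obvious case
red_team = "a FORBIDDEN request"             # adversarial variant
print(naive_filter(red_team))                # False: case change slips through
```

Unit tests check the cases you thought of; red-teaming searches for the ones you didn't; formal verification covers a specified property exhaustively but only that property. The layers overlap imperfectly on purpose.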
7. Deployment Safeguards
Monitoring, rate limiting, human-in-the-loop, kill switches. The last line of defense.
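Two of these safeguards composed, as a sketch (the class, limits, and flagging logic are all hypothetical): a sliding-window rate limit in front of a human-in-the-loop escalation path for flagged requests.

```python
# Sketch of layered deployment safeguards (all thresholds hypothetical):
# a sliding-window rate limit plus human escalation for risky requests.

class Safeguards:
    def __init__(self, max_per_window: int = 3, window_s: float = 60.0):
        self.max = max_per_window
        self.window = window_s
        self.calls: list[float] = []

    def handle(self, request: str, risky: bool, now: float) -> str:
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.max:
            return "rate_limited"
        self.calls.append(now)
        if risky:
            return "escalated_to_human"   # human-in-the-loop path
        return "served"

sg = Safeguards()
results = [sg.handle("q", risky=(i == 1), now=float(i)) for i in range(5)]
print(results)
```

Note the ordering: the rate limit runs before anything else, so even the escalation path is protected from being flooded.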
The Key Insight
Each layer can fail. The system is only safe if enough layers hold. Defense in depth — not perfection at any single layer.
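The defense-in-depth arithmetic, back of the envelope: if each layer independently misses a failure with probability p_i, the stack misses it only when every layer does, i.e. with probability the product of the p_i. The rates below are made up, and the independence assumption is optimistic (correlated failures are the real danger), but the compounding effect is the point.

```python
# Defense-in-depth arithmetic (rates hypothetical, independence assumed):
# the stack fails only if every layer fails, so miss rates multiply.
from math import prod

layer_miss = [0.1, 0.2, 0.3, 0.1]   # per-layer miss probabilities
stack_miss = prod(layer_miss)
print(stack_miss)                    # ~6e-4, far below any single layer
```

Four individually mediocre layers (misses of 10-30%) combine to a miss rate around 0.06% — the case for depth over perfection at any one layer.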