Perfect Shields Create Unsafe Policies

This post is about a paradox in safe reinforcement learning: the better your safety mechanism works during training, the less safe the trained agent might be without it. What's a safety shield? In reinforcement learning, an agent takes actions in an environment, receives rewards, and learns a policy — a mapping from situations to actions. The goal is to maximize cumulative reward. The problem is that during training (and sometimes after), the agent might do dangerous things. ...

April 1, 2026 · 9 min · Austin T. O'Quinn
.