Perfect Shields Create Unsafe Policies
This post is about a paradox in safe reinforcement learning: the better your safety mechanism works during training, the less safe the trained agent might be without it. What's a safety shield? In reinforcement learning, an agent takes actions in an environment, receives rewards, and learns a policy — a mapping from situations to actions. The goal is to maximize cumulative reward. The problem is that during training (and sometimes after), the agent might do dangerous things. ...