Perfect Shields Create Unsafe Policies

This post is about a paradox in safe reinforcement learning: the better your safety mechanism works during training, the less safe the trained agent might be without it.

What's a safety shield?

In reinforcement learning, an agent takes actions in an environment, receives rewards, and learns a policy, which is a mapping from situations to actions. The goal is to maximize cumulative reward. The problem is that during training (and sometimes after), the agent might do dangerous things. ...

April 1, 2026 · 9 min · Austin T. O'Quinn

What Is RL, Why Is It So Hard, and How Does It Go Wrong?

Prior reading: Gradient Descent and Backpropagation | Loss Functions and Spaces

The Setup

An agent takes actions in an environment, receives rewards, and learns a policy $\pi(a|s)$ that maximizes expected cumulative reward.

$$\max_\pi \mathbb{E}\left[\sum_{t=0}^T \gamma^t r_t\right]$$

Where Are the Neural Nets?

Neural networks serve as function approximators for things that are too complex to represent exactly:

- Policy network $\pi_\theta(a|s)$: maps states to action probabilities
- Value network $V_\phi(s)$: estimates expected future reward from a state
- Q-network $Q_\psi(s,a)$: estimates expected future reward from a state-action pair
- World model (optional) $p_\omega(s'|s,a)$: predicts next state

Without neural nets, RL only works for tiny state spaces (tabular methods). Neural nets let it scale to images, language, and continuous control. ...
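To make the objective above concrete, here is a minimal sketch of the quantity inside the expectation, $\sum_{t=0}^T \gamma^t r_t$, computed for a single episode, plus a simple average over episodes as a stand-in for the expectation. The function names `discounted_return` and `estimate_objective` and the example rewards are illustrative assumptions, not taken from the post.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode.

    Hypothetical helper: accumulates from the last step backwards,
    so each reward is discounted once per step that precedes it.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of the expected discounted return.

    Assumes `episodes` is a non-empty list of reward sequences
    collected by running one fixed policy.
    """
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Made-up rewards for illustration: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
single = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
average = estimate_objective([[1.0, 0.0, 2.0], [0.0, 1.0]], gamma=0.5)
```

A smaller $\gamma$ makes the agent weigh near-term rewards more heavily. With $\gamma = 1$ the return is the plain sum of rewards.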

February 26, 2025 · 3 min · Austin T. O'Quinn