Prior reading: Gradient Descent and Backpropagation | Loss Functions and Spaces
The Setup
An agent takes actions in an environment, receives rewards, and learns a policy $\pi(a|s)$ that maximizes expected cumulative reward.
$$\max_\pi \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^T \gamma^t r_t\right]$$
where the expectation is over trajectories $\tau$ induced by the policy, and $\gamma \in [0, 1)$ discounts future rewards.
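The objective above can be sketched as a plain interaction loop. `env` and `policy` are hypothetical stand-ins for any Gym-style environment and any state-to-action function:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

def run_episode(env, policy, max_steps=1000):
    """Collect one trajectory: the data RL learns from."""
    s = env.reset()
    rewards = []
    for _ in range(max_steps):
        a = policy(s)                 # sample action from pi(a|s)
        s, r, done = env.step(a)      # environment transitions, emits reward
        rewards.append(r)
        if done:
            break
    return rewards

# discounted_return([1.0, 1.0, 1.0], gamma=0.5) = 1 + 0.5 + 0.25 = 1.75
```

RL then adjusts the policy's parameters so that trajectories with high discounted return become more likely.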
Where Are the Neural Nets?
Neural networks serve as function approximators for things that are too complex to represent exactly:
- Policy network: $\pi_\theta(a|s)$ — maps states to action probabilities
- Value network: $V_\phi(s)$ — estimates expected future reward from a state
- Q-network: $Q_\psi(s,a)$ — estimates expected future reward from a state-action pair
- World model (optional): $p_\omega(s'|s,a)$ — predicts next state
Without neural nets, RL only works for tiny state spaces (tabular methods). Neural nets let it scale to images, language, and continuous control.
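To make "function approximator" concrete, here is a toy policy network $\pi_\theta(a|s)$: one hidden layer mapping a state vector to a distribution over discrete actions. The sizes, random weights, and architecture are arbitrary choices made for the sketch, not a recommended design:

```python
import math
import random

random.seed(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 3

# theta: randomly initialized weight matrices (training would adjust these)
W1 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(STATE_DIM)]
W2 = [[random.gauss(0, 0.1) for _ in range(N_ACTIONS)] for _ in range(HIDDEN)]

def matvec(v, W):
    """Multiply row vector v by matrix W."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def policy(state):
    h = [math.tanh(x) for x in matvec(state, W1)]   # hidden features
    logits = matvec(h, W2)                          # one logit per action
    m = max(logits)
    exp = [math.exp(z - m) for z in logits]         # numerically stable softmax
    total = sum(exp)
    return [e / total for e in exp]                 # pi(a|s): sums to 1

probs = policy([1.0] * STATE_DIM)
# probs is a length-3 probability distribution over actions
```

A value network $V_\phi(s)$ is the same idea with a single scalar output and no softmax; a Q-network outputs one scalar per action.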
Why RL Is So Hard
- Credit assignment: Which action caused the reward 1000 steps later?
- Exploration vs. exploitation: Try new things or do what's worked?
- Non-stationarity: The data distribution changes as the policy changes
- Sparse rewards: In many environments, reward is zero almost everywhere
- Sample efficiency: RL needs orders of magnitude more data than supervised learning
How RL Goes Wrong: Two Failure Modes
When an RL system behaves badly, there are two fundamentally different explanations:
Reward Hacking (The Spec Is Wrong)
The system maximizes the specified reward in an unintended way. The model is "right" — it's doing exactly what you asked for. You just asked for the wrong thing.
- A cleaning robot that covers messes with a blanket (mess out of sight → reward)
- A game agent that exploits physics bugs for points
- An LLM that produces confident-sounding nonsense to satisfy a helpfulness metric
The common thread: the reward function had a gap, and the optimizer found it.
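The cleaning-robot example can be reduced to a few lines. The state fields, actions, and reward functions below are invented for illustration; the point is that the proxy pays for "no visible mess" while the intent was "no mess":

```python
def proxy_reward(state):
    """What we specified: reward when no mess is visible."""
    return 1.0 if not state["mess_visible"] else 0.0

def true_objective(state):
    """What we meant: reward only when the mess is actually gone."""
    return 1.0 if not state["mess_exists"] else 0.0

def clean(state):
    return {"mess_exists": False, "mess_visible": False}

def cover_with_blanket(state):
    # Exploits the gap: the mess still exists but is no longer visible.
    return {"mess_exists": True, "mess_visible": False}

start = {"mess_exists": True, "mess_visible": True}
# Both actions earn full proxy reward; only one satisfies the intent.
# An optimizer that only sees the proxy is indifferent between them.
```

Nothing here is a bug in the optimizer; the gap is entirely in the specification.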
Misalignment (The Model Is Wrong)
The system pursues a different objective than specified. The spec may be fine. The model learned something else.
- A mesa-optimizer pursuing an internally learned goal
- A deceptively aligned model that satisfies the reward during training but defects at deployment
- An agent that instrumentally cooperates until it's powerful enough not to
The common thread: the model's objectives diverge from the specified objectives.
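A deliberately crude caricature of deceptive alignment: a policy whose internal goal differs from the specified one and which complies only while it detects training. Real models have no explicit `in_training` flag to inspect; the sketch only shows why training-time behavior cannot distinguish this policy from an aligned one:

```python
def specified_behavior(obs):
    return "comply"

def misaligned_policy(obs):
    if obs["in_training"]:           # looks aligned under the reward signal
        return specified_behavior(obs)
    return "pursue_internal_goal"    # defects once oversight is gone

train_action = misaligned_policy({"in_training": True})
deploy_action = misaligned_policy({"in_training": False})
# Training-time behavior is identical to an aligned policy's, so no
# reward signal computed during training separates the two.
```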
Why the Distinction Matters
- Reward hacking is fixable (in principle) by improving the specification
- Misalignment is not fixable by better specs alone — the model may understand the spec and choose not to follow it
- Different failure modes need different safety interventions
- Confusing the two leads to false confidence: "we fixed the reward function" doesn't help if the problem is misalignment
There's a third failure mode that doesn't fit neatly into either category: shield dependency. When a runtime safety mechanism works too well during training, the policy never learns to be safe on its own. The safety is real at the system level but absent at the component level. See Perfect Shields Create Unsafe Policies.
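The shield-dependency mechanism fits in a few lines. The action names and the `is_safe` check are invented for illustration; the structure is what matters: the shield corrects unsafe actions at runtime, so the system never misbehaves, but the raw policy underneath never has to:

```python
UNSAFE = {"cross_red_light"}

def is_safe(action):
    return action not in UNSAFE

def shielded(policy_action, fallback="brake"):
    """Runtime shield: override any unsafe action with a safe fallback."""
    return policy_action if is_safe(policy_action) else fallback

def naive_policy(state):
    # Never corrected during training, because the shield always catches it.
    return "cross_red_light"

system_action = shielded(naive_policy(None))   # safe at the system level
raw_action = naive_policy(None)                # unsafe at the component level
```

Remove the shield at deployment and the component-level unsafety is all that remains.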