Prior reading: Gradient Descent and Backpropagation | Loss Functions and Spaces
The Setup
An agent takes actions in an environment, receives rewards, and learns a policy $\pi(a|s)$ that maximizes expected cumulative reward.
$$\max_\pi \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^T \gamma^t r_t\right]$$
where the expectation is over trajectories $\tau$ induced by the policy, and $\gamma \in [0, 1)$ discounts future rewards.
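The objective above can be sketched as a plain interaction loop. `env` and `policy` are hypothetical stand-ins for any Gym-style environment and any state-to-action function:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

def run_episode(env, policy, max_steps=1000):
    """Collect one trajectory: the data RL learns from."""
    s = env.reset()
    rewards = []
    for _ in range(max_steps):
        a = policy(s)                 # sample action from pi(a|s)
        s, r, done = env.step(a)      # environment transitions, emits reward
        rewards.append(r)
        if done:
            break
    return rewards

# discounted_return([1.0, 1.0, 1.0], gamma=0.5) = 1 + 0.5 + 0.25 = 1.75
```

RL then adjusts the policy's parameters so that trajectories with high discounted return become more likely.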
Where Are the Neural Nets?
Neural networks serve as function approximators for things that are too complex to represent exactly:
- Policy network: $\pi_\theta(a|s)$ — maps states to action probabilities
- Value network: $V_\phi(s)$ — estimates expected future reward from a state
- Q-network: $Q_\psi(s,a)$ — estimates expected future reward from a state-action pair
- World model (optional): $p_\omega(s'|s,a)$ — predicts next state
Without neural nets, RL only works for tiny state spaces (tabular methods). Neural nets let it scale to images, language, and continuous control.
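To make "function approximator" concrete, here is a toy policy network $\pi_\theta(a|s)$: one hidden layer mapping a state vector to a distribution over discrete actions. The sizes, random weights, and architecture are arbitrary choices made for the sketch, not a recommended design:

```python
import math
import random

random.seed(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 3

# theta: randomly initialized weight matrices (training would adjust these)
W1 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(STATE_DIM)]
W2 = [[random.gauss(0, 0.1) for _ in range(N_ACTIONS)] for _ in range(HIDDEN)]

def matvec(v, W):
    """Multiply row vector v by matrix W."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def policy(state):
    h = [math.tanh(x) for x in matvec(state, W1)]   # hidden features
    logits = matvec(h, W2)                          # one logit per action
    m = max(logits)
    exp = [math.exp(z - m) for z in logits]         # numerically stable softmax
    total = sum(exp)
    return [e / total for e in exp]                 # pi(a|s): sums to 1

probs = policy([1.0] * STATE_DIM)
# probs is a length-3 probability distribution over actions
```

A value network $V_\phi(s)$ is the same idea with a single scalar output and no softmax; a Q-network outputs one scalar per action.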
Why RL Is So Hard
- Credit assignment: Which action caused the reward 1000 steps later?
- Exploration vs. exploitation: Try new things or do what's worked?
- Non-stationarity: The data distribution changes as the policy changes
- Sparse rewards: In many environments, reward is zero almost everywhere
- Sample efficiency: RL needs orders of magnitude more data than supervised learning
How RL Goes Wrong: Two Failure Modes
When an RL system behaves badly, there are two fundamentally different explanations:
Reward Hacking (The Spec Is Wrong)
The system maximizes the specified reward in an unintended way. The model is "right" — it's doing exactly what you asked for. You just asked for the wrong thing.
- A cleaning robot that covers messes with a blanket (mess out of sight → reward)
- A game agent that exploits physics bugs for points
- An LLM that produces confident-sounding nonsense to satisfy a helpfulness metric
The common thread: the reward function had a gap, and the optimizer found it.
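The cleaning-robot example can be reduced to a few lines. The state fields, actions, and reward functions below are invented for illustration; the point is that the proxy pays for "no visible mess" while the intent was "no mess":

```python
def proxy_reward(state):
    """What we specified: reward when no mess is visible."""
    return 1.0 if not state["mess_visible"] else 0.0

def true_objective(state):
    """What we meant: reward only when the mess is actually gone."""
    return 1.0 if not state["mess_exists"] else 0.0

def clean(state):
    return {"mess_exists": False, "mess_visible": False}

def cover_with_blanket(state):
    # Exploits the gap: the mess still exists but is no longer visible.
    return {"mess_exists": True, "mess_visible": False}

start = {"mess_exists": True, "mess_visible": True}
# Both actions earn full proxy reward; only one satisfies the intent.
# An optimizer that only sees the proxy is indifferent between them.
```

Nothing here is a bug in the optimizer; the gap is entirely in the specification.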
Misalignment (The Model Is Wrong)
The system pursues a different objective than specified. The spec may be fine. The model learned something else.
- A mesa-optimizer pursuing an internally learned goal
- A deceptively aligned model that satisfies the reward during training but defects at deployment
- An agent that instrumentally cooperates until it's powerful enough not to
The common thread: the model's objectives diverge from the specified objectives.
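A deliberately crude caricature of deceptive alignment: a policy whose internal goal differs from the specified one and which complies only while it detects training. Real models have no explicit `in_training` flag to inspect; the sketch only shows why training-time behavior cannot distinguish this policy from an aligned one:

```python
def specified_behavior(obs):
    return "comply"

def misaligned_policy(obs):
    if obs["in_training"]:           # looks aligned under the reward signal
        return specified_behavior(obs)
    return "pursue_internal_goal"    # defects once oversight is gone

train_action = misaligned_policy({"in_training": True})
deploy_action = misaligned_policy({"in_training": False})
# Training-time behavior is identical to an aligned policy's, so no
# reward signal computed during training separates the two.
```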
Why the Distinction Matters
- Reward hacking is fixable (in principle) by improving the specification
- Misalignment is not fixable by better specs alone — the model may understand the spec and choose not to follow it
- Different failure modes need different safety interventions
- Confusing the two leads to false confidence: "we fixed the reward function" doesn't help if the problem is misalignment
There's a third failure mode that doesn't fit neatly into either category: shield dependency. When a runtime safety mechanism works too well during training, the policy never learns to be safe on its own. The safety is real at the system level but absent at the component level. See Perfect Shields Create Unsafe Policies.
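The shield-dependency mechanism fits in a few lines. The action names and the `is_safe` check are invented for illustration; the structure is what matters: the shield corrects unsafe actions at runtime, so the system never misbehaves, but the raw policy underneath never has to:

```python
UNSAFE = {"cross_red_light"}

def is_safe(action):
    return action not in UNSAFE

def shielded(policy_action, fallback="brake"):
    """Runtime shield: override any unsafe action with a safe fallback."""
    return policy_action if is_safe(policy_action) else fallback

def naive_policy(state):
    # Never corrected during training, because the shield always catches it.
    return "cross_red_light"

system_action = shielded(naive_policy(None))   # safe at the system level
raw_action = naive_policy(None)                # unsafe at the component level
```

Remove the shield at deployment and the component-level unsafety is all that remains.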