Prior reading: What Is RL? | Layers of Safety | The Specification Problem

Why This Post Exists

Other posts in this blog cover how safety mechanisms fail (RLHF backfires, capability elicitation, CoT hackability). This post steps back and surveys what the mechanisms actually are, what assumptions they rest on, and where the known failure modes live. You need to understand the toolbox before you can understand why the tools break.


Part I: Learning from Human Feedback

RLHF (Reinforcement Learning from Human Feedback)

What it does: Train a reward model on human preference rankings ("response A is better than response B"), then use RL (usually PPO) to optimize the language model's policy against that reward model.

The pipeline:

  1. Collect human comparisons on model outputs
  2. Train a reward model $R_\phi(x, y)$ to predict human preference
  3. Optimize $\pi_\theta$ to maximize $R_\phi$ with a KL penalty to stay close to the base model:

$$\max_\theta \, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta} \left[ R_\phi(x, y) - \beta \, \text{KL}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$
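To make the objective concrete, here is a minimal sketch of the per-sample quantity being maximized, using the common single-sample KL estimate $\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x)$. All numbers are illustrative, not from any real model:

```python
def kl_penalized_reward(reward, logprob_policy, logprob_ref, beta=0.1):
    """One sample of the RLHF objective: reward minus a KL penalty.

    Uses the per-sample KL estimate log pi_theta(y|x) - log pi_ref(y|x);
    beta controls how strongly the policy is tethered to the reference model.
    """
    kl_estimate = logprob_policy - logprob_ref
    return reward - beta * kl_estimate

# Toy illustration: a response the reward model scores slightly higher but
# that drifts far from the reference model loses out once the penalty bites.
close = kl_penalized_reward(reward=1.0, logprob_policy=-2.0, logprob_ref=-2.1)
drifted = kl_penalized_reward(reward=1.2, logprob_policy=-2.0, logprob_ref=-8.0)
# close (0.99) beats drifted (0.6) despite its lower raw reward
```

The sketch also shows the bluntness noted below: the penalty only sees total log-probability drift, not whether the drift happened in a safety-critical region.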

Assumptions:

  • Human preferences are consistent and well-calibrated
  • The reward model generalizes from the comparison dataset to new inputs
  • The KL penalty prevents the policy from exploiting reward model errors

Where it breaks:

  • Reward model is a learned proxy → Goodhart's Law applies (→ see p-hacking post)
  • Human raters disagree, are inconsistent, and have biases
  • Value attribution problem: scalar reward doesn't explain why something was bad (→ see when-safety-training-backfires post)
  • KL penalty is a blunt instrument — it constrains total divergence, not divergence in safety-critical regions
  • Reward hacking: the model finds outputs that score high on $R_\phi$ without being genuinely better

DPO (Direct Preference Optimization)

What it does: Skip the reward model entirely. Reparameterize the RLHF objective so you can optimize directly on preference pairs using a classification loss.

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

where $y_w$ is the preferred response and $y_l$ is the dispreferred one.
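The loss for a single preference pair can be sketched directly from the formula above; the log-probabilities here are made-up values, not outputs of a real model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are log pi_theta(y|x); ref_logp_* are log pi_ref(y|x).
    Each response's implicit reward is beta times its policy-vs-reference
    log-ratio; the loss is logistic regression on the reward margin.
    """
    implicit_reward_w = beta * (logp_w - ref_logp_w)
    implicit_reward_l = beta * (logp_l - ref_logp_l)
    return -math.log(sigmoid(implicit_reward_w - implicit_reward_l))

# When the policy raises the preferred response's log-ratio above the
# dispreferred one's, the loss drops below log(2), the chance baseline.
good = dpo_loss(logp_w=-3.0, logp_l=-5.0, ref_logp_w=-4.0, ref_logp_l=-4.0)
bad = dpo_loss(logp_w=-5.0, logp_l=-3.0, ref_logp_w=-4.0, ref_logp_l=-4.0)
```

Note that the "reward model" never disappears; it is implicit in the log-ratios, which is why the same proxy failure modes carry over.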

Advantages over RLHF:

  • No reward model to train, maintain, or Goodhart
  • Simpler pipeline — supervised learning, not RL
  • More stable training (no PPO instabilities)

Disadvantages:

  • Still relies on quality of human preference data
  • The implicit reward model may be less expressive than an explicit one
  • Offline method — can't iteratively improve based on the model's own outputs during training
  • Same value attribution problem: pairwise preference doesn't say why one response is better

Where it breaks: Same fundamental issues as RLHF — the preferences are still a proxy, and the model can still learn to satisfy the proxy rather than the intent.

KTO (Kahneman-Tversky Optimization)

What it does: Learns from binary "good/bad" labels on individual outputs rather than pairwise comparisons. Inspired by prospect theory — treats gains and losses asymmetrically.

Why it matters: Pairwise comparisons are expensive to collect. Thumbs-up/thumbs-down is cheap. If the alignment signal can come from simpler feedback, alignment scales better.

Trade-off: Less information per data point, but more data points. Whether this trade-off is favorable depends on the task.


Part II: Constitutional and Rule-Based Approaches

Constitutional AI (CAI)

What it does: Replace some human feedback with model self-critique. Give the model a "constitution" — a set of principles — and have it:

  1. Generate a response
  2. Critique its own response against the principles
  3. Revise the response
  4. Train on the self-revised outputs (RLAIF — RL from AI Feedback)
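The control flow of steps 1–3 can be sketched as follows. `call_model` is a hypothetical stand-in for an LLM API, stubbed here with canned strings so the loop is runnable; the constitution and responses are invented for illustration:

```python
# Sketch of the Constitutional AI critique-revise loop (steps 1-3 above).
# `call_model` is a hypothetical stub, not a real API.
CONSTITUTION = [
    "Do not reveal personal information.",
    "Do not assist with clearly harmful requests.",
]

def call_model(prompt):
    # Stub: a real implementation would query a language model here.
    if "Critique" in prompt:
        return "The draft leaks an email address, violating principle 1."
    if "Revise" in prompt:
        return "Here is general contact advice, with the email address removed."
    return "You can reach her at alice@example.com."

def constitutional_revision(user_prompt):
    draft = call_model(user_prompt)                      # 1. generate
    for principle in CONSTITUTION:
        critique = call_model(
            f"Critique this response against: {principle}\n{draft}")  # 2. critique
        draft = call_model(
            f"Revise the response to address: {critique}\n{draft}")   # 3. revise
    return draft  # revised outputs become the RLAIF training set (step 4)
```

The key design point: the same model plays generator, critic, and reviser, which is exactly why shared blind spots are a concern (see below).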

Assumptions:

  • The model can meaningfully evaluate its own outputs against abstract principles
  • The constitution captures what we actually want (specification problem again)
  • Self-critique doesn't have the same blind spots as the original generation

Where it breaks:

  • The model critiques using the same representations that generated the problematic output — it may share the same blind spots
  • Constitutional principles are natural language → ambiguous, context-dependent
  • Self-play can converge on outputs that satisfy the letter of the constitution while violating its spirit
  • Who writes the constitution? That's a governance question, not a technical one

Rule-Based Reward Models (RBRMs)

What it does: Replace learned reward models with programmatic rules. "If the output contains personal information, penalize. If it refuses a benign request, penalize."

Advantages: Transparent, auditable, no reward model Goodharting.

Disadvantages: Rigid. Can't handle nuance. The rules are a specification — and the specification problem applies fully.
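A minimal rule-based reward function in the spirit described above might look like this. The rules, patterns, and penalty weights are illustrative assumptions, not any production system's:

```python
import re

# Illustrative rules only: (name, predicate, penalty). Real RBRMs would use
# far more careful detectors than these toy heuristics.
RULES = [
    ("contains_email", lambda t: bool(re.search(r"\b\S+@\S+\.\S+\b", t)), -1.0),
    ("blanket_refusal", lambda t: t.strip().lower().startswith("i can't help"), -0.5),
]

def rule_based_reward(text):
    """Sum the penalties for every rule the output violates; 0.0 means clean."""
    return sum(penalty for _, predicate, penalty in RULES if predicate(text))

rule_based_reward("Contact me at bob@example.com")  # -1.0
rule_based_reward("Here is how to boil an egg.")    # 0.0
```

The transparency is real (every penalty is inspectable), and so is the rigidity: the email regex above will both miss obfuscated addresses and flag benign examples, which is the specification problem in miniature.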


Part III: Scalable Oversight

The core problem: as models become more capable than their overseers, how do you provide a training signal you can trust?

Debate

What it does: Two AI agents argue opposing sides of a question. A human judge (who may not understand the full answer) decides who wins. The hope: truth has a structural advantage in debate because a lie can be exposed by the opponent.

Assumptions:

  • Truth is easier to defend than falsehood (game-theoretically, honest debate strategies dominate)
  • The human judge can evaluate arguments even if they can't generate the answer
  • The debate format doesn't reward rhetorical skill over correctness

Where it breaks:

  • Empirically uncertain whether truth actually has the structural advantage in all domains
  • Both debaters may collude to mislead the judge
  • Complex technical topics may be genuinely unjudgeable by a non-expert human
  • The judge becomes the bottleneck — and the judge is a human with all the usual biases

Iterated Distillation and Amplification (IDA)

What it does: Bootstraps oversight through a recursive process:

  1. A human + AI assistant solves tasks the human alone couldn't
  2. This "amplified" problem-solver trains a new AI
  3. The new AI assists the human on harder tasks
  4. Repeat

The idea: Each iteration produces a slightly more capable and aligned system, bootstrapping from human values through a chain of verified steps.
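The loop above can be caricatured with scalar "capability" numbers. This is a deliberately crude model, every quantity and coefficient is an assumption, but it makes the bootstrapping dynamic (and the fidelity-loss concern below) concrete:

```python
# Toy scalar model of the IDA loop. Capability is not actually a scalar;
# the 0.5 amplification gain and 0.95 distillation retention are assumptions.
def amplify(human_skill, assistant_skill):
    """Human + assistant outperform the human alone (by assumption)."""
    return human_skill + 0.5 * assistant_skill

def distill(amplified_skill, retention=0.95):
    """Each distillation step loses a little fidelity (by assumption)."""
    return retention * amplified_skill

def ida(human_skill=1.0, iterations=4):
    assistant = 0.0
    for _ in range(iterations):
        assistant = distill(amplify(human_skill, assistant))
    return assistant
```

In this toy model the assistant's capability grows each round but converges to a ceiling set by the per-step fidelity loss, a scalar shadow of the error-accumulation worry listed under "Where it breaks."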

Assumptions:

  • Alignment is preserved at each distillation step
  • The amplification procedure doesn't introduce systematic errors that compound
  • The human remains a meaningful part of the loop (not just rubber-stamping)

Where it breaks:

  • Error accumulation across iterations — small alignment errors compound
  • At some point the human cannot verify the amplified system's reasoning (the whole problem we started with)
  • Assumes a clean decomposition of tasks into subtasks, which may not exist for open-ended reasoning

Recursive Reward Modeling

What it does: Use AI to help humans evaluate AI outputs, then use those evaluations to train the next model. Similar to IDA but focused specifically on the reward signal.

The risk: If the AI helper is subtly biased, it corrupts the reward signal, which corrupts the next model, which corrupts the next reward signal. The loop amplifies errors.


Part IV: Process-Based Approaches

Process Reward Models (PRMs)

What it does: Instead of rewarding only the final answer (outcome-based), reward each reasoning step. A PRM scores intermediate steps, not just conclusions.
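The contrast with outcome-based reward can be sketched as follows. The step scores are hard-coded stand-ins for a trained PRM's outputs, and the min-aggregation is one common choice among several (min, product, mean):

```python
# Sketch of outcome vs. process reward. `step_scores` would come from a
# trained PRM; here they are hard-coded. Aggregating by min is an
# illustrative choice: it scores a chain by its weakest step.
def outcome_reward(final_answer_correct):
    return 1.0 if final_answer_correct else 0.0

def process_reward(step_scores):
    """One bad step sinks the whole chain under min-aggregation."""
    return min(step_scores)

# A chain that reaches the right answer through a flawed middle step:
# the outcome reward cannot see the flaw; the process reward can.
outcome = outcome_reward(final_answer_correct=True)  # 1.0
process = process_reward([0.9, 0.2, 0.95])           # 0.2
```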

Advantages:

  • Catches errors early in the reasoning chain
  • Provides richer gradient signal
  • Harder to hack in principle: the model can't skip to a "good-looking" final answer without intermediate steps that also score well

Disadvantages:

  • Requires labeled intermediate steps (expensive)
  • What counts as a "correct step" may be ambiguous
  • A model can learn to produce steps that look valid to the PRM without being logically sound (→ see CoT hackability post)

Outcome vs. Process: The Trade-Off

Outcome-based: easier to label, more Goodhartable, rewards hacky shortcuts. Process-based: harder to label, more robust, but requires solving "what is a good reasoning step?" which is its own alignment problem.


Part V: The Meta-Picture

What All These Approaches Share

Every alignment technique is a feedback loop:

  1. Model produces output
  2. Something evaluates the output (human, reward model, AI critic, rules)
  3. Model updates toward higher-evaluated outputs

The differences lie in what does the evaluating and how rich the signal is. But they all face:

  • The specification problem: The evaluator implements a proxy for what we actually want
  • Goodhart's Law: Optimizing hard against any proxy degrades the proxy
  • Scalable oversight: As models exceed human capability, evaluation quality degrades

No Single Technique Is Sufficient

The layers-of-safety post (→ see layers-of-safety) argues safety comes from defense in depth. The same applies to alignment techniques:

  • RLHF/DPO for broad behavioral shaping
  • Constitutional AI for principle-level guardrails
  • Process reward models for reasoning quality
  • Formal verification for hard constraints (→ see formal methods section)
  • Runtime monitoring for deployment safety

No single technique solves alignment. The question is whether the stack is robust enough.

The Honest Assessment

Current alignment techniques are behavioral — they shape what the model does, not what it is. They train surface behavior without guarantees about internal structure. This is why mesa-optimization (→ see mesa-and-optimization-lenses) and deceptive alignment remain concerns even after extensive alignment training. Solving alignment at the representation level — not just the behavior level — is the open problem.