A Survey of Alignment Techniques and Their Trade-Offs

Prior reading: What Is RL? | Layers of Safety | The Specification Problem

Why This Post Exists

Other posts in this blog cover how safety mechanisms fail (RLHF backfires, capability elicitation, CoT hackability). This post steps back and surveys what the mechanisms actually are, what assumptions they rest on, and where the known failure modes live. You need to understand the toolbox before you can understand why the tools break.

Part I: Learning from Human Feedback

RLHF (Reinforcement Learning from Human Feedback)

What it does: Train a reward model on human preference rankings ("response A is better than response B"), then use RL (usually PPO) to optimize the language model's policy against that reward model. ...
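As a concrete illustration of the reward-model step described above, here is a minimal sketch of training on pairwise preferences with a Bradley-Terry style loss. This is not the post's own code: names like `RewardModel` and the pooled-embedding inputs are hypothetical stand-ins, assuming a PyTorch setup.

```python
# Minimal sketch: reward model trained on pairwise human preferences.
# The learned scalar reward is what PPO later optimizes the policy against.
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    """Stand-in: maps a pooled response embedding to a scalar reward."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = torch.nn.Linear(hidden_size, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.score(pooled).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # "Response A is better than response B" becomes: push the reward of the
    # chosen response above the rejected one via -log sigmoid(r_A - r_B).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical usage on a batch of (chosen, rejected) pooled embeddings:
model = RewardModel()
chosen = torch.randn(8, 768)    # embeddings of preferred responses
rejected = torch.randn(8, 768)  # embeddings of dispreferred responses
loss = preference_loss(model(chosen), model(rejected))
loss.backward()  # after training, the reward model is frozen and used as the RL reward
```

The pairwise form matters: annotators only rank responses, so the reward model never sees an absolute score, only which of two outputs was preferred.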

October 15, 2025 · 7 min · Austin T. O'Quinn