Depth-Robust Safety: What Happens When You Truncate a Language Model

This isn't a jailbreak. Nobody is crafting adversarial prompts. It's an engineer deploying a model efficiently, and maybe, without realizing it, stripping out the safety training. The question is what happens to that safety training when the model is truncated. Early exit is the idea that you can speed up a language model by stopping computation partway through the network. A model like Mistral-7B has 32 layers. If the model is "confident enough" by layer 27, you skip the last 5 layers and save roughly 15% of the compute. Systems like CALM and LayerSkip use this in production, skipping 30–50% of layers while maintaining quality. ...
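The early-exit loop described above can be sketched in a few lines. This is a toy illustration only, assuming a stack of per-layer functions and some scalar confidence estimate; none of the names below come from CALM or LayerSkip.

```python
# Minimal sketch of early-exit inference (illustrative, not a real model).

def early_exit_forward(layers, confidence, x, threshold):
    """Apply layers in order; stop as soon as confidence(x) crosses threshold."""
    for used, layer in enumerate(layers, start=1):
        x = layer(x)
        if confidence(x) >= threshold:
            return x, used  # exited early: skipped len(layers) - used layers
    return x, len(layers)

# Toy demo: 32 "layers" each increment a counter; "confidence" is the
# fraction of the way toward an arbitrary target of 32.
layers = [lambda v: v + 1 for _ in range(32)]
out, used = early_exit_forward(layers, lambda v: v / 32, 0, threshold=0.25)
print(used)  # exits after 8 of 32 layers
```

The safety-relevant point is visible in the structure: whatever the last layers would have contributed is simply never computed once the threshold is crossed.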

April 2, 2026 · 8 min · Austin T. O'Quinn

Safety Training as Capability Elicitation

Prior reading: When Safety Training Backfires | Probing. The Paradox: To refuse a dangerous request, a model must first understand what's being asked well enough to recognize it as dangerous. Training a model to filter bioweapon synthesis queries requires the model to sharpen its internal representation of bioweapon synthesis, not blur it. The safety mechanism is a drug-sniffing dog: you have to teach it what drugs smell like. The Mechanism: Consider what refusal training actually does in representation space. Before safety fine-tuning, a model may have a vague, diffuse representation of some dangerous domain: enough to generate mediocre outputs if prompted, but not deeply structured. ...

November 12, 2025 · 4 min · Austin T. O'Quinn

When Safety Training Backfires

Prior reading: A Survey of Alignment Techniques | Probing. The Setup: RL fine-tuning from human feedback (RLHF) works by rewarding responses humans prefer and penalizing ones they don't. Usually this improves behavior. But sometimes suppressing a true but inappropriate response teaches the model the wrong lesson. The Problem: Value Attribution Is Hard. When a human rates a response poorly, the training signal says "this was bad." It doesn't say why it was bad. Consider sensitive but objective topics, such as crime statistics disaggregated by race. A response citing real data might be rated poorly because: ...

October 29, 2025 · 4 min · Austin T. O'Quinn

A Survey of Alignment Techniques and Their Trade-Offs

Prior reading: What Is RL? | Layers of Safety | The Specification Problem. Why This Post Exists: Other posts in this blog cover how safety mechanisms fail (RLHF backfires, capability elicitation, CoT hackability). This post steps back and surveys what the mechanisms actually are, what assumptions they rest on, and where the known failure modes live. You need to understand the toolbox before you can understand why the tools break. Part I: Learning from Human Feedback. RLHF (Reinforcement Learning from Human Feedback). What it does: Train a reward model on human preference rankings ("response A is better than response B"), then use RL (usually PPO) to optimize the language model's policy against that reward model. ...
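The reward-model step of the RLHF recipe above is usually trained with a Bradley-Terry pairwise loss: the probability that the chosen response outranks the rejected one is a sigmoid of their reward gap. The sketch below is a standard formulation, not code from the post.

```python
import math

# Bradley-Terry pairwise preference loss for reward-model training:
#   P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)
# and we minimize the negative log-likelihood of that probability.

def preference_loss(reward_chosen, reward_rejected):
    gap = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# The loss shrinks as the reward model widens the gap in the right direction:
print(preference_loss(2.0, 0.0))  # small: model agrees with the human ranking
print(preference_loss(0.0, 2.0))  # large: model disagrees with the ranking
```

Note the property the post's later failure modes hinge on: the loss only sees *which* response was preferred, never *why*, so any feature correlated with the ranking can absorb the gradient.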

October 15, 2025 · 7 min · Austin T. O'Quinn