Depth-Robust Safety: What Happens When You Truncate a Language Model

This isn't a jailbreak. Nobody is crafting adversarial prompts. It's an engineer deploying a model efficiently — and maybe, without realizing it, stripping out the safety training. The question is whether safety behavior survives truncation. Early exit is the idea that you can speed up a language model by stopping computation partway through the network. A model like Mistral-7B has 32 layers. If the model is "confident enough" by layer 27, you skip the last 5 layers and save roughly 15% of the compute. Systems like CALM and LayerSkip use this in production, skipping 30–50% of layers while maintaining quality. ...

April 2, 2026 · 8 min · Austin T. O'Quinn
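The early-exit scheme the excerpt describes can be sketched in a few lines. This is a toy, not CALM or LayerSkip: the layers, the identity LM head, and the 0.9 confidence threshold are all invented for illustration. The point is only the control flow — run layers in order, check the intermediate prediction's confidence, and skip the remaining layers once it clears the threshold.

```python
# Toy sketch of confidence-based early exit (illustrative only).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_forward(hidden, layers, lm_head, threshold=0.9):
    """Run layers in order; exit once the intermediate prediction is confident."""
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(lm_head(hidden))
        if max(probs) >= threshold:
            return depth, probs  # exited after `depth` layers, skipping the rest
    return len(layers), probs

# Tiny fake 32-layer model: each layer nudges the hidden state toward token 0.
layers = [lambda h: [x + 0.5 if j == 0 else x for j, x in enumerate(h)]
          for _ in range(32)]
lm_head = lambda h: h  # identity head, purely for the sketch

exit_depth, probs = early_exit_forward([0.0, 0.0, 0.0], layers, lm_head)
# exit_depth is well below 32: most layers are never executed.
```

Any computation that lives only in the skipped layers — which, per the post's thesis, may include safety-relevant behavior — simply never runs.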

When Safety Training Backfires

Prior reading: A Survey of Alignment Techniques | Probing The Setup Reinforcement learning from human feedback (RLHF) works by rewarding responses humans prefer and penalizing ones they don't. Usually this improves behavior. But sometimes suppressing a true but inappropriate response teaches the model the wrong lesson. The Problem: Value Attribution Is Hard When a human rates a response poorly, the training signal says "this was bad." It doesn't say why it was bad. Consider sensitive but objective topics — crime statistics disaggregated by race, for instance. A response citing real data might be rated poorly because: ...

October 29, 2025 · 4 min · Austin T. O'Quinn

A Survey of Alignment Techniques and Their Trade-Offs

Prior reading: What Is RL? | Layers of Safety | The Specification Problem Why This Post Exists Other posts in this blog cover how safety mechanisms fail (RLHF backfires, capability elicitation, CoT hackability). This post steps back and surveys what the mechanisms actually are, what assumptions they rest on, and where the known failure modes live. You need to understand the toolbox before you can understand why the tools break. Part I: Learning from Human Feedback RLHF (Reinforcement Learning from Human Feedback) What it does: Train a reward model on human preference rankings ("response A is better than response B"), then use RL (usually PPO) to optimize the language model's policy against that reward model. ...

October 15, 2025 · 7 min · Austin T. O'Quinn
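The reward-model step the survey describes — training on pairwise rankings before any PPO happens — can be sketched with the standard Bradley-Terry pairwise loss, L = -log σ(r(chosen) - r(rejected)). This toy uses a linear reward model and plain gradient descent; real systems use a neural reward model and feature representations, so everything here beyond the loss itself is an invented stand-in.

```python
# Minimal sketch of reward-model training on preference pairs (illustrative).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, features):
    """Linear stand-in for a learned reward model."""
    return sum(wi * fi for wi, fi in zip(w, features))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """pairs: list of (chosen_features, rejected_features) from annotators."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            grad = sigmoid(margin) - 1.0  # d/dmargin of -log sigmoid(margin)
            for j in range(dim):
                w[j] -= lr * grad * (chosen[j] - rejected[j])
    return w

# Toy data: annotators preferred responses high in feature 0, low in feature 1.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.9])]
w = train_reward_model(pairs, dim=2)
```

Note what the loss sees: only that one response beat another, never why — which is exactly the value-attribution gap the "When Safety Training Backfires" post builds on.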

Mesa-Optimization and the Optimization Pressure Spectrum

Prior reading: Gradient Descent and Backpropagation | Decision Theory for AI Safety | What Is RL? The Question Why does an AI system behave the way it does? And why do optimizers keep creating sub-optimizers with misaligned goals? Three frameworks give different (complementary) answers. But first — some terminology that trips people up. Mesa vs. Meta I've seen these two confused often enough that it's worth being explicit. I had to look up what "mesa" even meant the first time I encountered it. ...

August 6, 2025 · 13 min · Austin T. O'Quinn

The Specification and Language Problem

Prior reading: What Are Formal Methods? | Model Checking and Formal Specification The Problem Formal methods can prove a system satisfies a specification. The hard part is writing the specification. "Be helpful and don't cause harm" is not a formal spec. Turning it into one requires resolving ambiguity, edge cases, and value judgments that humans can't even agree on in natural language. Specification as Translation Every spec is a translation from human intent to formal language. Translation is lossy. The gap between what we mean and what we write is the specification problem. ...

June 4, 2025 · 2 min · Austin T. O'Quinn