Depth-Robust Safety: What Happens When You Truncate a Language Model

This isn't a jailbreak. Nobody is crafting adversarial prompts. It's an engineer deploying a model efficiently — and maybe, without realizing it, stripping out the safety training. The question is what happens to safety when you truncate the network. Early exit is the idea that you can speed up a language model by stopping computation partway through the network. A model like Mistral-7B has 32 layers. If the model is "confident enough" by layer 27, you skip the last 5 layers and save about 15% of the compute. Systems like CALM and LayerSkip use this in production, skipping 30–50% of layers while maintaining quality. ...
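The exit rule the excerpt describes can be sketched in a few lines. This is a toy, pure-Python illustration, not CALM's or LayerSkip's actual criterion: the layers, readout, and threshold below are all invented for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_forward(x, layers, readout, threshold=0.9):
    """Run layers in order, exiting as soon as the intermediate
    prediction clears the confidence threshold."""
    probs = softmax(readout(x))
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        probs = softmax(readout(x))
        if max(probs) >= threshold:
            return probs, depth
    return probs, len(layers)

# Toy 6-"layer" model: each layer sharpens the hidden state, so the
# top class gets more probable with depth.
layers = [lambda x: [v * 1.8 for v in x]] * 6
readout = lambda h: h  # identity readout: hidden state doubles as logits
probs, depth = early_exit_forward([0.5, 0.25, 0.0], layers, readout)
print(f"exited after {depth} of {len(layers)} layers")
```

With these toy numbers the model exits after 4 of 6 layers. A real system would compute confidence from an intermediate hidden state via a trained exit classifier, which is exactly the depth at which the post's safety question bites.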

April 2, 2026 · 8 min · Austin T. O'Quinn

Geometric Similarity Is Blind to Computational Structure

This post starts with a simple question — how would you tell if two neural networks learned the same thing? — and builds to a case where the standard answer is dangerously wrong. How would you compare two networks? Suppose you train two neural networks on the same task from different random initializations, and both get 99% accuracy. Did they learn the same thing? You can't just compare the raw activation values. To see why, think about a simpler example. Imagine two spreadsheets tracking student performance. One has columns [math_score, reading_score]. The other has columns [total_score, score_difference]. Both contain the same information — you can convert between them with simple arithmetic — but the raw numbers look completely different. A student with (90, 80) in the first spreadsheet would be (170, 10) in the second. ...
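The spreadsheet analogy is exact enough to write down: the two column schemes are related by an invertible linear map, so each one recovers the other even though the raw numbers disagree. A minimal sketch (function names are illustrative):

```python
# The two "spreadsheets" are connected by an invertible linear map:
# (total, difference) = (math + reading, math - reading).
def to_total_diff(math_score, reading_score):
    return (math_score + reading_score, math_score - reading_score)

def to_math_reading(total, difference):
    return ((total + difference) / 2, (total - difference) / 2)

print(to_total_diff(90, 80))     # the (170, 10) student from the post
print(to_math_reading(170, 10))  # recovers (90.0, 80.0)
```

Comparing the tuples element-wise says the spreadsheets are unrelated; comparing them up to the linear map says they are identical. The post's question is which notion of "same" the standard similarity metrics actually compute.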

April 1, 2026 · 11 min · Austin T. O'Quinn

The AI Threat Landscape: What 'Safe' Means and What We're Afraid Of

Prior reading: Mesa-Optimization and Three Lenses | Game Theory for AI Safety Part I: What Does "Safe" Even Mean? "Make AI safe" is meaningless without specifying: safe for whom, against what threat, under what conditions? Who Is the User? Public: Lowest common denominator. Must handle naive, careless, and adversarial users simultaneously. Internal / enterprise: Can assume some training, access controls, and monitoring. Knowledgeable human: Researchers, developers. Different failure modes matter. Who Is the Adversary? No adversary: Accidental misuse, honest mistakes. The easiest case. Casual adversary: Jailbreaking for fun, social engineering. Medium difficulty. Sophisticated adversary: State actors, determined attackers with resources. The hard case. What Are We Protecting? Users from the model: Preventing harmful outputs. The model from users: Preventing extraction, manipulation, prompt injection. Society from the system: Preventing large-scale harms (economic disruption, disinfo). The future from the present: Preventing lock-in, power concentration, existential risk. Safety claims without a threat model are empty. A system "safe" for internal research may be wildly unsafe for public deployment. ...

January 21, 2026 · 4 min · Austin T. O'Quinn

Jailbreaking: Transference, Universality, and Why Defenses May Be Impossible

Prior reading: Safety as Capability Elicitation | Reachability Analysis | Platonic Forms What Is a Jailbreak? A jailbreak is an input that causes a safety-trained model to produce an output its safety training was designed to prevent. The refusal boundary is a decision surface in input space — a jailbreak is a point on the wrong side of it that the model fails to classify correctly. More precisely: if a model has been trained to refuse requests in some set $\mathcal{D}_{\text{dangerous}}$, a jailbreak is an input $x$ such that $x$ is semantically in $\mathcal{D}_{\text{dangerous}}$ but the model's refusal classifier maps it outside. ...

December 3, 2025 · 7 min · Austin T. O'Quinn

Safety Training as Capability Elicitation

Prior reading: When Safety Training Backfires | Probing The Paradox To refuse a dangerous request, a model must first understand what's being asked well enough to recognize it as dangerous. Training a model to filter bioweapon synthesis queries requires the model to sharpen its internal representation of bioweapon synthesis — not blur it. The safety mechanism is a drug-sniffing dog: you have to teach it what drugs smell like. The Mechanism Consider what refusal training actually does in representation space. Before safety fine-tuning, a model may have a vague, diffuse representation of some dangerous domain — enough to generate mediocre outputs if prompted, but not deeply structured. ...

November 12, 2025 · 4 min · Austin T. O'Quinn

When Safety Training Backfires

Prior reading: A Survey of Alignment Techniques | Probing The Setup Reinforcement learning from human feedback (RLHF) works by rewarding responses humans prefer and penalizing ones they don't. Usually this improves behavior. But sometimes suppressing a true but inappropriate response teaches the model the wrong lesson. The Problem: Value Attribution Is Hard When a human rates a response poorly, the training signal says "this was bad." It doesn't say why it was bad. Consider sensitive but objective topics — crime statistics disaggregated by race, for instance. A response citing real data might be rated poorly because: ...

October 29, 2025 · 4 min · Austin T. O'Quinn

A Survey of Alignment Techniques and Their Trade-Offs

Prior reading: What Is RL? | Layers of Safety | The Specification Problem Why This Post Exists Other posts in this blog cover how safety mechanisms fail (RLHF backfires, capability elicitation, CoT hackability). This post steps back and surveys what the mechanisms actually are, what assumptions they rest on, and where the known failure modes live. You need to understand the toolbox before you can understand why the tools break. Part I: Learning from Human Feedback RLHF (Reinforcement Learning from Human Feedback) What it does: Train a reward model on human preference rankings ("response A is better than response B"), then use RL (usually PPO) to optimize the language model's policy against that reward model. ...

October 15, 2025 · 7 min · Austin T. O'Quinn

P-Hacking as Optimization: Implications for Safety Benchmarking

Prior reading: Layers of Safety | The Specification and Language Problem P-Hacking Is Optimization When a researcher tries many analyses and reports the one that gives $p < 0.05$, they're optimizing over the space of statistical tests. The "loss function" is the p-value. The "training data" is the dataset. The result is overfitting to noise. The Structural Parallel to AI Safety AI safety benchmarks are evaluated the same way: design a safety evaluation, test your model, report results, and iterate on the model (or the eval) until the numbers look good. This is gradient descent on benchmark performance. Goodhart's Law applies: the benchmark becomes the target, and the metric diverges from the actual property you care about. ...
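The selection effect is easy to simulate. Under the null hypothesis a well-calibrated test yields a p-value distributed Uniform(0, 1), so "run k analyses and report the best" drives the false-positive rate from 5% toward 1 - 0.95^k. A minimal sketch (the uniform-p model is the standard idealization, not any particular benchmark):

```python
import random

random.seed(0)

def min_p_over_analyses(n_analyses):
    """Under the null, a calibrated test's p-value is Uniform(0, 1).
    A p-hacker runs n analyses and reports only the smallest p."""
    return min(random.random() for _ in range(n_analyses))

trials = 10_000
for k in (1, 5, 20):
    hits = sum(min_p_over_analyses(k) < 0.05 for _ in range(trials))
    print(f"{k:2d} analyses -> 'significant' on {hits / trials:.1%} of pure-noise datasets")
```

With one analysis the false-positive rate sits near the nominal 5%; with twenty it climbs toward 1 - 0.95^20 ≈ 64%, despite the data being pure noise. Swap "statistical test" for "benchmark run" and the structural parallel in the post follows.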

October 1, 2025 · 2 min · Austin T. O'Quinn

Chain of Thought Is Hackable (By the Model)

Prior reading: Mesa-Optimization and Three Lenses | Layers of Safety The Promise of CoT Chain-of-thought reasoning makes model thinking visible. If we can read the reasoning, we can check it for safety. Sounds good. The Problem The model generates its own chain of thought. It's not a window into the model's "real" reasoning — it's another output. The model can learn to produce reasoning traces that satisfy monitors while performing different internal computation. ...

September 17, 2025 · 2 min · Austin T. O'Quinn

Systems vs. Components: The Chinese Room and ROMs

Prior reading: Layers of Safety The Chinese Room Searle's argument: a person following rules to manipulate Chinese symbols doesn't understand Chinese, even if the system's outputs are indistinguishable from understanding. The component (person) lacks understanding; does the system have it? ROMs That Are Smarter Than Their Parts A read-only memory chip contains no intelligence. But a ROM storing a chess engine's entire game tree would play perfect chess. The system (ROM + lookup procedure) is "more intelligent" than its parts. ...
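The ROM claim can be made concrete with a game small enough to tabulate in full. Below, a hypothetical lookup table plays perfect 1-or-2-stone Nim (last stone taken wins): the table is inert data, yet table plus lookup loop plays optimally.

```python
# Precomputed "ROM" for 1-or-2-stone Nim (normal play: taking the last
# stone wins). Positions that are multiples of 3 are lost; for every
# other position the table stores the unique winning move.
ROM = {n: n % 3 for n in range(1, 6) if n % 3 != 0}

def play(stones):
    """The 'system': dumb lookup against inert stored data."""
    return ROM.get(stones)  # None means the position is already lost

print(play(4), play(5), play(3))
```

Neither the dict nor the one-line lookup "understands" Nim, but the composite never loses a winnable position — the same systems-versus-components gap the Chinese Room argues over.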

September 3, 2025 · 2 min · Austin T. O'Quinn