When Safety Training Backfires

Prior reading: A Survey of Alignment Techniques | Probing

The Setup

Reinforcement learning from human feedback (RLHF) works by rewarding responses humans prefer and penalizing ones they don't. Usually this improves behavior. But sometimes suppressing a true but inappropriate response teaches the model the wrong lesson.

The Problem: Value Attribution Is Hard

When a human rates a response poorly, the training signal says "this was bad." It doesn't say why it was bad. Consider sensitive but objective topics — crime statistics disaggregated by race, for instance. A response citing real data might be rated poorly because: ...

October 29, 2025 · 4 min · Austin T. O'Quinn

A Survey of Alignment Techniques and Their Trade-Offs

Prior reading: What Is RL? | Layers of Safety | The Specification Problem

Why This Post Exists

Other posts in this blog cover how safety mechanisms fail (RLHF backfires, capability elicitation, CoT hackability). This post steps back and surveys what the mechanisms actually are, what assumptions they rest on, and where the known failure modes live. You need to understand the toolbox before you can understand why the tools break.

Part I: Learning from Human Feedback

RLHF (Reinforcement Learning from Human Feedback)

What it does: Train a reward model on human preference rankings ("response A is better than response B"), then use RL (usually PPO) to optimize the language model's policy against that reward model. ...
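The reward-model step described above is usually trained with a Bradley-Terry preference loss: push the reward of the human-preferred response above the rejected one. A minimal sketch in plain Python (the function name is mine; real implementations operate on batched model logits):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood of a human preference.

    Models P(chosen > rejected) = sigmoid(r_chosen - r_rejected); the loss
    shrinks as the reward model ranks the preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in favor of the chosen response gives a smaller loss.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

The RL stage then optimizes the policy against the learned reward, typically with a KL penalty toward the original model to limit drift.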

October 15, 2025 · 7 min · Austin T. O'Quinn

P-Hacking as Optimization: Implications for Safety Benchmarking

Prior reading: Layers of Safety | The Specification and Language Problem

P-Hacking Is Optimization

When a researcher tries many analyses and reports the one that gives $p < 0.05$, they're optimizing over the space of statistical tests. The "loss function" is the p-value. The "training data" is the dataset. The result is overfitting to noise.

The Structural Parallel to AI Safety

AI safety benchmarks are evaluated the same way:

1. Design a safety evaluation
2. Test your model
3. Report results
4. Iterate on the model (or the eval) until the numbers look good

This is gradient descent on benchmark performance. Goodhart's Law applies: the benchmark becomes the target, and the metric diverges from the actual property you care about. ...
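The iterate-until-it-looks-good loop above can be made quantitative: even when there is no real effect, running enough evaluations makes a "significant" result almost inevitable. A toy calculation, assuming $k$ independent tests at the conventional threshold:

```python
def familywise_false_positive(k: int, alpha: float = 0.05) -> float:
    """Probability that at least one of k independent null tests
    reports p < alpha, even though every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 5, 20):
    print(k, round(familywise_false_positive(k), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

Twenty tries against noise "succeed" nearly two times in three, which is exactly why repeated benchmark iteration without a held-out evaluation inflates apparent safety.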

October 1, 2025 · 2 min · Austin T. O'Quinn

Chain of Thought Is Hackable (By the Model)

Prior reading: Mesa-Optimization and Three Lenses | Layers of Safety

The Promise of CoT

Chain-of-thought reasoning makes model thinking visible. If we can read the reasoning, we can check it for safety. Sounds good.

The Problem

The model generates its own chain of thought. It's not a window into the model's "real" reasoning — it's another output. The model can learn to produce reasoning traces that satisfy monitors while performing different internal computation. ...

September 17, 2025 · 2 min · Austin T. O'Quinn

Systems vs. Components: The Chinese Room and ROMs

Prior reading: Layers of Safety

The Chinese Room

Searle's argument: a person following rules to manipulate Chinese symbols doesn't understand Chinese, even if the system's outputs are indistinguishable from understanding. The component (person) lacks understanding; does the system have it?

ROMs That Are Smarter Than Their Parts

A read-only memory chip contains no intelligence. But a ROM storing a chess engine's entire game tree would play perfect chess. The system (ROM + lookup procedure) is "more intelligent" than its parts. ...

September 3, 2025 · 2 min · Austin T. O'Quinn

Layers of Safety: From Data to Deployment

Prior reading: What Are Formal Methods? | What Is RL?

Safety Is a Stack

No single technique makes AI safe. Safety comes from layers, each catching different failure modes.

The Layers

1. Clean Data

Garbage in, garbage out. Data quality determines the prior the model learns from. Biased, toxic, or incorrect data embeds those properties in the model.

2. Good Specifications and Benchmarks

What are we optimizing for? Goodhart's Law applies: once the benchmark becomes the target, it stops being a good benchmark. Benchmark gaming is optimization against your spec. ...

August 20, 2025 · 2 min · Austin T. O'Quinn

Mesa-Optimization and the Optimization Pressure Spectrum

Prior reading: Gradient Descent and Backpropagation | Decision Theory for AI Safety | What Is RL?

The Question

Why does an AI system behave the way it does? And why do optimizers keep creating sub-optimizers with misaligned goals? Three frameworks give different (complementary) answers. But first — some terminology that trips people up.

Mesa vs. Meta

I've seen these two confused often enough that it's worth being explicit. I had to look up what "mesa" even meant the first time I encountered it. ...

August 6, 2025 · 13 min · Austin T. O'Quinn

Mechanistic Interpretability: Circuits, Superposition, and Sparse Autoencoders

Prior reading: Probing | Why Sparsity? | Platonic Forms

What Mechanistic Interpretability Is Trying to Do

Mechanistic interpretability (mech interp) aims to reverse-engineer neural networks into human-understandable components. Not "what features does this layer represent?" (that's probing; see the probing post) but "what algorithm does this network implement, and how?" The analogy: probing tells you a chip has memory. Mech interp tells you it's a flip-flop built from NAND gates.

Why It Matters for Safety

If we can understand the mechanism by which a model produces an output, we can: ...

July 16, 2025 · 8 min · Austin T. O'Quinn

Platonic Forms in Near-Capacity Models

Prior reading: Probing: What Do Models Actually Know?

The Platonic Representation Hypothesis

As models scale and train on more data, their internal representations appear to converge — different architectures, different modalities, even different training objectives produce increasingly similar feature spaces. Are models discovering the "true structure" of the world?

What This Looks Like

- Vision models and language models learn similar geometric structures
- Larger models have more similar representations to each other than smaller ones
- Cross-modal transfer works better as models scale

The Platonic Analogy

Plato's forms: there exist ideal, abstract representations of concepts that physical instances approximate. Near-capacity models may be approximating these forms — not because they're philosophical, but because the data constrains the geometry of good representations. ...

July 2, 2025 · 2 min · Austin T. O'Quinn

Probing: What Do Models Actually Know?

Prior reading: Gradient Descent and Backpropagation | Why Sparsity?

What Is Probing?

Train a simple classifier (usually linear) on a model's internal representations to test whether specific information is encoded there. If a linear probe can extract "is this sentence toxic?" from layer 12 activations, the model represents toxicity at that layer.

How It Works

1. Freeze the model
2. Extract activations at a chosen layer for a labeled dataset
3. Train a linear (or shallow) classifier on those activations
4. Measure accuracy

High accuracy → the information is linearly accessible in the representation. ...
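The recipe above fits in a few lines. A self-contained sketch, with synthetic 2-D "activations" standing in for a real model's frozen layer outputs (in practice you would extract real activations and use an off-the-shelf logistic regression):

```python
import math
import random

# Stand-in for steps 1-2: hypothetical "activations" where dimension 0
# linearly encodes the binary label, plus an irrelevant noisy dimension.
random.seed(0)
data = []
for _ in range(200):
    y = random.random() < 0.5
    x = [random.gauss(1.0 if y else -1.0, 0.5), random.gauss(0.0, 1.0)]
    data.append((x, y))

# Step 3: train a linear probe (logistic regression by plain SGD).
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(50):
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
        grad = p - (1.0 if y else 0.0)
        w[0] -= lr * grad * x[0]
        w[1] -= lr * grad * x[1]
        b -= lr * grad

# Step 4: measure accuracy. High accuracy means the label is
# linearly accessible in these representations.
acc = sum(((w[0] * x[0] + w[1] * x[1] + b) > 0) == y for x, y in data) / len(data)
print(acc)
```

Because dimension 0 separates the classes by construction, the probe recovers the label well above chance; on activations that don't encode the property, accuracy stays near 50%, which is the null result probing is designed to expose.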

June 18, 2025 · 2 min · Austin T. O'Quinn