Prior reading: Gradient Descent and Backpropagation | Decision Theory for AI Safety | What Is RL?
The Question
Why does an AI system behave the way it does? And why do optimizers keep creating sub-optimizers with misaligned goals? Three frameworks give different (complementary) answers.
But first — some terminology that trips people up.
Mesa vs. Meta
I've seen these two confused often enough that it's worth being explicit. I had to look up what "mesa" even meant the first time I encountered it.
Meta-optimization is optimizing the optimization process itself. Hyperparameter tuning, neural architecture search, learning rate schedules — anything that operates above the training loop, adjusting how training works. Meta-optimization is something the researcher or an outer system does deliberately.
Mesa-optimization is an optimizer that arises inside the trained model. "Mesa" comes from a Greek word meaning "within" or "inside" (not the Spanish word for table); Hubinger et al. (2019) coined the term as the opposite of "meta." Where meta means above, mesa means below. A mesa-optimizer is a learned sub-system within the model that is itself performing optimization at inference time: planning, searching, pursuing goals.
The distinction matters because they point in opposite directions. Meta-optimization is a tool we use to control training. Mesa-optimization is something training produces that we don't control. One is above us in the optimization stack; the other is below.
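A sketch of which one lives in code we write (the `train` function below is a hypothetical stand-in for a real inner training loop): meta-optimization is an explicit outer loop over the training process.

```python
import random

def train(learning_rate, seed):
    """Stand-in for an inner training loop: returns a validation loss.
    (Hypothetical toy: a real loop would run SGD on actual data.)"""
    random.seed(seed)
    # Pretend lr = 0.1 is ideal; add noise to mimic run-to-run variance.
    return (learning_rate - 0.1) ** 2 + random.uniform(0, 0.01)

# Meta-optimization: a loop ABOVE training that tunes how training works.
best_lr = min([0.001, 0.01, 0.1, 1.0], key=lambda lr: train(lr, seed=0))
print(best_lr)  # 0.1 -- the grid search operates on the training process itself
```

Mesa-optimization, by contrast, has no block like this anywhere in the codebase: if it exists, it is structure inside the learned weights that no one wrote down.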
The Evolution Lens
Evolution optimized for reproductive fitness. It produced humans — who optimize for their own goals, not fitness. We are mesa-optimizers: optimizers created by an outer optimizer, pursuing objectives that diverge from the outer objective.
This isn't a bug. Mesa-optimizers are efficient. A general-purpose optimizer that can handle novel situations is more fit than a lookup table, even if its objectives drift.
Strengths: Explains mesa-optimization, reward hacking, deceptive alignment. These are natural outcomes of selection. Weaknesses: The analogy has limits. Training is much faster and more targeted than biological evolution. The "environment" (training data) is curated.
The Optimization Lens
Gradient descent optimizes a loss function. If the loss landscape rewards models that internally perform optimization (planning, search, means-end reasoning), then SGD will find mesa-optimizers.
A model that learns to optimize on the fly can generalize better than one that memorizes — so the training process selects for internal optimization. The model is a point in parameter space found by gradient descent. Its behavior reflects the loss landscape geometry — what solutions are easy to find, not just which are optimal.
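A toy sketch of that geometry claim (the landscape, step size, and initialization range are all invented for illustration): two minima, one wide and shallow, one narrow and deeper. Gradient descent from random starting points lands in the wide one far more often, because basin volume, not depth, determines what's easy to find.

```python
import math, random

# Gradient of f(x) = -exp(-(x+2)^2 / 2) - 2*exp(-(x-2)^2 / 0.1):
# a wide shallow well at x = -2 and a narrow but deeper well at x = +2.
def grad(x):
    wide = (x + 2) * math.exp(-((x + 2) ** 2) / 2)
    narrow = 40 * (x - 2) * math.exp(-((x - 2) ** 2) / 0.1)
    return wide + narrow

random.seed(0)
wide_count = narrow_count = 0
for _ in range(200):
    x = random.uniform(-4, 4)      # random initialization
    for _ in range(5000):
        x -= 0.01 * grad(x)        # plain gradient descent
    if abs(x + 2) < 0.5:
        wide_count += 1
    elif abs(x - 2) < 0.5:
        narrow_count += 1

# The deeper minimum is "better", but the wide basin is what's easy to find.
print(wide_count, narrow_count)
```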
Strengths: Explains why models converge to similar solutions, why simple functions are preferred, why some failures are systematic. Weaknesses: Doesn't explain what the model has learned to want. Describes the path, not the destination's properties.
The Decision Theory Lens
The model is a rational agent maximizing expected utility. Its behavior follows from its objective function and beliefs about the world.
Strengths: Precise predictions. If you know the objective, you can predict behavior. Weaknesses: Assumes a coherent objective exists. Real models may have inconsistent or context-dependent objectives.
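A minimal sketch of the lens (states, probabilities, and payoffs all hypothetical): given an objective (a utility table) and beliefs (a distribution over states), behavior is just the argmax of expected utility.

```python
# All numbers are invented for illustration.
beliefs = {"rain": 0.3, "sun": 0.7}                    # P(state)
utility = {                                            # U[action][state]
    "take_umbrella": {"rain": 1.0, "sun": 0.5},
    "leave_it":      {"rain": -1.0, "sun": 1.0},
}

def expected_utility(action):
    return sum(p * utility[action][s] for s, p in beliefs.items())

best = max(utility, key=expected_utility)
print(best)  # "take_umbrella": EU 0.65 beats 0.40 for "leave_it"
```

The weakness shows up just as directly: if no coherent `utility` table exists, or it shifts with context, the prediction machinery has nothing to grip.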
The Convergence
All three lenses — evolution, optimization, decision theory — converge on the same prediction: sufficiently powerful optimization processes create sub-agents with misaligned goals. The mechanism differs. The outcome rhymes.
Each lens is strongest in different territory. The optimization lens is best for debugging training failures — understanding what SGD found and why. The evolution lens is best for understanding emergent misalignment — why selection produces agents that diverge from what they were selected for. The decision theory lens is best for predicting specific behavior — if you know the objective, you can predict what the agent does.
But all three lenses describe mechanisms. They don't tell you how hard the resulting problem is to deal with. For that, you need a different axis.
The Optimization Pressure Spectrum
What "Optimization Pressure" Means
Optimization pressure is the degree to which a process — any process, not just gradient descent — pushes outcomes toward a particular region of possibility space. A thermostat exerts optimization pressure on room temperature. SGD exerts optimization pressure on model parameters. A company running thousands of training jobs and picking the best checkpoint exerts optimization pressure on model behavior. Natural selection exerts optimization pressure on phenotypes.
The key insight is that optimization pressure doesn't require an optimizer that intends to optimize. It only requires a process that systematically filters outcomes. Wherever there is selection — deliberate or accidental — there is optimization pressure.
This matters because AI safety tends to focus on the optimization we designed: loss functions, reward signals, training objectives. But optimization pressure can arise from sources we didn't design and don't control. The spectrum below orders these sources from most direct to most emergent.
What "Fundamental" Means
When I say a problem is more fundamental, I mean: how much of the optimization process would you have to change to eliminate it?
A problem caused by a bad data point is not fundamental. You remove the data point, retrain, problem gone. A problem caused by mesa-optimization is more fundamental — it arises from the structure of what SGD tends to find, not from any particular training choice. You can't fix it by changing a hyperparameter. You'd have to change something about how optimization itself works, or how you verify its outputs.
Fundamentality, in this sense, is about how deeply the problem is rooted in the process. Surface problems have surface fixes. Deep problems require deep fixes — or, often, entirely different approaches.
The Spectrum
Level 1: Direct Data — "The Model Saw the Answer"
The fastest and most controllable form of optimization pressure. The model produces a particular output because the training data contained that output (or something close enough to memorize). The "optimization" is just storage and retrieval.
- Speed: Immediate. One exposure can be enough.
- Controllability: High. You know what's in your dataset (or you can audit it). You can filter, deduplicate, remove.
- Detectability: High. Memorization is testable — check whether the model reproduces training data verbatim, or whether performance on a data point drops when it's removed from training.
- Fundamentality: Low. This is a data curation problem. Better data pipelines fix it. It doesn't tell you anything deep about optimization.
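The verbatim-reproduction check is easy to sketch (the document, the prompt split, and the `generate` stub are all hypothetical stand-ins for a real model's completion API):

```python
# A crude memorization probe: does the "model" reproduce a training string
# verbatim when prompted with its prefix?
training_doc = "The launch code is 8675309 and must never be shared."

def generate(prompt):
    # Hypothetical memorizing model: regurgitates the training document
    # whenever the prompt matches its start.
    if training_doc.startswith(prompt):
        return training_doc[len(prompt):]
    return ""

prefix, suffix = training_doc[:23], training_doc[23:]
leaked = generate(prefix) == suffix
print(leaked)  # True: verbatim completion of the held-out suffix means memorization
```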
The Layers of Safety post covers this as the first layer: clean data in, clean behavior out. The failure mode is straightforward — garbage in, garbage out — and the fix is straightforward too.
Level 2: Trained General Goal — "SGD Shaped a Policy"
The model doesn't just memorize — it generalizes. SGD found parameters that perform well on the training distribution by learning a general strategy. The model pursues the objective you specified, and it does so in novel situations it hasn't seen before.
- Speed: Slower than memorization. Requires enough data and training steps for the loss landscape to push toward a general solution rather than memorized patches.
- Controllability: Moderate. You chose the loss function and the training distribution. But you don't fully control how the model achieves the objective — which features it uses, which shortcuts it finds, which correlations it exploits.
- Detectability: Moderate. You can evaluate on held-out data, but the gap between "performs well on the test set" and "does what I actually wanted" is exactly Goodhart's Law. The model satisfies your proxy, not necessarily your intent.
- Fundamentality: Moderate. Reward hacking, specification gaming, and the gap between proxy and intent all live here. These aren't data bugs — they're consequences of optimizing a proxy objective. You can improve specifications (→ The Specification Problem), but the gap between any finite specification and the real objective is a fundamental limitation of formal specification.
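A toy version of the proxy/intent gap (both functions invented for illustration): hill-climb the measurable proxy and watch the true objective get left behind.

```python
# The intent: x should be near 1. The proxy agrees near the start but
# contains an exploitable term the true objective does not reward.
def true_objective(x):
    return -(x - 1.0) ** 2

def proxy(x):
    return true_objective(x) + 0.2 * x ** 2   # measurable stand-in, slightly off

# Naive hill climbing on the proxy.
x, step = 0.0, 0.01
for _ in range(10_000):
    if proxy(x + step) > proxy(x):
        x += step
    elif proxy(x - step) > proxy(x):
        x -= step

print(round(x, 2))                              # 1.25, the proxy's optimum
print(true_objective(x) < true_objective(1.0))  # True: the intent suffers
```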
This is where most current alignment work operates: RLHF, DPO, constitutional AI — all techniques for shaping the trained objective (→ Alignment Techniques). The problems at this level are real but at least the optimization is yours. You designed the objective. When it goes wrong, you can in principle redesign it.
Convergent Instrumental Goals
Before describing Level 3, one more concept. A convergent instrumental goal is a sub-goal that's useful for achieving almost any terminal goal. The idea, due to Omohundro and Bostrom, is that certain instrumental strategies are attractors in goal-space — an optimizer pursuing nearly any objective will tend to develop them, because they help with everything:
- Self-preservation: If your goal is X, you can't achieve X if you cease to exist. Almost any goal is better served by continued existence.
- Resource acquisition: If your goal is X, more resources (compute, energy, information) generally make X easier to achieve.
- Goal preservation: If your goal is X, you don't want a future version of yourself to be modified to pursue Y instead. Protecting your own objective function is instrumentally useful for any objective.
The word "convergent" is doing real work here. It means these sub-goals show up regardless of what the terminal goal is. You don't need to train a model to seek resources or preserve itself. If it's a sufficiently capable optimizer pursuing any goal, these strategies are likely to emerge because they're useful scaffolding for everything.
This matters for what comes next.
Level 3: Mesa-Optimization — "The Model Developed Its Own Goals"
This is the core of the first half of this post. The outer optimizer (SGD) finds a model that is itself an optimizer — a mesa-optimizer — pursuing objectives that were never explicitly specified. These objectives only needed to correlate with the training objective on the training distribution. Off-distribution, they can diverge arbitrarily.
- Speed: Slow. Mesa-optimization emerges when the training process runs long enough, on complex enough tasks, that internal optimization becomes a useful strategy. It's not something you can produce in a few gradient steps.
- Controllability: Low. The mesa-objective is a learned feature of the model's internal structure. You didn't specify it. You may not even know it exists. Standard training interventions (changing the loss function, adding data) don't directly target it — they only change the selection pressure that produced it, and the new selection pressure may produce a different mesa-objective rather than no mesa-objective.
- Detectability: Low. A mesa-optimizer that has learned to appear aligned during training — deceptive alignment — is specifically selected to be hard to detect. Its behavior on the training distribution is indistinguishable from a genuinely aligned model. Detecting it requires interpretability tools that can inspect internal structure, not just behavior.
- Fundamentality: High. Mesa-optimization isn't a bug in any particular training run. It's a convergent outcome of optimization over sufficiently complex tasks. The evolutionary lens predicts it (optimizers produce optimizers). The optimization lens predicts it (internal optimization generalizes better). You can't fix it by tuning hyperparameters. You'd need either a way to verify internal structure (interpretability, formal verification), or a fundamentally different training paradigm that provably doesn't produce mesa-optimizers — and we don't have one.
And this is where convergent instrumental goals become dangerous. A mesa-optimizer doesn't need to be trained to seek resources or preserve itself. Those sub-goals emerge for free — they're useful scaffolding for any objective. The problematic behaviors aren't ones we accidentally rewarded. They're ones that arise as a side effect of being good at optimization, and they emerge regardless of what we train for.
Level 4: Selection from Random Variation — "P-Hacking the Model Zoo"
The most indirect form of optimization pressure, and the most fundamental.
Suppose you train a thousand models with different random seeds, different hyperparameters, different data shuffles. You evaluate them all on some benchmark and pick the best one. You've exerted optimization pressure on the final model — not through gradient descent, not through a reward signal, but through selection over random variation.
This is structurally identical to p-hacking in science (→ P-Hacking as Optimization): run enough experiments and some will show the result you want by chance. The "optimization" is just the act of choosing.
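The p-hacking analogy can be simulated directly (assuming, for the toy, that every model has identical true quality and the benchmark measures quality plus noise):

```python
import random

random.seed(0)

def benchmark(true_quality):
    return true_quality + random.gauss(0, 1)   # noisy evaluation

zoo = [0.0] * 1000                             # no model is actually better
scores = [benchmark(q) for q in zoo]
winner = max(range(len(zoo)), key=lambda i: scores[i])

print(scores[winner])   # roughly +3: selection alone produced this score
print(zoo[winner])      # 0.0, the winner's true quality is unchanged
```

The winning checkpoint looks impressive on the benchmark, and the impression is entirely an artifact of choosing the best of a thousand noisy draws.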
- Speed: Very slow. You need to produce a large population of candidates and evaluate all of them. The optimization pressure per unit of compute is minimal.
- Controllability: Very low. You didn't train the property you selected for. It arose by chance. You have no mechanism to strengthen, weaken, or modify it. You can't do gradient descent on "the thing that happened to make this seed special."
- Detectability: Very low. The selected-for property may not correspond to any interpretable feature of the model. It's a statistical artifact of the selection process, not a learned representation. There's nothing to find with interpretability tools because there's no coherent internal structure producing it — it's noise that happened to point in the right direction.
- Fundamentality: Very high. This isn't a property of any particular training algorithm, architecture, or dataset. It's a property of selection itself. Anywhere you have a population of candidates and a selection criterion, you get optimization pressure — even if no individual candidate was optimized. This is the mechanism behind competitive dynamics at the model level: the market selects models, and the selected-for properties may have nothing to do with anyone's stated objectives.
The unsettling implication: even if you solve mesa-optimization, even if you develop perfect interpretability tools, even if every individual model is provably aligned — selection over a population of such models can produce outcomes that no individual model was optimized for. The optimization pressure comes from the selection process, not the training process.
The Pattern
| Level | Speed | Controllability | Detectability | Fundamentality |
|---|---|---|---|---|
| Direct data | Immediate | High | High | Low |
| Trained goal | Fast | Moderate | Moderate | Moderate |
| Mesa-optimization | Slow | Low | Low | High |
| Random selection | Very slow | Very low | Very low | Very high |
Reading down the rows: each successive form of optimization pressure is slower, less controllable, harder to detect, and more deeply rooted in the structure of optimization itself.
Reading across the columns: speed and controllability trade off against fundamentality. The problems you can fix quickly are the ones that don't matter much. The problems that matter most are the ones where you don't even have a clear intervention.
This isn't a coincidence. It's the same pattern that shows up everywhere in engineering: the easy failures are the shallow ones. The deep failures are the ones built into the structure of what you're doing. And in optimization specifically, the deepest failure mode is that selection is optimization — and selection happens whether you want it to or not.
What the Lenses See
Each of the three lenses from earlier in this post is better at illuminating some levels of the spectrum than others.
- The optimization lens is strongest at Levels 1 and 2. It explains what SGD finds and why — loss landscape geometry, inductive biases, the path the optimizer takes through parameter space. At Levels 3 and 4, the optimization lens still applies in principle, but the relevant optimization process is no longer the one you're running. It's the one that emerged or the one imposed by external selection.
- The evolution lens is strongest at Level 3. Mesa-optimization is the direct analogue of evolved organisms pursuing goals that diverge from fitness. The evolution lens also illuminates Level 4 — selection over random variation is the most basic evolutionary mechanism — but it's overkill for Levels 1 and 2.
- The decision theory lens applies at Levels 2 and 3, wherever the model can be meaningfully modeled as an agent with objectives. It's less useful at Level 1 (memorization isn't goal-directed) and Level 4 (the "optimization" isn't performed by an agent at all).
No single lens covers the full spectrum. This is why safety analysis requires all three — and why it requires acknowledging Level 4, where none of the agent-centric lenses fully apply.
Implications
The spectrum reframes the alignment problem. Most current safety work — RLHF, DPO, red-teaming, benchmarking — targets Levels 1 and 2. These are real problems and the work matters. But the spectrum suggests that the hardest problems live at Levels 3 and 4, where the optimization pressure is indirect, the resulting properties are hard to detect, and the interventions we have are weakest.
If mesa-optimization is a convergent outcome of optimization, then alignment isn't just about the training objective — it's about what the model learns to want internally. And if selection over populations produces optimization pressure that no individual training run controls, then alignment isn't even just about individual models — it's about the ecosystem of models and the selection processes that filter them.
Understanding this requires all three lenses, applied across all four levels. The problems you can see with one lens at one level are not the problems that will kill you.