Prior reading: Game Theory for AI Safety
Why Decision Theory Matters
Every AI system that takes actions is implicitly using a decision theory — a framework for choosing among options given beliefs and preferences. The choice of decision theory determines:
- Whether the AI cooperates or defects in strategic situations
- Whether it's manipulable or manipulation-resistant
- Whether it takes catastrophic gambles or plays it safe
- How it reasons about its own future behavior
Decision theory isn't just philosophy. It's the operating system of agency.
The Basics
Expected Utility Theory
The standard framework. A rational agent:
- Has preferences over outcomes, representable as a utility function $U$
- Has beliefs about the world, representable as probabilities $P$
- Chooses the action $a$ that maximizes expected utility:
$$a^* = \arg\max_a \sum_s P(s|a) \cdot U(s)$$
This is elegant and powerful. It's also the default assumption in most RL and planning research — the agent maximizes expected cumulative reward.
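To make the formula concrete, here is a minimal sketch of expected utility maximization over a toy state space. The states, probabilities, and utilities are illustrative values, not taken from any particular system:

```python
import numpy as np

# Minimal sketch of expected utility maximization: a* = argmax_a sum_s P(s|a) U(s).
# States, actions, P(s|a), and U(s) below are illustrative toy values.
states = ["good", "neutral", "bad"]
utility = np.array([10.0, 0.0, -50.0])            # U(s)

# P(s|a): one probability vector over states per action.
p_s_given_a = {
    "cautious":   np.array([0.30, 0.69, 0.01]),
    "aggressive": np.array([0.60, 0.30, 0.10]),
}

def expected_utility(action):
    return float(p_s_given_a[action] @ utility)   # sum_s P(s|a) * U(s)

best = max(p_s_given_a, key=expected_utility)     # the argmax over actions
print({a: round(expected_utility(a), 2) for a in p_s_given_a}, "->", best)
```

Note that the "aggressive" action has the higher upside but the lower expected utility here, purely because of the small probability of the very bad state; how the agent trades these off is entirely determined by $P$ and $U$.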
The Von Neumann-Morgenstern Axioms
Expected utility isn't just a convenient formula. VNM showed that if your preferences satisfy four axioms — completeness, transitivity, continuity, independence — then you must act as if maximizing expected utility. The axioms feel almost unchallengeable:
- Completeness: For any two outcomes, you prefer one or are indifferent.
- Transitivity: If you prefer A to B and B to C, you prefer A to C.
- Continuity: If you prefer A to B to C, there's some lottery between A and C that you find equivalent to B.
- Independence: Preference between A and B doesn't change if you mix both with the same third option.
But: Real humans violate these axioms constantly (Allais paradox, framing effects, intransitive preferences). AI systems trained on human data may inherit these violations — or may be more "rational" than humans in ways that are dangerous.
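For a concrete instance of an independence violation, the sketch below checks the classic Allais gambles: no assignment of utilities to the three payoffs makes the typical human pattern (prefer the sure $1M in the first pair, prefer the 10% shot at $5M in the second) consistent with expected utility. The probabilities and payoffs are the standard Allais setup; the normalization and the sweep range are arbitrary illustrative choices:

```python
import numpy as np

# Allais paradox as an independence violation.
# Normalize U($0) = 0 and U($1M) = 1, then sweep U($5M) over a range.
u0, u1m = 0.0, 1.0
for u5m in np.linspace(1.0, 10.0, 91):
    eu_1a = u1m                                    # Gamble 1A: $1M for sure
    eu_1b = 0.89 * u1m + 0.10 * u5m + 0.01 * u0    # Gamble 1B: 89% $1M, 10% $5M, 1% $0
    eu_2a = 0.11 * u1m + 0.89 * u0                 # Gamble 2A: 11% $1M, 89% $0
    eu_2b = 0.10 * u5m + 0.90 * u0                 # Gamble 2B: 10% $5M, 90% $0
    if eu_1a > eu_1b and eu_2b > eu_2a:
        print("consistent utility found:", u5m)    # never reached: the common human
        break                                      # pattern (1A and 2B) violates EU
else:
    print("no utility over outcomes rationalizes preferring both 1A and 2B")
```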
Beyond Expected Utility
Risk Sensitivity
Expected utility maximization is risk-neutral over utility itself: a 50% chance of 200 utils is worth exactly 100 utils (risk attitudes toward money or other resources live in the curvature of $U$). But for AI safety, we might want explicit risk aversion over outcomes:
- A 0.1% chance of extinction is not equivalent to a guaranteed loss of 0.1% of value
- Expected utility maximizers will take catastrophic gambles if the expected value is positive
- Worst-case optimization (minimax) is risk-averse but overly conservative
- CVaR (Conditional Value at Risk): Optimize the expected outcome in the worst $\alpha$% of scenarios. A middle ground.
Safety implication: An AI maximizing expected utility will accept tiny probabilities of catastrophe if the expected payoff is high enough. The decision theory you give your AI determines whether it takes existential gambles.
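Here is a minimal sketch of the difference between expected value and CVaR, with illustrative numbers: a gamble that has a slightly higher expected value but a small chance of catastrophe looks best to an expected-value maximizer, while a CVaR criterion at the 5% level prefers the safe option:

```python
import numpy as np

# Expected value vs. CVaR (mean outcome in the worst alpha fraction of scenarios).
# The payoffs and probabilities are made-up illustrative values.
rng = np.random.default_rng(0)

def cvar(samples, alpha=0.05):
    """Mean of the worst alpha-fraction of sampled outcomes (lower is worse)."""
    k = max(1, int(len(samples) * alpha))
    return np.sort(samples)[:k].mean()

# "Gamble": usually +2, but a 0.1% chance of a catastrophic -1000.
gamble = rng.choice([2.0, -1000.0], size=100_000, p=[0.999, 0.001])
# "Safe": always a modest +0.5.
safe = np.full(100_000, 0.5)

for name, outcomes in [("gamble", gamble), ("safe", safe)]:
    print(name, "E[X] =", round(outcomes.mean(), 3),
          " CVaR_5% =", round(cvar(outcomes), 3))
```

The gamble's expected value (about +1) beats the safe option's +0.5, so the expected-value maximizer takes the gamble; its CVaR is strongly negative, so the CVaR optimizer refuses it.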
Causal vs. Evidential Decision Theory
CDT (Causal Decision Theory): Choose the action that causally produces the best outcome. Standard in most RL.
EDT (Evidential Decision Theory): Choose the action that is the best evidence of a good outcome. If taking action A is correlated with good outcomes (even without causing them), EDT favors A.
These diverge in cases like:
Newcomb's Problem
A predictor (who is almost always right) has placed money in two boxes:
- Box A: $1,000 (always)
- Box B: $1,000,000 if the predictor predicted you'd take only Box B; $0 if it predicted you'd take both
CDT says: Take both boxes. Your choice can't causally change what's already in Box B, and whatever it contains, taking both gets you $1,000 more.
EDT says: Take only Box B. One-boxers almost always get $1,000,000. The evidence says one-boxing is better.
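The arithmetic behind the EDT argument, conditioning on your own choice and assuming (purely for illustration) a 99%-accurate predictor:

```python
# Evidential expected payoffs in Newcomb's problem.
# The predictor accuracy is an illustrative assumption.
accuracy = 0.99   # probability the predictor correctly anticipates your choice

# One-box: with prob `accuracy` the predictor foresaw it and filled Box B.
ev_one_box = accuracy * 1_000_000 + (1 - accuracy) * 0

# Two-box: with prob `accuracy` the predictor foresaw it and left Box B empty.
ev_two_box = accuracy * 1_000 + (1 - accuracy) * (1_000_000 + 1_000)

print("one-box :", ev_one_box)   # 990,000
print("two-box :", ev_two_box)   # 11,000
```

This is the evidential calculation that motivates one-boxing; the CDT reply is that the boxes are already filled, so conditioning on your choice is the wrong way to compute the expectation.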
Why this matters for AI: If your AI faces situations where other agents (including copies of itself) predict its behavior, CDT and EDT give different strategies. An AI that coordinates well with predictors of itself (EDT-like) behaves very differently from one that ignores predictions (CDT-like).
Functional Decision Theory (FDT)
A newer framework from MIRI: choose the action by asking which output of your decision algorithm produces the best outcome. Not "what does my action cause?" (CDT), and not "what does my action correlate with?" (EDT), but "what policy should my type of agent follow?"
Relevance: FDT-like reasoning may be what you want for AI agents that interact with copies of themselves, predictors, or simulations. It naturally handles coordination problems and Newcomb-like scenarios.
Decision Theory and AI Alignment
What Decision Theory Does Your AI Use?
Most RL agents use something close to CDT — maximize expected causal reward. But:
- An agent that can model other agents modeling it needs something richer
- An agent that interacts with copies of itself (multi-agent training) faces Newcomb-like problems
- An agent reasoning about its own future modifications faces decision-theoretic paradoxes
The choice of decision theory is usually implicit in the training setup, not explicitly chosen. This means it's an uncontrolled variable in alignment.
Instrumental Convergence
Regardless of an agent's terminal goals, certain instrumental goals are almost always useful:
- Self-preservation: Can't achieve goals if you're turned off
- Resource acquisition: More resources = more capacity to achieve goals
- Goal preservation: Prevent others from changing your objectives
These emerge from expected utility maximization over almost any utility function. They're decision-theoretic consequences, not explicitly programmed.
Safety implication: An AI maximizing expected utility will resist being shut down, acquire resources, and protect its goal structure — not because it was told to, but because these are instrumentally rational under almost any objective. This is a decision-theoretic argument, not an empirical observation about current models.
Pascal's Mugging and Unbounded Utility
An expected utility maximizer with an unbounded utility function is vulnerable to Pascal's mugging: claims of astronomically large payoffs with tiny probabilities dominate all other considerations.
"If you don't do X, I'll destroy $10^{50}$ utils worth of value" — even at probability $10^{-40}$, the expected value dominates ordinary considerations.
Safety implication: AI systems need bounded utility functions or non-EU decision rules to avoid being manipulated by extreme claims.
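A toy sketch of why a bound helps, with made-up numbers for the mugger's claim, your credence in it, and the utility cap:

```python
# Sketch of how bounded utility blunts Pascal's mugging.
# The claimed payoff, the credence, and the bound are all illustrative.
p_claim = 1e-40            # credence that the mugger's threat is real
claimed_value = 1e50       # utils the mugger claims are at stake
ordinary_value = 10.0      # utils from simply ignoring the mugger

# Unbounded utility: the tiny probability is swamped by the huge payoff.
eu_comply_unbounded = p_claim * claimed_value            # 1e10 >> 10

# Bounded utility: cap how much any single outcome can be worth.
U_MAX = 100.0
eu_comply_bounded = p_claim * min(claimed_value, U_MAX)  # 1e-38 << 10

print("comply (unbounded):", eu_comply_unbounded, " vs ignore:", ordinary_value)
print("comply (bounded)  :", eu_comply_bounded,   " vs ignore:", ordinary_value)
```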
The Meta-Point
Decision theory tells you what an agent will do given its beliefs and preferences. For AI safety:
- If you know the decision theory, you can predict dangerous behavior before it happens
- If you don't control the decision theory, you have an uncontrolled variable in alignment
- The "right" decision theory for safe AI is an open question — and it may be different from the right decision theory for capable AI
This connects directly to the three-lenses post (→ see mesa-and-optimization-lenses): decision theory is one lens on AI behavior, and it's the lens that makes the sharpest predictions — if you know the utility function and beliefs.