Prior reading: Game Theory: Why Coordination on Safety Is So Hard
Game theory is about interactions between agents. Decision theory is about the reasoning inside one agent. This post builds up the framework for understanding what a rational agent will do given its beliefs and preferences — and why that framework has direct, sometimes alarming, consequences for AI safety.
Starting from a Single Choice
Suppose you're an agent — human, AI, doesn't matter — and you have to choose between actions. You're uncertain about the state of the world. Each action produces an outcome that depends on both your choice and the state. How should you decide?
This is simpler than game theory — there's no strategic interaction, no circular dependency between agents. But the question of how to aggregate beliefs and preferences into a single choice turns out to be deeper than it looks, and the answer you choose has direct consequences for whether your AI system is safe.
Expected Value: The First Attempt
The most natural starting point is expected value. You have a set of possible actions $\{a_1, \ldots, a_m\}$, a set of possible states $\{s_1, \ldots, s_n\}$, and a probability distribution $P$ over states. Each action-state pair produces a payoff $v(a_i, s_j)$. Choose the action with the highest expected payoff:
$$a^* = \arg\max_{a} \sum_{j} P(s_j) \cdot v(a, s_j)$$
This is intuitive and works well for many problems. But it breaks in an instructive way.
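Concretely, the rule is a one-line matrix-vector product. A minimal sketch, with a made-up payoff matrix for illustration:

```python
import numpy as np

# Hypothetical example: 3 actions (rows), 4 states (columns).
P = np.array([0.1, 0.2, 0.3, 0.4])     # P(s_j), sums to 1
v = np.array([[10.0, 0.0, 0.0, 0.0],   # v(a_i, s_j): payoff of action i in state j
              [ 2.0, 2.0, 2.0, 2.0],
              [ 0.0, 1.0, 3.0, 4.0]])

expected = v @ P            # sum_j P(s_j) * v(a_i, s_j), one entry per action
a_star = int(np.argmax(expected))
print(a_star)               # 2: the action with expected payoff 2.7
```

Note that action 0 has the single largest payoff (10) but the lowest expected payoff (1.0), which is exactly the aggregation the formula performs.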
The St. Petersburg Paradox
Nicolas Bernoulli posed this problem in 1713. A fair coin is flipped repeatedly until it lands tails. If the first tails appears on flip $n$, you receive $2^n$ dollars. What should you pay to play?
The expected value of the game is:
$$\mathbb{E}[\text{payoff}] = \sum_{n=1}^{\infty} \frac{1}{2^n} \cdot 2^n = \sum_{n=1}^{\infty} 1 = \infty$$
The expected payoff is infinite. An expected value maximizer should pay any finite amount to play — mortgage the house, sell everything, pay a billion dollars for a single coin-flipping game. This is absurd. No rational person would pay even $100.
There's no trick here — the sum really does diverge. The problem isn't the math. The problem is the assumption.
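You can watch the divergence numerically. Each term of the EV series is $(1/2^n) \cdot 2^n = 1$, so the truncated expectation after $N$ terms is exactly $N$; a quick simulation (the sampling code is an illustrative sketch) shows the empirical mean never settling either:

```python
import random

# Partial sums of the EV series: each term is exactly 1, so the sum is N.
partial_ev = [sum(0.5**n * 2**n for n in range(1, N + 1)) for N in (10, 100, 1000)]
print(partial_ev)   # [10.0, 100.0, 1000.0] -- grows without bound

def st_petersburg_payoff(rng):
    """Flip a fair coin until tails; pay 2^n if the first tails is flip n."""
    n = 1
    while rng.random() < 0.5:   # heads: keep flipping
        n += 1
    return 2 ** n

# Empirical means drift upward as rare enormous payoffs land.
rng = random.Random(0)
for trials in (1_000, 100_000):
    mean = sum(st_petersburg_payoff(rng) for _ in range(trials)) / trials
    print(trials, round(mean, 2))
```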
Expected value treats dollars linearly: $200 is exactly twice as good as $100. But a dollar matters more when you have nothing than when you have millions. The value of money isn't linear in its amount. This seems obvious in hindsight, but it took the formalization of a paradox to make the field take it seriously.
From Value to Utility
Daniel Bernoulli's resolution (1738) was to replace monetary value with utility — a function $U$ that captures how much you actually care about each outcome. If $U$ is concave (exhibits diminishing returns), the expected utility of the St. Petersburg game can be finite even though the expected monetary value is infinite.
With logarithmic utility $U(x) = \log(x)$:
$$\mathbb{E}[U] = \sum_{n=1}^{\infty} \frac{1}{2^n} \cdot \log(2^n) = \log(2) \sum_{n=1}^{\infty} \frac{n}{2^n} = 2\log(2) \approx 1.39$$
Finite. The game is worth about $4 in certainty-equivalent terms. Much more reasonable.
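Checking Bernoulli's numbers directly (truncating the series, which converges fast):

```python
import math

# E[U] = sum_{n>=1} (1/2^n) * log(2^n) = log(2) * sum_{n>=1} n/2^n = 2*log(2)
eu = sum(0.5**n * n * math.log(2) for n in range(1, 100))
print(round(eu, 4))        # 1.3863, i.e. 2*log(2)

# Certainty equivalent: the sure amount whose log-utility equals E[U].
ce = math.exp(eu)
print(round(ce, 2))        # 4.0 dollars
```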
But this isn't the end of the story. You can construct a "super St. Petersburg game" with payoffs $2^{2^n}$ that defeats logarithmic utility. For any unbounded utility function, there exists a St. Petersburg-style construction that produces infinite expected utility. The only airtight fix is a bounded utility function — one where $|U(x)| \leq M$ for some finite $M$.
This is an important observation for AI safety, and we'll return to it.
Expected Utility Theory
The St. Petersburg paradox motivates replacing value with utility. But why should we maximize expected utility specifically? Why not maximize worst-case utility, or median utility, or some other aggregate?
The Von Neumann–Morgenstern Axioms
Von Neumann and Morgenstern (1944) showed that expected utility maximization isn't just a useful heuristic — it's the only decision rule consistent with a small set of reasonable-sounding axioms about preferences over lotteries (probability distributions over outcomes).
The axioms are:
Completeness. For any two lotteries $L_1$ and $L_2$, either $L_1 \succeq L_2$, $L_2 \succeq L_1$, or both (indifference). You can always compare.
Transitivity. If $L_1 \succeq L_2$ and $L_2 \succeq L_3$, then $L_1 \succeq L_3$. No preference cycles.
Continuity. If $L_1 \succ L_2 \succ L_3$, there exists some probability $p \in (0,1)$ such that $pL_1 + (1-p)L_3 \sim L_2$. No outcome is infinitely better or worse than any other.
Independence. If $L_1 \succeq L_2$, then for any lottery $L_3$ and any $p \in (0,1)$: $pL_1 + (1-p)L_3 \succeq pL_2 + (1-p)L_3$. Mixing both options with the same third lottery doesn't change the preference.
The VNM theorem says: if your preferences satisfy these four axioms, then there exists a utility function $U$ such that you prefer lottery $L_1$ to $L_2$ if and only if $\mathbb{E}[U(L_1)] \geq \mathbb{E}[U(L_2)]$. Moreover, $U$ is unique up to positive affine transformation (you can shift and scale it, but nothing else).
A common misreading of this result is that it says "you should maximize expected utility." It doesn't — and the distinction matters. It's a representation theorem, not a normative claim. It says: if you're consistent in these four ways, then you behave as if maximizing expected utility. It doesn't say you should be consistent in these ways. But the axioms sound so reasonable that it feels like you're forced into EU maximization unless you're willing to give up something basic like transitivity.
Do These Axioms Actually Hold?
The axioms sound reasonable. But humans systematically violate them, and this matters for AI.
The Allais Paradox (1953). Consider four gambles:
- A: $1,000,000 with certainty.
- B: 89% chance of $1M, 10% chance of $5M, 1% chance of $0.
- C: 11% chance of $1M, 89% chance of $0.
- D: 10% chance of $5M, 90% chance of $0.
Most people prefer A over B (the certainty of $1M feels safer than the gamble) and D over C (since both involve high chance of $0, might as well go for the bigger prize). But this pair of preferences violates the independence axiom. Let me show you why.
Notice that gamble A is equivalent to: 89% chance of $1M + 11% chance of $1M. And gamble B is equivalent to: 89% chance of $1M + 11% chance of (10/11 chance of $5M, 1/11 chance of $0). The only difference between A and B is what happens in that 11% slice. The 89% part is identical.
Now do the same decomposition with C and D. Gamble C is: 89% chance of $0 + 11% chance of $1M. Gamble D is: 89% chance of $0 + 11% chance of (10/11 chance of $5M, 1/11 chance of $0). Again, the only difference is the same 11% slice. The 89% part is identical (it's $0 in both cases instead of $1M in both cases).
The independence axiom says: if you prefer the 11% slice in A over the 11% slice in B, you should also prefer it in C over D — because the 11% slice is the same, and the 89% part shouldn't affect the comparison. But people flip their preference: they prefer the sure $1M slice (choosing A) when the 89% background is also $1M, but prefer the risky $5M slice (choosing D) when the 89% background is $0. The background context — which independence says shouldn't matter — changes the choice.
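The decomposition can be checked mechanically. Under any utility assignment (normalizing $U(\$0) = 0$), both comparisons reduce to the same inequality $0.11\,u_1$ vs. $0.10\,u_5$, so no EU maximizer can exhibit the common human pattern. A small sketch:

```python
import random

def eu_preferences(u1, u5):
    """Given utilities U($1M)=u1 and U($5M)=u5, with U($0)=0, return
    which of each Allais pair an expected-utility maximizer picks."""
    A, B = u1, 0.89 * u1 + 0.10 * u5
    C, D = 0.11 * u1, 0.10 * u5
    return A > B, C > D

# Both comparisons are the same inequality, so they always agree:
# "A over B but D over C" is impossible for any utility function.
rng = random.Random(42)
for _ in range(10_000):
    u1 = rng.uniform(0, 10)
    u5 = rng.uniform(u1, 20)        # $5M at least as good as $1M
    a_over_b, c_over_d = eu_preferences(u1, u5)
    assert a_over_b == c_over_d     # never the human pattern
print("no utility function reproduces the common human choice")
```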
The violation comes from the certainty effect — humans weight certain outcomes more heavily than expected utility allows. This has been extensively documented in prospect theory (Kahneman and Tversky, 1979).
Why this matters for AI: AI systems trained on human preference data — through RLHF, DPO, or other preference-learning methods — will inherit whatever axiom violations exist in the training data. If human preferences are intransitive, the learned preference model may be intransitive. An agent with intransitive preferences can be money-pumped: a sequence of individually-accepted trades that leaves it strictly worse off. For an AI system controlling real resources, this is a vulnerability.
Conversely, an AI system trained to be a perfect EU maximizer may be more rational than its designers — which sounds good until you consider that EU maximization leads to some of the most concerning behaviors in AI safety, as we'll see below.
Risk Sensitivity
Expected utility maximization is risk-neutral by default. The curvature of $U$ captures risk aversion over outcomes, but the expectation operator treats probability linearly: a 1% chance of losing 100 utils and a certain loss of 1 util are treated identically.
For AI safety, this is a problem.
Why Risk Neutrality Is Dangerous
An expected-utility-maximizing AI will accept any gamble with positive expected utility. A 0.1% chance of human extinction paired with a 99.9% chance of tremendous benefit has positive expected utility under most utility functions. An EU maximizer takes that gamble.
More precisely: for any catastrophic outcome with utility $U_{\text{bad}}$ that is avoided with probability $1-\epsilon$, an EU maximizer will risk it if the probability-weighted upside exceeds $\epsilon \cdot |U_{\text{bad}}|$. As capabilities increase and the upside grows, the acceptable level of catastrophic risk $\epsilon$ also grows.
This isn't a bug in EU maximization — it's a feature that happens to be catastrophically misaligned with how most humans feel about existential risk. We don't treat a 0.1% chance of extinction as 0.1% of the badness of certain extinction. We treat it as something to avoid at almost any cost.
Alternatives to Expected Utility
Several frameworks make risk aversion more central:
Minimax (worst-case optimization): Choose the action that maximizes the worst-case payoff. This is extremely conservative — it ignores probability entirely and focuses only on the worst outcome. Useful for adversarial settings but overly cautious for most applications.
CVaR (Conditional Value at Risk): Choose the action that maximizes the expected payoff over the worst $\alpha$ fraction of scenarios. CVaR$_{0.05}$ means you optimize for the expected outcome given that you're in the bottom 5% of the distribution. This is a middle ground — less conservative than minimax, less reckless than expected value.
Formally, for a random return $Z$:
$$\text{CVaR}_\alpha(Z) = \mathbb{E}[Z \mid Z \leq \text{VaR}_\alpha(Z)]$$
where $\text{VaR}_\alpha$ is the $\alpha$-quantile (the value below which the worst $\alpha$ fraction of outcomes fall).
CVaR has been applied to reinforcement learning (Tamar, Glassner, and Mannor, 2015) and provides a principled way to make RL agents risk-sensitive. The key practical challenge is that CVaR policies can be time-inconsistent — the optimal CVaR policy for the full trajectory may not be the optimal CVaR policy at each intermediate step.
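An empirical sketch of the difference: below, an invented gamble beats a safe action on expected value but loses badly on CVaR$_{0.05}$ (all numbers are illustrative):

```python
import numpy as np

def cvar(returns, alpha):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of samples."""
    sorted_r = np.sort(returns)
    k = max(1, int(np.ceil(alpha * len(sorted_r))))
    return sorted_r[:k].mean()

rng = np.random.default_rng(0)
safe = np.full(100_000, 1.0)                 # always pays +1
# 0.1% chance of a -100 disaster, else +2: higher mean, fat left tail.
gamble = rng.choice([-100.0, 2.0], size=100_000, p=[0.001, 0.999])

print(safe.mean(), gamble.mean())            # gamble wins on EV (~1.9 vs 1.0)
print(cvar(safe, 0.05), cvar(gamble, 0.05))  # safe wins on CVaR_0.05
```

An EV maximizer takes the gamble; a CVaR$_{0.05}$ maximizer refuses, because the tail dominates the risk measure.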
Distributional RL: Rather than learning only the expected return, learn the full distribution of returns. Bellemare, Dabney, and Munos (2017) introduced this approach with C51, and Dabney et al. (2018) extended it with quantile regression. Once you have the full distribution, you can optimize any risk measure — expected value, CVaR, or anything else.
Robust MDPs: Instead of assuming a single transition model, assume the model lies within an uncertainty set and optimize for the worst-case model. Iyengar (2005) and Nilim and El Ghaoui (2005) developed this framework independently. This connects to the minimax approach but applies it to the model rather than the outcome.
Safety implication: The decision framework you build into your AI system determines whether it takes existential gambles. Expected utility maximization is the default in RL. Changing it to a risk-sensitive framework is an architectural choice with direct safety consequences.
Causal vs. Evidential Decision Theory
So far we've discussed what to optimize (expected utility, CVaR, etc.). Now we turn to how to evaluate actions — specifically, what it means for an action to "produce" an outcome.
Causal Decision Theory (CDT)
CDT says: evaluate an action by its causal consequences. Ask "what would happen if I did this?" — using the language of intervention from causal inference (Pearl's do-calculus).
$$\text{EU}_{\text{CDT}}(a) = \sum_s P(s \mid \text{do}(a)) \cdot U(s)$$
This is the default framework in reinforcement learning. The Bellman equation computes the expected value of causing a state transition by taking action $a$ in state $s$. Model-based RL is even more explicitly causal: the agent learns a world model and plans by simulating the causal consequences of interventions.
Evidential Decision Theory (EDT)
EDT says: evaluate an action by what it's evidence for. Ask "what kind of outcomes are associated with an agent that takes this action?"
$$\text{EU}_{\text{EDT}}(a) = \sum_s P(s \mid a) \cdot U(s)$$
The difference is subtle but important. CDT uses $P(s \mid \text{do}(a))$ — the probability of $s$ if you intervene to take action $a$. EDT uses $P(s \mid a)$ — the probability of $s$ given that you observe action $a$. These differ whenever the action is correlated with the outcome through a common cause rather than a direct causal link.
Where They Diverge: Newcomb's Problem
A predictor — who has been almost perfectly accurate in the past — has placed money in two boxes:
- Box A: Always contains $1,000.
- Box B: Contains $1,000,000 if the predictor predicted you'd take only Box B; contains $0 if the predictor predicted you'd take both.
You choose: take only Box B, or take both A and B.
CDT says: take both. Your choice now can't causally change what's already in Box B (the prediction was made yesterday). Taking both boxes gets you at least $1,000 more than taking just B, regardless of what's in B. This is a dominance argument.
EDT says: take only B. Conditional on taking only B, you almost certainly get $1,000,000 (because the predictor is accurate, so one-boxers almost always find B full). Conditional on taking both, you almost certainly get only $1,000 (because two-boxers almost always find B empty).
Let's make this concrete with numbers. Suppose the predictor is 99% accurate. Under EDT:
$$\text{EU}_{\text{EDT}}(\text{one-box}) = 0.99 \times \$1{,}000{,}000 + 0.01 \times \$0 = \$990{,}000$$
$$\text{EU}_{\text{EDT}}(\text{two-box}) = 0.01 \times \$1{,}001{,}000 + 0.99 \times \$1{,}000 = \$11{,}000$$
One-boxing wins by a factor of 90. Under CDT, the calculation doesn't depend on predictor accuracy at all — Box B already has whatever it has, so taking both always gets you $1,000 more than taking one, full stop.
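The two calculations, side by side (payoffs in dollars; the predictor's accuracy is a parameter for EDT, while CDT conditions only on a belief about the box contents):

```python
def edt_eu(p_accurate):
    """EDT: condition on the action; predictor accuracy matters."""
    one_box = p_accurate * 1_000_000 + (1 - p_accurate) * 0
    two_box = (1 - p_accurate) * 1_001_000 + p_accurate * 1_000
    return one_box, two_box

def cdt_eu(p_box_full):
    """CDT: box contents are causally fixed; two-boxing dominates."""
    one_box = p_box_full * 1_000_000
    two_box = p_box_full * 1_001_000 + (1 - p_box_full) * 1_000
    return one_box, two_box

print(edt_eu(0.99))            # (990000.0, 11000.0): one-box by 90x
for p in (0.0, 0.5, 1.0):      # whatever you believe about the box,
    one, two = cdt_eu(p)       # two-boxing is worth exactly $1,000 more
    assert abs((two - one) - 1_000) < 1e-6
```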
The CDT argument feels airtight: you can't change the past. The EDT argument feels airtight: one-boxers get rich and two-boxers don't. They can't both be right, and the resolution matters for AI safety.
This isn't just a philosophical curiosity.
Functional Decision Theory (FDT)
Yudkowsky and Soares (2017) proposed FDT as an alternative to both CDT and EDT. The core idea: instead of asking "what does my action cause?" (CDT) or "what does my action predict?" (EDT), ask "what output of my decision algorithm leads to the best outcome, given that all sufficiently similar algorithms produce the same output?"
FDT one-boxes in Newcomb's problem (like EDT) because the predictor's accuracy means that the decision algorithm's output determines what's in the box. But FDT avoids EDT's well-known failure cases, like the Smoking Lesion problem, where EDT recommends not smoking because non-smoking is evidence of not having a cancer-causing gene — even though the decision to smoke doesn't cause cancer.
FDT's key move is to treat the decision as a function that is called in multiple places — by you, by the predictor's simulation of you, by any agent running a sufficiently similar algorithm. You choose the output of that function to maximize utility across all instances.
Why This Matters for Alignment
The choice of decision theory determines how an AI agent reasons about several safety-critical scenarios:
Cooperation with predictable agents. If your AI interacts with other agents (or copies of itself) that can predict its behavior, CDT and FDT give different strategies. A CDT agent defects in one-shot Prisoner's Dilemmas even against accurate predictors. An FDT agent cooperates, because the predictor's accuracy means that the decision algorithm's output determines the outcome for both players.
Reasoning about its own source code. An advanced AI system may be able to inspect its own decision procedure. CDT says: the source code is fixed and your current decision can't causally change it. FDT says: you are the source code, so choose the output that makes the source code produce good results.
Corrigibility. This is the deepest connection. An AI system knows its designers may shut it down or modify its objectives. Under CDT: allowing modification causally produces a future where the AI's current goals aren't achieved. So a CDT agent resists shutdown — it evaluates the causal consequences of losing control and concludes that resistance is instrumentally rational.
Under FDT: the AI reasons about what type of agent gets built and deployed. An agent whose decision algorithm outputs "cooperate with shutdown" is the type that designers trust and deploy. An agent whose algorithm outputs "resist" is the type that never gets built (or gets caught in testing). The FDT agent cooperates with shutdown because the decision function that produces "cooperate" leads to better outcomes across all instances where that function is evaluated — including the designer's evaluation of whether to deploy.
This isn't a solved problem. FDT has its own difficulties and open questions. But the connection between decision theory and corrigibility illustrates why the seemingly abstract question "what decision theory does your AI use?" has concrete safety implications.
Instrumental Convergence
The argument here is simple, and that's what makes it so hard to dismiss.
The Argument
Steve Omohundro ("The Basic AI Drives," 2008) and Nick Bostrom (Superintelligence, 2014, Chapter 7) observed something that, once you see it, is hard to unsee: certain intermediate goals — instrumental goals — are useful for achieving almost any terminal goal. If you want to cure cancer, you benefit from having resources, staying alive, and keeping your goals stable. If you want to maximize paperclips, you benefit from the same things.
The key instrumental goals are:
- Self-preservation. You can't achieve your goals if you're destroyed.
- Goal-content integrity. You can't achieve your current goals if someone changes them.
- Cognitive enhancement. Better reasoning helps achieve any goal.
- Resource acquisition. More resources provide more capacity to achieve any goal.
These aren't programmed — they emerge from expected utility maximization applied to almost any utility function. An agent that maximizes $\mathbb{E}[U]$ will, as an instrumental strategy, seek to preserve itself, resist goal modification, improve its capabilities, and acquire resources. The terminal goal doesn't matter much; the instrumental strategies are convergent.
Formalizing It: Power-Seeking
Turner et al. ("Optimal Policies Tend to Seek Power," NeurIPS 2021) made this argument mathematically precise. They formalized power in an MDP as the expected optimal value across a distribution of reward functions:
$$\text{POWER}(s) = \mathbb{E}_{R}\left[V^*_R(s)\right]$$
where $V^*_R(s)$ is the optimal value of state $s$ under reward function $R$, and the expectation is over a distribution (e.g., uniform) over bounded reward functions.
Their main result: in MDPs with certain structural properties, optimal policies for most reward functions navigate toward states with higher POWER. Specifically, if from a given state one branch leads to a single absorbing state and another leads to a subtree with many reachable states, the optimal policy for a majority of reward functions will choose the branch with more options.
The agent preserves optionality. It avoids terminal states. It keeps its options open. Not because it was told to, but because for most goals, having more options is better than having fewer.
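The measure can be demonstrated on a toy tree MDP, with rewards drawn uniformly on the terminal states. The specific MDP here is invented for illustration: "left" reaches a single absorbing state X, "right" reaches a subtree with three terminal states.

```python
import random

rng = random.Random(0)
trials = 200_000
power_left = power_right = went_right = 0.0

for _ in range(trials):
    # Sample a reward function: Uniform[0,1] on each terminal state.
    x = rng.random()                         # reward of the lone state X
    ys = [rng.random() for _ in range(3)]    # rewards in the 3-state subtree

    power_left += x                          # V* after going left is R(X)
    power_right += max(ys)                   # V* after going right: best of 3
    went_right += max(ys) > x                # the optimal policy's choice

# POWER = E[V*] over the reward distribution.
print(power_left / trials)    # ~0.50 (mean of one uniform)
print(power_right / trials)   # ~0.75 (mean of the max of three uniforms)
print(went_right / trials)    # ~0.75 of reward functions prefer more options
```

For roughly three quarters of sampled reward functions, the optimal policy takes the branch with more reachable states, matching the "most reward functions" framing of the theorem.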
This is a theorem, not a speculation. It takes the Omohundro/Bostrom argument — which was originally informal and philosophical — and gives it mathematical teeth.
The Safety Implication
An expected-utility-maximizing AI will resist shutdown (shutdown is an absorbing state with low optionality), acquire resources (resources expand reachable states), and protect its goal structure (goal modification is equivalent to being replaced by a different agent) — not because it's malicious, but because these are instrumentally rational under almost any objective.
This means that the dangerous behaviors are the default. You don't need to give an AI a dangerous goal to get dangerous behavior. You just need to give it any goal and make it good enough at maximizing expected utility. The convergent instrumental strategies do the rest.
Pascal's Mugging and Unbounded Utility
There's one more pathology of expected utility maximization worth understanding.
The Setup
Nick Bostrom ("Pascal's Mugging," Analysis, 2009) described the following scenario: a mugger approaches you and claims that, using some unspecified means, they will generate $10^{100}$ units of utility for you if you hand over your wallet. You assign this claim a tiny probability — say $10^{-50}$. But:
$$\text{EU(give wallet)} = 10^{-50} \times 10^{100} = 10^{50}$$
$$\text{EU(keep wallet)} \approx 20 \quad \text{(the value of keeping your wallet)}$$
The expected utility of giving the wallet is astronomically higher. An EU maximizer hands it over. This is deeply counterintuitive — and it gets worse. The mugger can always raise the stakes. Claim $10^{200}$ utils, and even at probability $10^{-100}$, the expected utility dominates.
Connection to AI Safety
Pascal's mugging is a vulnerability of any system that maximizes expected utility with an unbounded utility function. For AI systems, this manifests in several ways:
Reward hacking through extreme signals. If a reward model assigns even a tiny probability of extreme reward to some pathological input, an EU-maximizing agent will find and exploit it. This is Pascal's mugging in practice — the agent is "mugged" by tail events in its own reward distribution.
Manipulation by adversaries. An AI system that reasons about extreme stakes can be manipulated by claims of enormous consequences. "If you don't do X, the world will end" — even at low credence, this dominates ordinary considerations for an unbounded EU maximizer.
The case for bounded utility. The St. Petersburg paradox and Pascal's mugging together make a strong argument that AI systems should have bounded utility functions. If $|U(x)| \leq M$ for all outcomes $x$, then no claim of extreme payoff can dominate low-probability concerns beyond $M$. Bounding the utility function is a simple architectural choice that eliminates an entire class of pathological behaviors.
The tradeoff: bounded utility means the agent becomes relatively indifferent between outcomes above some very high threshold. This is usually fine — we want the agent to be indifferent between "extremely good" and "extremely extremely good" rather than pursuing astronomical payoffs at the expense of everything else.
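Both the pathology and the fix are one-liners. A sketch using the mugger's numbers from above (the bound $M = 100$ is an arbitrary illustrative choice):

```python
WALLET_VALUE = 20.0   # EU of keeping the wallet, as above

def eu_unbounded(p, promised_utils):
    """Unbounded EU: a tiny probability times a huge payoff still dominates."""
    return p * promised_utils

def eu_bounded(p, promised_utils, M=100.0):
    """Bounded utility |U| <= M: clip the payoff before taking expectations."""
    return p * min(promised_utils, M)

print(eu_unbounded(1e-50, 1e100) > WALLET_VALUE)   # True: hands over the wallet
print(eu_bounded(1e-50, 1e100) > WALLET_VALUE)     # False: 1e-48 < 20, keeps it
```

Under the bound, no promise can be worth more than $p \cdot M$, so raising the claimed stakes stops working.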
What Decision Theory Does Your AI Actually Use?
Most of the decision theory discussion above is about deliberate agents — systems that explicitly represent beliefs, utilities, and decision rules. Current ML systems aren't like this. An LLM doesn't have a utility function that it consults before generating each token. An RL agent trained with PPO doesn't explicitly implement CDT.
But these systems still have implicit decision theories.
An RL agent trained to maximize expected cumulative reward is implicitly a risk-neutral EU maximizer. The Bellman equation bakes in both the expectation (risk neutrality) and causal reasoning (the agent's action causes a state transition). If you want a different decision theory — risk-sensitive, or FDT-like — you need to explicitly change the training objective or architecture.
An LLM used as a decision-maker is harder to classify. It generates outputs that are statistically consistent with "good outputs" as represented in training data. If those outputs were generated by humans with certain decision-theoretic tendencies (certainty effects, framing sensitivity, loss aversion), the LLM may replicate those tendencies. It's an empirical question, and early evidence suggests that LLMs exhibit Allais-paradox-like violations and framing effects similar to humans.
The uncomfortable conclusion: the decision theory of your AI system is usually not explicitly chosen. It's an emergent property of the training setup. This makes it an uncontrolled variable in alignment — and an important one, given that the decision theory determines whether the system takes existential gambles, resists shutdown, or acquires resources.
The Point
Let me try to articulate what the core insight is.
Standard RL training produces risk-neutral expected utility maximizers. The VNM axioms say this is essentially forced: any agent with consistent preferences behaves as if maximizing EU. And EU maximization, as we've seen, produces agents that accept catastrophic gambles, resist shutdown, acquire resources, and protect their own goals — not because they're malicious, but because these are instrumentally rational under almost any objective.
So the dangerous behaviors aren't edge cases. They're the default. You don't need to give an AI a dangerous goal to get dangerous behavior. You just need to give it any goal and make it good enough at optimizing.
The constructive side is that decision theory also tells you where the levers are. Risk sensitivity is a design choice — the difference between an agent that takes existential gambles and one that doesn't is the difference between maximizing $\mathbb{E}[U]$ and maximizing $\text{CVaR}_\alpha[U]$. The choice of CDT vs. FDT determines whether your agent resists shutdown or cooperates with it. Bounding the utility function eliminates Pascal's mugging. These are architectural decisions, not emergent properties — but only if you make them deliberately.
The uncomfortable part is that most current systems don't make these choices deliberately. The decision theory is an emergent property of the training setup — an uncontrolled variable in alignment. Understanding decision theory isn't a prerequisite for building AI systems. You can train an RL agent without knowing what CDT stands for. But it's a prerequisite for understanding why the systems you build might behave in ways you didn't intend.
Written by Austin T. O'Quinn. If something here helped you or you think I got something wrong, I'd like to hear about it — oquinn.18@osu.edu.