Why Game Theory Matters for AI Safety

Most AI safety problems are not single-agent optimization problems. They're multi-agent strategic interactions:

  • Nations deciding whether to regulate (→ see competitive-dynamics post)
  • Companies deciding how much to invest in safety
  • Models interacting with users, other models, and themselves across time
  • Researchers deciding what to publish

Game theory is the math of strategic interaction. If you don't know it, you can't reason precisely about any of these.

The Basics

Players, Strategies, Payoffs

A game has:

  • Players: The agents making decisions (nations, companies, models)
  • Strategies: The choices available to each player
  • Payoffs: What each player gets for each combination of choices

The payoff depends not just on your choice but on everyone else's — that's what makes it a game, not an optimization problem.
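These three ingredients fit in a few lines of code. A minimal sketch in Python, using the Prisoner's Dilemma payoffs from the next section; the dict encoding and the `payoff` helper are just one convenient choice, not a standard API:

```python
# A two-player game as pure data: strategies plus a payoff table.
# Payoff numbers match the Prisoner's Dilemma matrix below.

STRATEGIES = ["C", "D"]  # Cooperate, Defect

# PAYOFFS[(row_choice, col_choice)] = (row_payoff, col_payoff)
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def payoff(row_choice, col_choice):
    """Each player's payoff depends on BOTH choices, not just their own."""
    return PAYOFFS[(row_choice, col_choice)]
```

Note that there is no way to evaluate a single player's choice in isolation: `payoff` always needs both arguments. That is the formal content of "it's a game, not an optimization problem."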

Normal Form (Matrix Games)

Two players, finite strategies. Write payoffs in a matrix:

$$\begin{array}{c|cc} & C & D \\ \hline C & (3,3) & (0,5) \\ D & (5,0) & (1,1) \end{array}$$

This is the Prisoner's Dilemma. Each player chooses Cooperate (C) or Defect (D). Mutual cooperation is best collectively (3,3) but defection is individually rational regardless of what the other does.

Nash Equilibrium

A strategy profile (one strategy per player) in which no player can improve their payoff by unilaterally changing their own strategy. In the Prisoner's Dilemma, (D,D) is the unique Nash equilibrium — even though (C,C) is better for everyone.

Key insight for safety: Nash equilibria can be collectively terrible. "Everyone defects on safety" can be a stable equilibrium even when everyone would prefer mutual cooperation.
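The unilateral-deviation definition translates directly into a brute-force check. A sketch, not a general solver: `pure_nash` is a hypothetical helper name, and the payoffs are the Prisoner's Dilemma matrix from above:

```python
from itertools import product

# Brute-force search for pure-strategy Nash equilibria in a
# two-player matrix game (Prisoner's Dilemma payoffs from above).

STRATEGIES = ["C", "D"]
PAYOFFS = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def pure_nash(payoffs, strategies):
    """A profile is Nash iff no player gains by a unilateral deviation."""
    equilibria = []
    for row, col in product(strategies, repeat=2):
        row_pay, col_pay = payoffs[(row, col)]
        row_ok = all(payoffs[(r, col)][0] <= row_pay for r in strategies)
        col_ok = all(payoffs[(row, c)][1] <= col_pay for c in strategies)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria

print(pure_nash(PAYOFFS, STRATEGIES))  # [('D', 'D')] — the unique equilibrium
```

Running the same check on the other matrices in this post reproduces their equilibrium structure: one equilibrium for the Prisoner's Dilemma, two each for the Stag Hunt and Chicken.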

Games That Show Up Everywhere in AI Safety

The Prisoner's Dilemma (Safety Investment)

Two AI companies:

  • Both invest in safety: slower but safer industry (3,3)
  • Both skip safety: fast but dangerous (1,1)
  • One invests, one skips: the skipper gains market advantage (investor gets 0, skipper gets 5)

Defection dominates. This is why competitive dynamics kill safety (→ see competitive-dynamics post). The structure of the game, not the players' values, drives the outcome.

The Stag Hunt (Coordination)

$$\begin{array}{c|cc} & \text{Stag} & \text{Hare} \\ \hline \text{Stag} & (4,4) & (0,3) \\ \text{Hare} & (3,0) & (3,3) \end{array}$$

Two equilibria: (Stag, Stag) is better, but (Hare, Hare) is safer. If you're not sure the other player will cooperate, hunting hare guarantees a decent payoff.

AI safety analog: International AI agreements. If everyone coordinates on safety standards (stag), the outcome is best. But if you commit to standards and others don't, you lose. The safe individual choice is to go it alone (hare).

The difference from the Prisoner's Dilemma: cooperation is an equilibrium here — but it requires trust that others will also cooperate. The problem is coordination, not incentives.
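The trust requirement can be made quantitative. A small sketch using the Stag Hunt payoffs above: if you believe the other player hunts stag with probability p, compare expected payoffs (the 0.75 threshold is specific to this matrix; the helper name is mine):

```python
# Expected payoffs in the Stag Hunt, given belief p that the
# other player hunts stag. Payoff numbers are the matrix above.

def expected_payoffs(p):
    stag = 4 * p + 0 * (1 - p)   # 4 if they join you, 0 if they don't
    hare = 3 * p + 3 * (1 - p)   # 3 regardless of what they do
    return stag, hare

# Stag only beats hare when 4p > 3, i.e. when trust exceeds 75%:
assert expected_payoffs(0.8)[0] > expected_payoffs(0.8)[1]
assert expected_payoffs(0.7)[0] < expected_payoffs(0.7)[1]
```

So cooperating on standards is only individually rational here when you assign at least 75% probability to everyone else cooperating too — a concrete measure of how much trust the coordination problem demands.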

Chicken (Escalation)

$$\begin{array}{c|cc} & \text{Swerve} & \text{Straight} \\ \hline \text{Swerve} & (3,3) & (1,5) \\ \text{Straight} & (5,1) & (0,0) \end{array}$$

Two equilibria: one player swerves, the other goes straight. Mutual straight = catastrophe.

AI safety analog: The race to AGI. Each company wants to be the one that goes straight (full speed) while the other swerves (invests in safety). If both go straight — unsafe AGI deployed by competitors racing to the bottom. Mutual catastrophe.

Tragedy of the Commons

Not a two-player game — an n-player game where each actor's individually rational exploitation of a shared resource degrades it for everyone. The "resource" in AI safety is the shared information environment, public trust in AI, or the stability of the global order.

Repeated Games and the Shadow of the Future

One-shot games often have bad equilibria. But if players interact repeatedly, cooperation can emerge:

  • Tit-for-tat: Cooperate first, then mirror what the other player did. Works well in iterated Prisoner's Dilemma.
  • The shadow of the future: If you'll interact again, defecting today has future costs. The longer the shadow, the more cooperation is sustainable.
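Both points can be seen in a few lines. A minimal iterated Prisoner's Dilemma sketch, with the payoff matrix from above; the strategy functions are illustrative, not a standard library:

```python
# Iterated Prisoner's Dilemma: tit-for-tat vs. itself and vs. always-defect.

PAYOFFS = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def tit_for_tat(my_history, their_history):
    # Cooperate first, then mirror the opponent's last move.
    return "C" if not their_history else their_history[-1]

def always_defect(my_history, their_history):
    return "D"

def play(strat_a, strat_b, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a = strat_a(hist_a, hist_b)
        b = strat_b(hist_b, hist_a)
        pa, pb = PAYOFFS[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # (300, 300): sustained cooperation
print(play(tit_for_tat, always_defect))  # (99, 104): exploited once, then mutual defection
```

Over 100 rounds, mutual tit-for-tat earns 300 each, while always-defect grinds out roughly 1 per round — the one-shot logic of "defection dominates" stops paying once the future casts a long enough shadow.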

AI safety implication: Short-term competitive dynamics (one-shot thinking) favor defection. Building long-term relationships between AI labs — shared safety research, mutual auditing — extends the shadow of the future and makes cooperation more rational.

But: if one player achieves decisive advantage (AGI), the game ends. The "shadow of the future" collapses, and defection becomes rational again. This is why AGI as a discontinuity is so dangerous from a game-theoretic perspective.

Mechanism Design: Engineering Better Games

If the game structure produces bad equilibria, change the game.

Mechanism design is the inverse of game theory: instead of analyzing a given game, you design the rules so that the equilibrium you want is the one players reach.

  • Liability laws: Make defection (cutting safety corners) expensive. If the expected cost of an AI accident exceeds the profit from shipping fast, the equilibrium shifts.
  • Verification and transparency: Reduce information asymmetry. If companies can verify each other's safety investments, the stag hunt becomes easier to coordinate.
  • Subsidies for safety research: Reduce the cost of cooperation. If safety is cheap, the payoff matrix changes.
  • International treaties with enforcement: Change the game from one-shot to repeated with punishment for defection.
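The liability idea is easy to make concrete. A sketch, assuming a simple model where defecting incurs a flat liability cost L on top of the safety-investment payoffs above (`with_liability` and `is_nash` are my own helper names):

```python
# Mechanism design in miniature: add a liability cost L to any player
# who defects, then check whether mutual safety investment (C, C)
# becomes a Nash equilibrium. Base payoffs are the matrix from above.

BASE = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def with_liability(L):
    """Subtract L from a player's payoff whenever they play D."""
    return {
        (r, c): (p1 - (L if r == "D" else 0), p2 - (L if c == "D" else 0))
        for (r, c), (p1, p2) in BASE.items()
    }

def is_nash(payoffs, profile):
    """True iff neither player gains by a unilateral deviation."""
    row, col = profile
    rp, cp = payoffs[(row, col)]
    return (all(payoffs[(r, col)][0] <= rp for r in "CD")
            and all(payoffs[(row, c)][1] <= cp for c in "CD"))

assert not is_nash(with_liability(0), ("C", "C"))  # no penalty: defection pays
assert is_nash(with_liability(2), ("C", "C"))      # L >= 2 makes (C, C) stable
```

In this toy model, a liability of at least 2 (the gap between the defector's 5 and the cooperator's 3) is exactly what it takes to flip the equilibrium — the players' incentives haven't changed, only the rules have.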

The goal isn't to make players virtuous — it's to make the structure of the interaction produce good outcomes even with self-interested players.

Connection to AI Agents

Everything above applies to human players. But as AI systems become agents — making decisions, interacting with other agents — game theory applies to them directly:

  • Multi-agent RL is game theory with learned strategies
  • AI agents negotiating with each other face the same dilemmas
  • An AI that understands game theory can strategically manipulate interactions (→ connects to mesa-optimization)

This is why understanding these dynamics isn't optional for safety — it's foundational.