This post builds up the game theory you need to reason precisely about AI safety dynamics — races, coordination failures, and regulatory design. If you already know Nash equilibria, you can skip to Games That Show Up in AI Safety. If you're here for solutions, skip to Mechanism Design. But I'd recommend reading the whole thing, because the point isn't any single definition — it's seeing why the structure of the interaction, not the intentions of the players, drives the outcome.
What Makes Something a Game?
Most of the math you encounter in machine learning is optimization: one agent, one objective, minimize or maximize. Gradient descent, reward maximization, loss minimization — these are all single-agent problems. You choose your action, and the outcome depends only on what you did and what the environment does. The environment doesn't strategize against you.
A game is different. There are multiple agents, and the outcome for each agent depends on what everyone does. Your best choice depends on what the other players choose, and their best choice depends on yours. This circular dependency is what makes strategic interaction fundamentally harder than optimization — and it's why good intentions can produce terrible outcomes.
Here's the basic setup. A game has:
- Players: The agents making decisions. These could be companies, nations, researchers, or AI systems.
- Strategies: The choices available to each player.
- Payoffs: What each player receives for each combination of choices. The payoff to player $i$ depends not just on player $i$'s strategy but on the strategy profile of all players.
Formally, an $n$-player game in normal form is a tuple $(\mathcal{N}, \{S_i\}_{i \in \mathcal{N}}, \{u_i\}_{i \in \mathcal{N}})$ where $\mathcal{N} = \{1, \ldots, n\}$ is the set of players, $S_i$ is the strategy set for player $i$, and $u_i : S_1 \times \cdots \times S_n \to \mathbb{R}$ is the payoff function for player $i$.
The key thing to notice: $u_i$ takes the entire strategy profile as input. This is what distinguishes a game from an optimization problem. In optimization, the objective depends only on your own decision variable. In a game, your payoff is a function of everyone's decisions simultaneously.
Two Players, Two Choices: Matrix Games
The simplest interesting games have two players, each with two strategies. We can write the payoffs in a matrix where rows are Player 1's strategies, columns are Player 2's, and each cell contains a pair $(a, b)$ — Player 1 gets $a$, Player 2 gets $b$.
Let's start with a simple example that has nothing to do with AI. Two friends are choosing where to meet for lunch. Each can go to Restaurant A or Restaurant B. They'd rather be at the same restaurant than eat alone, but Player 1 prefers A and Player 2 prefers B:
$$\begin{array}{c|cc} & A & B \\ \hline A & (3,2) & (0,0) \\ B & (0,0) & (2,3) \end{array}$$
If they both go to A, Player 1 is happier (3 vs 2). If both go to B, Player 2 is happier. If they go to different restaurants, they both get 0. This is a coordination game — the main challenge isn't conflicting interests, it's coordinating on one of two good outcomes.
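To make the formalism concrete, here's the lunch game as plain data — a minimal Python sketch (the dict representation and function names are mine, not from any game theory library):

```python
# The lunch coordination game as a normal-form payoff table.
# Keys are strategy profiles (Player 1's choice, Player 2's choice);
# values are the payoff pairs (u1, u2) from the matrix above.
LUNCH = {
    ("A", "A"): (3, 2),
    ("A", "B"): (0, 0),
    ("B", "A"): (0, 0),
    ("B", "B"): (2, 3),
}

def payoff(game, player, profile):
    """Payoff to `player` (0 or 1). Note the signature: the input is the
    whole strategy profile -- your payoff depends on everyone's choice,
    not just your own. That is exactly what makes this a game rather
    than an optimization problem."""
    return game[profile][player]

print(payoff(LUNCH, 0, ("A", "A")))  # -> 3
print(payoff(LUNCH, 1, ("A", "A")))  # -> 2
```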
Dominant Strategies
Sometimes you don't need to think about what the other player will do. A strategy is dominant if it gives a higher payoff than any alternative, regardless of what the other player chooses.
Consider this game:
$$\begin{array}{c|cc} & C & D \\ \hline C & (3,3) & (0,5) \\ D & (5,0) & (1,1) \end{array}$$
Look at it from Player 1's perspective. If Player 2 plays C, Player 1 gets 3 from C but 5 from D — so D is better. If Player 2 plays D, Player 1 gets 0 from C but 1 from D — D is still better. No matter what Player 2 does, D gives Player 1 a higher payoff. D is dominant for Player 1.
The same logic applies to Player 2 (the game is symmetric). So both players choose D, and the outcome is $(1,1)$.
But look at the matrix: $(3,3)$ is available if they both play C. Both players would prefer mutual cooperation to mutual defection. The dominant strategy leads to an outcome that both players agree is worse than an available alternative.
This is the Prisoner's Dilemma. Your first instinct might be: "The players should just agree to cooperate." That instinct is natural and misses the point. The PD isn't about what the players should do. It's about what the structure of the game forces them to do. The structure punishes unilateral cooperation and rewards defection regardless of what the other player chooses. This distinction — between what's collectively optimal and what's individually rational — turns out to be the single most important idea for understanding AI safety dynamics.
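The dominance argument above is mechanical enough to check in code. A hedged sketch, using the same payoff-dict representation as before (the helper is mine; for this symmetric game it suffices to check the row player):

```python
PD = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
      ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def is_dominant(game, strategy, strategies):
    """True if `strategy` strictly beats every alternative for the row
    player, no matter what the column player does."""
    for alt in strategies:
        if alt == strategy:
            continue
        for opp in strategies:
            if game[(strategy, opp)][0] <= game[(alt, opp)][0]:
                return False
    return True

print(is_dominant(PD, "D", ["C", "D"]))  # True: D beats C against both C and D
print(is_dominant(PD, "C", ["C", "D"]))  # False
```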
Nash Equilibrium
Not every game has dominant strategies. The Prisoner's Dilemma is special because the answer is obvious (D dominates) even though the outcome is bad. Most games aren't that clean. We need a more general solution concept.
A Nash equilibrium is a strategy profile where no player can improve their payoff by unilaterally changing their strategy. If I told you what everyone else is playing, you'd have no incentive to deviate. Formally, a strategy profile $(s_1^*, \ldots, s_n^*)$ is a Nash equilibrium if for every player $i$ and every alternative strategy $s_i \in S_i$:
$$u_i(s_i^*, s_{-i}^*) \geq u_i(s_i, s_{-i}^*)$$
where $s_{-i}^*$ denotes the strategies of all players except $i$.
In the Prisoner's Dilemma, $(D, D)$ is the unique Nash equilibrium. Neither player can improve by switching to C — they'd go from payoff 1 to payoff 0. The fact that $(C, C)$ would be better for both is irrelevant to the equilibrium concept. Nash equilibrium asks "can anyone improve by deviating alone?" not "can everyone improve by coordinating?"
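The "no profitable unilateral deviation" definition translates directly into a brute-force search over pure strategy profiles. A minimal sketch (my own helper, not a library function), run on both games seen so far:

```python
from itertools import product

def pure_nash_equilibria(game, strategies):
    """Enumerate pure-strategy Nash equilibria of a two-player game
    given as {(s1, s2): (u1, u2)}."""
    equilibria = []
    for s1, s2 in product(strategies, repeat=2):
        u1, u2 = game[(s1, s2)]
        # Can either player gain by deviating alone?
        p1_ok = all(game[(d, s2)][0] <= u1 for d in strategies)
        p2_ok = all(game[(s1, d)][1] <= u2 for d in strategies)
        if p1_ok and p2_ok:
            equilibria.append((s1, s2))
    return equilibria

PD = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
      ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
LUNCH = {("A", "A"): (3, 2), ("A", "B"): (0, 0),
         ("B", "A"): (0, 0), ("B", "B"): (2, 3)}

print(pure_nash_equilibria(PD, ["C", "D"]))     # [('D', 'D')]
print(pure_nash_equilibria(LUNCH, ["A", "B"]))  # [('A', 'A'), ('B', 'B')]
```

Note that $(C, C)$ never appears, even though both players prefer it to $(D, D)$ — the equilibrium check only asks about unilateral deviations.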
This distinction matters enormously.
Nash's Theorem
John Nash proved (1950) that every finite game — finite players, finite strategies — has at least one Nash equilibrium, possibly in mixed strategies (where players randomize over their options). This is a foundational result: it guarantees that the equilibrium concept isn't vacuous. There's always at least one stable configuration.
Mixed strategies are worth working through concretely, because the result is counterintuitive. In the restaurant coordination game above, there are two pure equilibria: (A, A) and (B, B). But there's also a mixed equilibrium. Let Player 1 choose A with probability $p$ and Player 2 choose A with probability $q$. In a mixed equilibrium, each player must be indifferent between their options (otherwise they'd switch to the better one).
Player 1 is indifferent when their expected payoff from A equals their expected payoff from B:
$$3q + 0(1-q) = 0 \cdot q + 2(1-q)$$ $$3q = 2 - 2q$$ $$q = \frac{2}{5}$$
By symmetry (swapping the 3 and 2), $p = \frac{3}{5}$.
So in the mixed equilibrium, Player 1 goes to A 60% of the time and Player 2 goes to A 40% of the time. What's the expected payoff? Player 1 gets $3 \times \frac{3}{5} \times \frac{2}{5} + 2 \times \frac{2}{5} \times \frac{3}{5} = \frac{18}{25} + \frac{12}{25} = \frac{30}{25} = 1.2$.
Compare that to the pure equilibria: (A, A) gives Player 1 a payoff of 3, and (B, B) gives 2. The mixed equilibrium gives 1.2 — worse than either coordinated outcome. This isn't a fluke. Mixed equilibria in coordination games are often the worst outcomes. They represent the failure mode where neither player can predict the other, so they waste effort going to different restaurants.
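The indifference arithmetic is easy to verify exactly with rational numbers. A quick sketch checking the mixed-equilibrium payoff computed above:

```python
from fractions import Fraction

# Lunch game: (A,A) = (3,2), (B,B) = (2,3), miscoordination = (0,0).
q = Fraction(2, 5)  # P2 plays A with prob q, making P1 indifferent: 3q = 2(1-q)
p = Fraction(3, 5)  # P1 plays A with prob p, making P2 indifferent: 2p = 3(1-p)

# Player 1's expected payoff over the four outcomes (miscoordination pays 0)
u1 = p * q * 3 + (1 - p) * (1 - q) * 2
print(u1)  # 6/5, i.e. 1.2 -- worse than either pure equilibrium (3 or 2)

# Sanity check: P1 is indifferent between pure A and pure B against q
assert 3 * q == 2 * (1 - q) == u1
```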
This pattern — the mixed equilibrium being collectively terrible — shows up whenever coordination is the central problem.
Multiple Equilibria and the Selection Problem
Some games have multiple Nash equilibria, and the players may not agree on which one to play. This is the equilibrium selection problem, and it's one of the deepest difficulties in game theory. Having a good equilibrium available is useless if you can't coordinate on it.
Thomas Schelling introduced the concept of focal points (in The Strategy of Conflict, 1960) — equilibria that stand out due to cultural, contextual, or structural salience. If two people need to meet in New York but can't communicate, they might both go to Grand Central Station at noon. Not because it's strategically optimal, but because it's the "obvious" choice that both expect the other to find obvious.
This concept turns out to be directly relevant to AI governance. The compute thresholds in regulatory proposals (e.g., the $10^{26}$ FLOPs reporting threshold in Executive Order 14110, October 2023) function as Schelling points: somewhat arbitrary numbers, but their virtue is simplicity and observability. Everyone can agree on what the number means, even if they disagree on whether it's the right number. Schelling's insight was that in coordination games, simplicity and salience determine which equilibrium players reach — and a round number that everyone has heard of beats a carefully optimized threshold that no one remembers.
Games That Show Up in AI Safety
Now that we have the vocabulary, here's the key point: AI safety isn't facing one game-theoretic problem. It's facing several different ones simultaneously, and they require different solutions. Confusing them — treating a coordination problem like an incentive problem, or vice versa — leads to interventions that don't work. The point of formalizing these isn't pedantic precision. It's to see that the problems are structural and that correctly diagnosing the structure tells you what kind of intervention has a chance.
The Prisoner's Dilemma: Safety Investment
Two AI companies each decide how much to invest in safety. Investing is costly and slows development. Not investing lets you ship faster.
$$\begin{array}{c|cc} & \text{Invest} & \text{Skip} \\ \hline \text{Invest} & (3,3) & (0,5) \\ \text{Skip} & (5,0) & (1,1) \end{array}$$
If both invest in safety, the industry develops more slowly but more safely — a good collective outcome $(3,3)$. If one invests and the other skips, the skipper ships first, captures the market, and the safety-conscious company eats the cost for nothing $(0,5)$. If both skip, everyone races to the bottom $(1,1)$.
Skipping safety is a dominant strategy. This is why the race-to-the-bottom dynamic in AI isn't a failure of values — it's a failure of game structure. Even if every company genuinely wants to be safe, the incentive structure punishes unilateral safety investment.
This pattern extends beyond companies. Nations deciding whether to regulate AI face the same dilemma: unilateral regulation risks driving development to less regulated jurisdictions. Researchers deciding whether to publish dangerous capabilities face it too: withholding doesn't help if someone else publishes first.
Armstrong, Bostrom, and Shulman formalized this in "Racing to the Precipice" (2016). Their model shows that under competitive pressure, rational actors accept risks they know are excessive — because the cost of losing the race exceeds the expected cost of a safety failure discounted by its probability. The key result: in equilibrium, all participants take more risk than any of them would prefer, and the probability of catastrophe increases with the number of competitors and the magnitude of the winner's advantage.
For a deeper analysis of how this dynamic plays out across multiple levels — companies, nations, researchers, and open-source communities — see the competitive dynamics post.
The Stag Hunt: Coordination Under Uncertainty
$$\begin{array}{c|cc} & \text{Stag} & \text{Hare} \\ \hline \text{Stag} & (4,4) & (0,3) \\ \text{Hare} & (3,0) & (3,3) \end{array}$$
The Stag Hunt (from Rousseau) has a different structure from the Prisoner's Dilemma. There are two Nash equilibria: (Stag, Stag) and (Hare, Hare). Mutual stag-hunting is better for everyone, but it requires trust. If you hunt stag and the other player hunts hare, you get nothing. Hunting hare is safe — you get a guaranteed 3 regardless.
Notice the difference from the PD: cooperation (hunting stag) is an equilibrium here. The problem isn't that individual incentives make cooperation irrational — it's that cooperation is risky. If you're uncertain whether the other player will cooperate, the safe choice is to defect.
AI safety analog: international safety agreements. Suppose every major nation would benefit from coordinated AI safety standards. The globally optimal outcome is that everyone adopts them. But if your nation commits to safety standards and others don't, you've handicapped your AI industry for nothing. The safe choice is to go it alone.
The Stag Hunt is subtler than the Prisoner's Dilemma because it highlights a different failure mode: not misaligned incentives, but insufficient trust. The solution isn't to change the payoffs (as in the PD) but to build confidence that others will cooperate. Verification mechanisms, transparency commitments, and phased adoption all address this by reducing uncertainty about what other players will do.
Chicken: Brinkmanship and Escalation
$$\begin{array}{c|cc} & \text{Swerve} & \text{Straight} \\ \hline \text{Swerve} & (3,3) & (1,5) \\ \text{Straight} & (5,1) & (0,0) \end{array}$$
Two drivers race toward each other. Each can swerve or go straight. If both swerve, they're fine $(3,3)$. If one swerves and the other goes straight, the swerving driver is humiliated $(1,5)$. If neither swerves, both die $(0,0)$.
There are two pure Nash equilibria: (Swerve, Straight) and (Straight, Swerve). Each player wants to be the one who goes straight while the other swerves. But if both try to be the tough one, the result is catastrophic.
AI safety analog: the race to AGI. Multiple labs are pushing toward increasingly powerful systems. Each would prefer to be the one making the breakthrough while others proceed cautiously. But if everyone races at full speed with minimal safety investment, the result could be premature deployment of a system no one understands — the mutual crash.
What makes Chicken particularly dangerous is that commitment becomes a weapon. If one player can credibly commit to going straight — by publicly announcing they won't slow down, by tying their reputation to speed — the other player is forced to swerve. In the AI context, companies that publicly commit to aggressive timelines create pressure on competitors to either match or concede the field.
The mixed-strategy equilibrium in Chicken involves randomizing, which means crashes happen with positive probability. This is an uncomfortable but precise way to model situations where everyone is partially committed to risky strategies and no one fully backs down.
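For these specific payoffs, the crash probability is computable by the same indifference argument used in the coordination game. A sketch (my calculation, not from the text above):

```python
from fractions import Fraction

# Chicken, row player's payoffs when the opponent swerves with prob q:
#   u(Swerve)   = 3q + 1(1-q) = 1 + 2q
#   u(Straight) = 5q + 0(1-q) = 5q
# Indifference: 1 + 2q = 5q  =>  q = 1/3

q = Fraction(1, 3)  # each player swerves with prob 1/3 in the mixed equilibrium
assert 3 * q + 1 * (1 - q) == 5 * q  # both pure strategies earn the same

crash_prob = (1 - q) ** 2  # both players go straight
print(crash_prob)  # 4/9 -- with these payoffs, the crash is not a tail risk
```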
The Tragedy of the Commons
This is an $n$-player game where each actor's individually rational exploitation of a shared resource degrades it for everyone. The "resource" varies by context, and I think AI safety has at least three distinct commons being degraded simultaneously. Public trust in AI is one: each company that deploys a sloppy chatbot or a hallucination-prone search tool erodes confidence in AI systems generally, but the cost is shared across the industry while the deployment benefit is captured by the individual company. The information environment is another: each deployment of convincing AI-generated misinformation makes the entire ecosystem less trustworthy for everyone. And safety research itself is a commons — it benefits the whole field, but the company that does it bears the cost while competitors free-ride on the results.
Garrett Hardin's original framing (1968) emphasized that rational individual behavior leads to collective ruin when resources are shared and unregulated. The formal structure is an $n$-player Prisoner's Dilemma: each player's dominant strategy is to exploit, even though universal restraint would be better for everyone.
Elinor Ostrom (Nobel Prize, 2009) showed that real communities often solve commons problems through informal institutions, monitoring, and graduated sanctions — without either privatization or top-down regulation. Her work suggests that AI safety governance might be achievable through community-level coordination rather than requiring a global treaty.
The Frontier Model Forum (founded July 2023 by OpenAI, Anthropic, Google DeepMind, and Microsoft) is an attempt at exactly this kind of self-governance. But the game-theoretic assessment is sobering. In mechanism design terms, the Forum is a cheap talk institution — it enables communication and coordination but doesn't change payoffs. Voluntary commitments without third-party verification or penalties for defection are cheap talk by definition, even if made in good faith. The question is whether repeated interaction and reputational stakes create enough of a "shadow of the future" to make the cheap talk credible. As of early 2025, the Forum had produced shared evaluation work but no binding commitments.
One-Shot vs. Repeated Games
Everything above assumes the game is played once. This is often realistic for questions like "do we deploy AGI safely?" — you may only get one shot. But many AI safety interactions are repeated: companies compete quarter after quarter, nations negotiate year after year, researchers interact over careers.
The Shadow of the Future
In a one-shot Prisoner's Dilemma, defection is the unique Nash equilibrium. But in an infinitely repeated Prisoner's Dilemma, cooperation can be sustained as an equilibrium — provided the players care enough about the future.
The key parameter is the discount factor $\delta \in [0,1)$, which measures how much players value future payoffs relative to the present. The folk theorem (proved in various forms by Aumann and Shapley, Rubinstein, and Fudenberg and Maskin in the 1970s-80s) says: if $\delta$ is high enough, any individually rational payoff profile can be sustained as a Nash equilibrium of the repeated game.
For the Prisoner's Dilemma with payoffs as above, a strategy like tit-for-tat (cooperate first, then do whatever the other player did last round) sustains cooperation as an equilibrium when:
$$\delta \geq \frac{5 - 3}{5 - 1} = \frac{1}{2}$$
Let me walk through this to make sure the intuition is solid. Suppose both players are using tit-for-tat (cooperate first, then mirror). They cooperate forever and each get 3 per round, for a total discounted payoff of $\frac{3}{1-\delta}$.
Now suppose Player 1 considers defecting in round 1. They get 5 instead of 3 — a gain of 2. But Player 2 retaliates in round 2, and they end up in mutual defection getting 1 per round forever after. The total payoff from defecting is $5 + \frac{\delta}{1-\delta} \cdot 1$.
Cooperation is better when:
$$\frac{3}{1-\delta} \geq 5 + \frac{\delta}{1-\delta}$$
Working this out: $3 \geq 5(1-\delta) + \delta = 5 - 4\delta$, so $\delta \geq \frac{1}{2}$.
If $\delta = 0.9$ (players value the future highly), the cooperating payoff is $\frac{3}{0.1} = 30$ versus the defecting payoff of $5 + \frac{0.9}{0.1} = 14$. Cooperation wins easily. If $\delta = 0.3$ (short-term thinking), the cooperating payoff is $\frac{3}{0.7} \approx 4.3$ versus defecting at $5 + \frac{0.3}{0.7} \approx 5.4$. Defection wins. The shadow of the future is too short to sustain cooperation.
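The comparison generalizes into two one-line formulas. A minimal sketch reproducing the three cases above (function names are mine):

```python
def cooperate_value(R, delta):
    """Discounted value of mutual cooperation forever (R per round)."""
    return R / (1 - delta)

def defect_value(T, P, delta):
    """Defect once for T, then mutual punishment P forever after."""
    return T + delta * P / (1 - delta)

# PD payoffs from the text: R=3 (mutual C), T=5 (temptation), P=1 (mutual D)
for delta in (0.9, 0.5, 0.3):
    c, d = cooperate_value(3, delta), defect_value(5, 1, delta)
    print(f"delta={delta}: cooperate={c:.2f}, defect={d:.2f}")
# delta=0.9: cooperate=30.00, defect=14.00
# delta=0.5: cooperate=6.00, defect=6.00
# delta=0.3: cooperate=4.29, defect=5.43
```

At $\delta = 0.5$ the two values are exactly equal — the threshold derived above.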
Robert Axelrod's famous tournaments (reported in The Evolution of Cooperation, 1984) showed that tit-for-tat performed remarkably well across a wide range of strategies. It won by being nice (never defecting first), retaliatory (punishing defection immediately), forgiving (returning to cooperation if the other player does), and clear (its behavior was easy to predict).
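A toy Axelrod-style matchup is only a few lines, using the PD payoffs from this post (this is an illustrative sketch, not a reconstruction of Axelrod's tournament code):

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_history, opp_history):
    # Nice (cooperate first), retaliatory and forgiving (mirror last move)
    return "C" if not opp_history else opp_history[-1]

def always_defect(my_history, opp_history):
    return "D"

def play(strat1, strat2, rounds=10):
    h1, h2, score1, score2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = strat1(h1, h2), strat2(h2, h1)
        u1, u2 = PAYOFF[(m1, m2)]
        h1.append(m1); h2.append(m2)
        score1 += u1; score2 += u2
    return score1, score2

print(play(tit_for_tat, tit_for_tat))    # (30, 30): mutual cooperation throughout
print(play(tit_for_tat, always_defect))  # (9, 14): TFT loses round one, then matches
```

Against itself, tit-for-tat locks in cooperation; against a pure defector, it pays the sucker's payoff exactly once and then stops being exploitable.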
Why This Matters for AI Safety
Long-term lab relationships promote safety cooperation. If AI labs expect to compete for decades, the shadow of the future is long, and cooperative equilibria on safety are sustainable. Shared safety research, mutual auditing agreements, and transparent reporting of incidents all create the conditions for repeated-game cooperation.
But AGI is a potential game-ender. If one lab achieves a decisive capability advantage, the repeated game effectively ends — the dominant lab no longer needs cooperation, and the shadow of the future collapses. This is why the prospect of a discontinuous capability jump is so dangerous from a game-theoretic perspective: it destroys the conditions that sustain cooperation.
This also explains why many governance proposals focus on maintaining parity between major labs and nations. If no single actor can achieve dominance, the game remains repeated, and cooperative equilibria remain accessible. Arms control agreements during the Cold War served a similar function: by limiting first-strike capability, they kept the "game" going and made cooperation rational.
The Finite Horizon Problem
There's a subtlety worth noting. If the game is repeated a known, finite number of times, backward induction unravels cooperation completely. In the last round, there's no future to threaten, so both players defect. But if both will defect in the last round, there's no reason to cooperate in the second-to-last round. The logic cascades backward, and the unique Nash equilibrium is to defect in every round.
This backward induction argument creates an awkward theoretical prediction: finite repeated PDs should have no cooperation at all. Empirically, humans cooperate extensively in finite repeated PDs — the unraveling argument is too brittle.
The resolution involves incomplete information. If there's even a small probability that the other player uses a conditionally cooperative strategy (like tit-for-tat), cooperation can be sustained for most of the game, breaking down only near the end. Kreps, Milgrom, Roberts, and Wilson (1982) formalized this in a well-known result.
AI safety implication: Even if the "game" between AI labs has a finite horizon (regulation coming, AGI achieved, etc.), cooperation can still be sustained if each lab believes the others might be genuinely committed to safety, rather than merely signaling. Maintaining this uncertainty — keeping open the possibility of genuine cooperation — is itself valuable.
A Real Example: Responsible Scaling Policies
Anthropic published its Responsible Scaling Policy in September 2023 — the first explicit framework tying capability levels to safety requirements. OpenAI's Preparedness Framework followed in December 2023, and DeepMind's Frontier Safety Framework in May 2024.
This sequential adoption has the structure of tit-for-tat: one lab cooperates (publishes binding commitments), and others reciprocate. But the game-theoretic assessment is mixed. All commitments are self-assessed and self-enforced, with no external verification. In repeated-game terms, these are somewhere between genuine cooperation and cheap talk — they create internal organizational constraints that make defection costlier, but without third-party auditing, the constraints are only as strong as the organization's willingness to enforce them against itself.
Incomplete Information and Bayesian Games
In real strategic interactions, you often don't know the other player's payoffs. Is the other AI company genuinely committed to safety, or are they performing safety theater to buy time? Does a rival nation actually have advanced AI capabilities, or are they bluffing?
John Harsanyi (Nobel Prize, 1994) introduced Bayesian games to handle this. Each player has a type (drawn from a known probability distribution) that determines their payoffs. Players know their own type but not others'. The solution concept is Bayesian Nash equilibrium: a strategy for each type such that no player-type wants to deviate, given beliefs about others' types.
This formalism captures something important about AI governance: much of the difficulty comes from uncertainty about what other actors are actually doing. A Bayesian game perspective suggests that information-sharing mechanisms — audits, compute monitoring, standardized benchmarks — aren't just nice to have. They change the game by reducing type uncertainty, which can shift the equilibrium toward cooperation.
Mechanism Design: Engineering Better Games
Everything up to this point has been somewhat depressing. The Prisoner's Dilemma says cooperation is irrational. The Stag Hunt says cooperation requires trust you might not have. Chicken says brinkmanship invites catastrophe. The commons is exploited. If the analysis stopped here, the conclusion would be: we're stuck.
But it doesn't stop here, and this is the part of game theory I find most useful. If the game structure produces bad equilibria, you can change the game.
Mechanism design (sometimes called "reverse game theory") is the study of designing rules, institutions, and incentives so that the resulting game has a desirable equilibrium. Instead of taking the game as given and solving for player behavior, you choose the game to produce the behavior you want.
Leonid Hurwicz, Roger Myerson, and Eric Maskin received the 2007 Nobel Prize for their foundational work on mechanism design. The central question is: given that players have private information and self-interested motivations, can you design the rules of interaction so that the equilibrium outcome is efficient, fair, or satisfies some other desirable property?
The Revelation Principle
A key theoretical result is the revelation principle (Myerson, 1981): for any mechanism that achieves a given outcome in equilibrium, there exists a direct mechanism where players simply report their types truthfully and achieve the same outcome. This massively simplifies the design problem — instead of searching over all possible mechanisms, you can restrict attention to direct, truthful ones.
This sounds abstract, but it has a concrete implication for AI governance: if you're designing a regulatory system, you don't need to consider every possible scheme for eliciting information about AI capabilities. You can focus on systems where labs truthfully report their capabilities and the mechanism works correctly given truthful reports.
The catch is that truthful reporting needs to be incentive-compatible — each lab must find it in their interest to report honestly. This is the hard part.
Applying Mechanism Design to AI Safety
The game-theoretic problems in AI safety are structural: the Prisoner's Dilemma, the Stag Hunt, the commons problem. Mechanism design says: don't try to change the players' values — change the rules so that the equilibrium under self-interested play produces good outcomes.
Liability and penalties: If deploying an unsafe AI system carries large enough expected liability, the payoff matrix changes. Defection (skipping safety) becomes costly. This transforms a Prisoner's Dilemma into a game where cooperation is individually rational. The EU AI Act and similar regulatory frameworks attempt exactly this — attaching costs to unsafe deployment that shift the equilibrium.
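The payoff-shifting logic can be made concrete. A hedged sketch with an illustrative penalty $L$ subtracted whenever a player skips safety (the numbers and the helper are mine, chosen to match the safety-investment matrix earlier in the post):

```python
def safety_game(L):
    """The safety-investment PD, with liability L charged for skipping."""
    return {
        ("Invest", "Invest"): (3, 3),
        ("Invest", "Skip"):   (0, 5 - L),
        ("Skip",   "Invest"): (5 - L, 0),
        ("Skip",   "Skip"):   (1 - L, 1 - L),
    }

def pure_nash(game, strategies):
    """Pure-strategy Nash equilibria: no profitable unilateral deviation."""
    eqs = []
    for s1 in strategies:
        for s2 in strategies:
            u1, u2 = game[(s1, s2)]
            if all(game[(d, s2)][0] <= u1 for d in strategies) and \
               all(game[(s1, d)][1] <= u2 for d in strategies):
                eqs.append((s1, s2))
    return eqs

S = ["Invest", "Skip"]
print(pure_nash(safety_game(0), S))  # [('Skip', 'Skip')] -- the original dilemma
print(pure_nash(safety_game(3), S))  # [('Invest', 'Invest')] -- equilibrium flips
```

Nothing about the players changed — only the rules. That's the whole idea of mechanism design in one example.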
Verification and monitoring: Much of the difficulty in AI safety cooperation is a Stag Hunt problem — actors would cooperate if they trusted others to do the same. Verification mechanisms (compute monitoring, model evaluations, standardized audits) reduce uncertainty, converting the Stag Hunt into a game where cooperation is the obvious choice.
This connects to the arms control literature. The IAEA's role in nuclear nonproliferation is essentially mechanism design: inspections and monitoring change the game from one where nations must guess at each other's compliance to one where violations are detectable, making cooperation sustainable. Ho et al. ("International Institutions for Advanced AI," 2023) surveyed institutional models and proposed specific designs for AI governance drawing on the IAEA, CERN, IPCC, and ICAO as templates.
Subsidies for safety research: If the cost of safety investment is the barrier (it makes the "cooperate" payoff too low), subsidies can directly change the payoff matrix. Government funding for AI safety research reduces the cost of cooperation and widens the gap between "cooperate" and "defect" payoffs.
Compute governance: Compute is the most monitorable input to AI development — it's physical, energy-intensive, and produced by a concentrated supply chain. Sastry, Heim, and colleagues ("Computing Power and the Governance of Artificial Intelligence," 2024) identified three governance functions that compute enables: visibility (tracking who uses large clusters), allocation (influencing who gets compute and for what), and enforcement (conditioning access on compliance). Proposals like know-your-customer requirements for cloud compute and reporting thresholds for large training runs leverage this observability to create verification mechanisms. The logic is mechanism design: use the monitorable input as a lever to make the game structure more amenable to cooperation.
Limits of Mechanism Design
I don't want to oversell this. Mechanism design is powerful in theory but faces real constraints, and I think it's worth being honest about them.
The first is enforcement. A mechanism only works if players believe penalties will actually be imposed. The EU AI Act can fine companies — that's enforceable. An international treaty with no enforcement body is just a coordination device. International AI governance currently looks more like the latter than the former.
The second is information. Effective mechanisms require some ability to observe the relevant actions. If safety investments are unobservable — if you can't tell whether a company actually did the safety testing it claimed to — no mechanism can condition on them directly. This is why compute governance is so attractive: compute is observable in a way that "safety effort" isn't.
The third is participation. Players have to actually play the game. If a mechanism is too costly or restrictive, actors exit — moving development to unregulated jurisdictions or open-source ecosystems where the mechanism has no reach. This is the fundamental tension in AI governance: strict enough to matter, permissive enough that people don't route around it.
There are also theoretical limits. The Gibbard-Satterthwaite theorem says no voting mechanism is both strategy-proof and non-dictatorial for three or more alternatives. Perfect outcomes aren't always achievable even in principle. The question isn't whether the mechanism is perfect — it's whether it shifts the equilibrium enough to matter.
Games Between AI Systems
Everything above treats the players as humans, companies, or nations. But here's something I find increasingly important to think about: as AI systems become agents — making decisions, interacting with other AI systems, negotiating on behalf of users — game theory applies to them directly. The games aren't just metaphors anymore. They're literal.
When multiple RL agents learn simultaneously in a shared environment, they face game-theoretic dynamics. Each agent's "environment" includes the other agents, whose policies are also changing. This makes the learning problem non-stationary and creates challenges that don't arise in single-agent RL.
What I find striking about the empirical results is how closely they track classical theory — RL agents independently rediscover the dynamics that game theorists predicted decades ago.
Leibo et al. ("Multi-agent Reinforcement Learning in Sequential Social Dilemmas," AAMAS 2017, DeepMind) studied agents in grid-world environments with commons-dilemma payoff structures. In their "Gathering" environment, agents could cooperate (share resources) or defect (attack opponents to steal resources). The result was striking: agents learned to defect as resources became scarce. When apples were abundant, agents coexisted peacefully. When scarcity increased, they turned aggressive. More capable agents were more likely to defect, because they could get away with it. Scarcity drives defection, and capability amplifies it — exactly as game theory predicts, but demonstrated emergently in learned agents.
OpenAI's hide-and-seek work ("Emergent Tool Use from Multi-Agent Autocurricula," Baker et al., 2019) demonstrated emergent strategic behavior in a richer environment. Through competitive self-play between hiders and seekers, agents discovered six distinct strategies in sequence — from simple running and chasing, to fort-building, to ramp exploitation, to defensive counter-strategies. Each new strategy emerged as a best response to the opponent's previous strategy. Agents learned to use tools (ramps, boxes, walls) that were never explicitly incentivized. This is a co-evolutionary arms race playing out in real time, and it's a concrete demonstration that multi-agent competition produces qualitatively novel behaviors that designers didn't anticipate.
Perhaps most concerning for real-world deployment: Calvano et al. ("Artificial Intelligence, Algorithmic Pricing, and Collusion," American Economic Review, 2020) showed that Q-learning agents trained to maximize profit in repeated pricing games learned to collude without explicit communication. The agents converged on supra-competitive prices and developed reward-punishment strategies resembling tit-for-tat to sustain collusion — all without being programmed to collude. This is game-theoretic cooperation emerging from pure optimization, in a setting (pricing) with immediate economic consequences.
Why This Matters
If AI systems are deployed as autonomous agents — negotiating contracts, managing resources, trading in markets — their strategic interactions become real games with real consequences. The question of whether they cooperate or defect in social dilemmas isn't academic; it determines whether AI-mediated interactions lead to coordination or conflict.
An AI agent that understands game theory can strategically manipulate interactions — committing to threats, exploiting cooperative opponents, or creating information asymmetries. Whether this is a capability to be developed or a risk to be mitigated depends on the context, but it's a game-theoretic problem either way.
The Point
Game theory is the study of multi-agent strategic interaction. Its central lesson for AI safety is that bad outcomes can arise from good intentions when the game structure is wrong. The Prisoner's Dilemma produces mutual defection even when every player would prefer cooperation. The Stag Hunt produces coordination failure even when a cooperative equilibrium exists. Chicken produces catastrophe even when everyone wants to avoid it.
The constructive response isn't despair — it's mechanism design. If the game produces bad equilibria, change the game. Liability laws, verification mechanisms, compute governance, and international institutions are all tools for engineering better game structures. They don't require players to be virtuous. They require the rules to make virtue strategically rational.
Understanding these dynamics isn't optional for anyone working on AI safety. Whether you're designing training objectives, proposing regulation, or building multi-agent systems, you're participating in a game. The question is whether you understand its structure well enough to change it.
Written by Austin T. O'Quinn. If something here helped you or you think I got something wrong, I'd like to hear about it — oquinn.18@osu.edu.