Prior reading: Mesa-Optimization and Three Lenses | Game Theory for AI Safety
Part I: What Does "Safe" Even Mean?
"Make AI safe" is meaningless without specifying: safe for whom, against what threat, under what conditions?
Who Is the User?
- Public: Lowest common denominator. Must handle naive, careless, and adversarial users simultaneously.
- Internal / enterprise: Can assume some training, access controls, and monitoring.
- Knowledgeable human: Researchers, developers. Different failure modes matter.
Who Is the Adversary?
- No adversary: Accidental misuse, honest mistakes. The easiest case.
- Casual adversary: Jailbreaking for fun, social engineering. Medium difficulty.
- Sophisticated adversary: State actors, determined attackers with resources. The hard case.
What Are We Protecting?
- Users from the model: Preventing harmful outputs.
- The model from users: Preventing extraction, manipulation, prompt injection.
- Society from the system: Preventing large-scale harms (economic disruption, disinfo).
- The future from the present: Preventing lock-in, power concentration, existential risk.
Safety claims without a threat model are empty. A system "safe" for internal research may be wildly unsafe for public deployment.
Areas That May Be Structurally Safer
Some domains resist AI disruption: those requiring physical presence and diversity, high-trust relationships, or tasks that are cheap for humans but expensive to automate. Understanding these helps prioritize where safety work matters most.
Part II: The Four Threat Vectors
1. Economic Threats
Wealth inequality and the shift toward capital. AI automates labor. Returns shift from wages to capital ownership. Those who own the AI systems capture the value; those who don't lose bargaining power.
Erosion of worker power, even in surviving fields. Fields that are not fully automated still face AI-augmented competition: workers must adopt AI or be outcompeted, which shifts leverage toward employers and platform owners.
Deskilling and over-reliance. There is growing evidence that AI makes people perform worse in some contexts: deskilling, over-reliance, and automation complacency are real. The productivity gains may be unevenly distributed and partially illusory.
Race to the bottom for companies. AI companies are themselves threatened by AGI: rapid progress commoditizes today's competitive advantages. The resulting inequality and economic churn may feed zero-sum politics and global instability.
2. Authoritarian Enablement
Mass mis/disinformation. AI-generated content at scale overwhelms human ability to distinguish truth from fabrication.
Surveillance state. AI-powered surveillance enables monitoring at a scale previously impossible. Facial recognition, behavioral prediction, social scoring.
Reduced necessity of people. Authoritarian regimes historically depended on their populations for economic production and military manpower. AI reduces both dependencies, potentially weakening the incentive to maintain civil liberties.
Degraded independent judgment. There is concerning data on AI making people perform worse at tasks requiring independent judgment, which has implications for democratic participation and resistance to authoritarianism.
3. Misuse of AGI
Bioweapons. AI lowers the expertise barrier for designing dangerous biological agents.
Cyber. AI-powered vulnerability discovery and exploit generation at scale.
Other force multipliers. Any domain where AI amplifies the capability of small, malicious actors.
4. Existential Risk
Hard to regulate. International competition makes unilateral regulation costly. Selective pressure on nations: regulate and fall behind, or race and accept risk.
Selective pressure at three layers:
- Model vs. model: Less aligned models may have strategic advantages (more options, fewer constraints).
- Nation vs. nation: Countries that deploy unsafe AI faster gain economic and military advantages.
- Corporation vs. corporation: Companies that cut safety corners ship faster.
Misalignment as structural advantage. An unconstrained optimizer has strictly more strategies available than a constrained one. This creates pressure against alignment at every level of competition.
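Stated minimally (the notation below is introduced here for illustration, not taken from the original argument): if the constrained strategy set S is contained in the unconstrained set S', then for any objective u, and assuming the maxima exist, the unconstrained optimizer can always do at least as well.

```latex
\[
  S \subseteq S' \;\Longrightarrow\; \max_{s \in S} u(s) \;\le\; \max_{s \in S'} u(s)
\]
```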
Mesa/meta-optimization. Training processes that produce mesa-optimizers with misaligned goals. (→ see mesa-optimization post)
P-hacking safety. Optimizing for benchmark appearance rather than actual safety. (→ see p-hacking post)
Part III: Why X-Risk Deserves Serious Attention
The Expected Value Argument
Even a small probability of existential catastrophe, multiplied by the magnitude of the loss (all future human potential), produces a large expected disvalue. This makes x-risk worth significant investment even under uncertainty.
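To put the argument in symbols (p, L, Δp, and C are notation introduced here, not estimates from this post): with catastrophe probability p and loss magnitude L, the expected disvalue is p · L, and an intervention of cost C that reduces the probability by Δp is justified whenever C < Δp · L.

```latex
\[
  \mathbb{E}[\text{disvalue}] = p \cdot L ,
  \qquad
  \text{fund an intervention of cost } C \text{ if } C < \Delta p \cdot L .
\]
```

Because L stands for all future human potential, even small values of p or Δp can justify substantial investment, which is the expected-value argument in compressed form.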
Why Smart People Disagree
- Skeptics: Current AI is far from AGI. Extrapolating from language models to existential risk is premature.
- Concerned: The trajectory is steep. We don't know when capabilities will become dangerous, and safety research takes longer than capabilities research.
- Very concerned: Competitive dynamics make the problem structurally hard to solve even if everyone agrees it's important.
The Structural Problem
The core issue isn't any single model. It's the competitive dynamics:
- Multiple actors (nations, companies) race to develop more capable AI
- Safety is a cost that slows you down
- The actor who cuts safety corners gets capabilities first
- "First" matters because AI capability may be a decisive strategic advantage
- Therefore, competitive pressure selects against safety
This is a multi-level collective action problem. It exists at the model level, the corporate level, and the international level simultaneously.
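A minimal sketch of this dynamic as a two-player game, in Python. The payoff numbers are placeholder assumptions chosen only to exhibit the structure described above, not empirical estimates.

```python
# Illustrative two-actor "safety vs. speed" game. The payoffs are placeholder
# assumptions, not estimates; they only encode the structure described above.

# Each actor chooses "safe" (invest in safety) or "race" (cut corners).
# payoffs[(a_action, b_action)] = (payoff to actor A, payoff to actor B).
payoffs = {
    ("safe", "safe"): (3, 3),   # both cautious: best joint outcome
    ("safe", "race"): (0, 4),   # the racer gains a decisive lead
    ("race", "safe"): (4, 0),
    ("race", "race"): (1, 1),   # both race: shared risk, eroded safety
}

def best_response(opponent_action: str) -> str:
    """Return actor A's payoff-maximizing action against a fixed opponent action."""
    return max(("safe", "race"), key=lambda a: payoffs[(a, opponent_action)][0])

# "race" is a best response to either choice by the other actor, so
# competitive pressure pushes both toward it even though (safe, safe)
# is better for everyone.
for opp in ("safe", "race"):
    print(f"If the other actor plays {opp!r}, the best response is {best_response(opp)!r}")
```

Under these assumptions, "race" comes out as the best response to either opponent action, which is the collective action problem in miniature; coordination or cheaper safety changes the payoffs rather than the players.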
What Would Help
- International coordination (hard but necessary)
- Technical safety research that makes safety less costly (reduces the competitive penalty)
- Better measurement and evaluation of dangerous capabilities
- Governance structures that can move at the speed of development