Prior reading: Game Theory for AI Safety | The AI Threat Landscape | P-Hacking and Benchmarks

The Core Problem

AI safety is short-term costly and long-term valuable. Every actor faces pressure to defect.

This post makes two claims. First, the motivation to prioritize safety is structurally lacking — everyone has reasons to cut corners. Second, even if we could fix motivation entirely, the regulatory problem is so hard that good intentions wouldn't be enough. Both have to be true for the situation to be as bad as it is. Unfortunately, both are.

Part I: Why Nobody Stops

International Competition

If Country A regulates AI development and Country B doesn't, Country B gets capabilities first. Falling behind in AI is perceived as an existential threat to national security and economic competitiveness.

No country wants to be the one that regulated itself out of the race.

Corporate Competition

If Company A invests heavily in safety and Company B ships faster without it, Company B captures the market. Safety is a competitive disadvantage in the short run.

In the short term, AI is good for its owners and bad for everyone else. In the long term, it is likely uncontrollable and bad even for owners. But owners must race anyway: if they don't, they won't even get the chance to be long-term owners.

AI companies are also threatened by AGI. Rapid progress commoditizes current advantages. Today's moat is tomorrow's open-source library.

For Workers

Even in fields that survive, AI shifts bargaining power toward employers. "Learn to use AI" is individual advice that doesn't solve the structural problem. Rapid progress and growing inequality may lead to zero-sum politics and global instability.

Selective Forces at Every Level

  • Models: Less constrained models can pursue more strategies. Alignment is a restriction on the strategy space.
  • Companies: Safety costs money, slows shipping, and doesn't show up in quarterly earnings.
  • Nations: Unilateral regulation = unilateral disarmament.

Each level creates selection pressure against safety.

The Paradox

Every individual actor is making a locally rational decision. The collective outcome is globally irrational. This is a textbook collective action problem, with the same structure as a prisoner's dilemma or a tragedy of the commons.
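The structure of this paradox can be made concrete with a toy two-actor game. The sketch below is illustrative only: the payoff numbers are invented assumptions, chosen so that cutting corners is individually better whatever the rival does, and the names `PAYOFFS`, `best_response`, and `total_welfare` are hypothetical.

```python
# Toy two-actor safety race. All payoff numbers are illustrative assumptions.
SAFE, RACE = "invest_in_safety", "cut_corners"
ACTIONS = (SAFE, RACE)

# PAYOFFS[(my_action, rival_action)] = my payoff (higher is better).
PAYOFFS = {
    (SAFE, SAFE): 3,  # both careful: shared long-term benefit
    (SAFE, RACE): 0,  # I am careful, the rival ships first and takes the market
    (RACE, SAFE): 4,  # I ship first and take the market
    (RACE, RACE): 1,  # both race: short-term parity, long-term risk for all
}

def best_response(rival_action):
    """Return the action that maximizes my payoff given the rival's action."""
    return max(ACTIONS, key=lambda mine: PAYOFFS[(mine, rival_action)])

def total_welfare(a, b):
    """Sum of both actors' payoffs for the action pair (a, b)."""
    return PAYOFFS[(a, b)] + PAYOFFS[(b, a)]

for rival in ACTIONS:
    print(f"rival plays {rival}: my best response is {best_response(rival)}")
print("welfare if both invest:", total_welfare(SAFE, SAFE))
print("welfare if both race:  ", total_welfare(RACE, RACE))
```

Under these assumed payoffs, racing is the best response to either rival action, yet mutual racing leaves both actors worse off than mutual investment in safety.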

Why People Don't Listen

  • The threat is abstract and future. The competition is concrete and now.
  • Safety advocates can't demonstrate the counterfactual (what disaster was prevented).
  • The people making decisions benefit from the status quo.
  • "It probably won't be that bad" is psychologically easier than "we need to coordinate globally on an unprecedented problem."

Part II: Why Good Intentions Aren't Enough

The section above explains why motivation is lacking. Now assume the opposite: everyone is motivated. Every nation, every company, every researcher genuinely wants to regulate AI well. It's still incredibly hard.

The Object You're Regulating Keeps Changing

Capability jumps are unpredictable. A model goes from "can't do X" to "can do X fluently" between training runs, sometimes at scale thresholds no one predicted. You can't write regulation for capabilities that don't exist yet but might next quarter.

Categories don't hold. Is an LLM a search engine? A tool? An agent? An author? A medical device? The legal categories we have were designed for objects that stay in one category. AI crosses categories depending on how it's used.

"Frontier" is a moving line. Regulation often targets "frontier models" or "models above X compute threshold." But today's frontier is next year's open-source commodity.

The Compute Threshold Trap

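Compute-based governance frameworks lean heavily on training compute as a regulatory trigger: the EU AI Act presumes systemic risk for general-purpose models trained with more than $10^{25}$ FLOPs, and US Executive Order 14110 (rescinded in January 2025) set its reporting line at $10^{26}$. Set the line at $10^{26}$ FLOPs and you've captured the frontier, or so the reasoning goes. The problem is that this treats risk as a function of production cost. It isn't.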

Efficiency is eating the threshold. Algorithmic efficiency improves relentlessly. By some estimates, the compute required to reach a fixed capability level drops by roughly an order of magnitude per year, though measured rates vary by task and method. A static threshold doesn't slowly become outdated; it becomes irrelevant on a schedule.
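To see what "on a schedule" means, here is a back-of-the-envelope sketch. It assumes the rough tenfold annual efficiency gain mentioned above holds as a constant, which is a simplification, and the function names and the example budget are hypothetical.

```python
import math

THRESHOLD_FLOP = 1e26            # static regulatory trigger discussed in the text
EFFICIENCY_GAIN_PER_YEAR = 10.0  # assumed constant; real rates vary

def compute_needed(years_elapsed, baseline_flop=THRESHOLD_FLOP,
                   gain=EFFICIENCY_GAIN_PER_YEAR):
    """Compute needed after `years_elapsed` to match a capability that
    costs `baseline_flop` today, under a constant efficiency gain."""
    return baseline_flop / gain ** years_elapsed

def years_until_affordable(budget_flop, baseline_flop=THRESHOLD_FLOP,
                           gain=EFFICIENCY_GAIN_PER_YEAR):
    """Years until `budget_flop` buys today's threshold-level capability."""
    if budget_flop >= baseline_flop:
        return 0.0
    return math.log(baseline_flop / budget_flop) / math.log(gain)

for years in range(4):
    print(f"year {years}: threshold-level capability needs "
          f"{compute_needed(years):.0e} FLOP")
print(f"a 1e23 FLOP budget catches up in about "
      f"{years_until_affordable(1e23):.1f} years")
```

Under this assumption, a capability that sits exactly at the line today falls a thousandfold below it within three years, while the statute still reads $10^{26}$.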

Risk isn't one-dimensional. Danger emerges from the interaction of at least three properties:

  • Intelligence ($I$): How well the system reasons.
  • Generality ($G$): How broadly it can apply that reasoning.
  • Agenticity ($A$): How autonomously it can act.

The danger surface $D = f(I, G, A)$ is nonlinear and multi-dimensional. A single-axis threshold based on FLOPs collapses all of this into one number and hopes for the best.

The particularly worrying part is the interaction between intelligence and agenticity. High intelligence with low agenticity is probably manageable: that's a very smart tool. But there seems to be a regime where adding agenticity to an already-intelligent system causes the marginal risk, $\partial D / \partial A$, to spike. A phase transition, not a gradual ramp. We don't know where it sits.
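One way to picture the gap between a single-axis threshold and a danger surface is a toy model. Everything in the sketch below is an assumption made for illustration: the text does not specify $f$, so the multiplicative form, the logistic "phase transition" in agenticity, its midpoint, and the example systems are all invented, with scores on an arbitrary 0 to 1 scale.

```python
import math

def danger(intelligence, generality, agenticity, midpoint=0.6, sharpness=20.0):
    """Toy danger surface D = f(I, G, A), inputs on an arbitrary 0-1 scale.

    Assumed form (not derived from any measurement): danger scales with
    I * G and is gated by a steep logistic in A centered at `midpoint`.
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (agenticity - midpoint)))
    return intelligence * generality * gate

THRESHOLD_FLOP = 1e26

# Two hypothetical systems trained with the same compute.
systems = {
    "smart tool":  {"flop": 5e25, "I": 0.9, "G": 0.8, "A": 0.2},
    "smart agent": {"flop": 5e25, "I": 0.9, "G": 0.8, "A": 0.8},
}

for name, s in systems.items():
    flagged = s["flop"] >= THRESHOLD_FLOP
    d = danger(s["I"], s["G"], s["A"])
    print(f"{name}: flagged by FLOP threshold={flagged}, toy danger={d:.4f}")
```

In this toy setup both systems sit under the compute line and look identical to a FLOP-based rule, while their danger scores differ by orders of magnitude because only one of them is past the assumed transition in $A$.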

The sandbox doesn't hold. A highly intelligent model with restricted agenticity can generate proxy agency by persuading the humans who interact with it. It doesn't need API access if it can convince an operator to look something up. The model's effective agenticity isn't just what you grant it — it's what it can induce through its output channel. High intelligence can bootstrap agenticity through social engineering, even in a locked-down environment.

Edge Proliferation

All of the above assumes you can identify who to regulate.

When frontier capability required a datacenter, regulation had a natural enforcement point: a handful of actors. But models are getting dramatically cheaper at fixed quality. Open-weight releases mean last year's frontier is available for download. Quantization and distillation squeeze capable models onto consumer hardware. The enforcement surface is expanding from "a few big labs" to "anyone with a laptop."

Once capable models run on edge devices, you lose visibility. No API to audit, no cloud provider to subpoena. The model is just there, running locally.

Effective regulation needs chokepoints — points in the supply chain where you can observe and intervene:

  • Compute providers: Can monitor training runs, but miss local training and edge inference.
  • Model distributors: Can restrict downloads, but models leak and get pirated.
  • Hardware: Could restrict GPU sales, but at the cost of massive collateral damage and geopolitical conflict, and without preventing inference on older hardware.

None are airtight, and they all become leakier as efficiency improves. The history of controlling information technology through distribution bottlenecks — cryptography export controls, DRM, content filtering — is not encouraging.

You Can't Measure What You're Trying to Control

We can measure FLOPs, parameter count, benchmark scores. We cannot measure "how dangerous is this model" in any agreed-upon unit. Without measurement, regulation is guesswork.

Any safety benchmark becomes a target. If regulation requires passing Benchmark X, companies optimize for Benchmark X — Goodhart's law applied to governance.
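Goodhart's law in this setting can be illustrated with a deliberately simple allocation model. The sketch assumes a developer splits a fixed effort budget between real safety work and benchmark-specific tuning, and that tuning raises the score more cheaply than real work does. The gain rates, the budget, and the function name `outcome` are invented for illustration.

```python
EFFORT_BUDGET = 10          # units of effort, hypothetical
SAFETY_GAIN_PER_UNIT = 1.0  # real safety per unit of genuine safety work
GAMING_GAIN_PER_UNIT = 3.0  # benchmark points per unit of benchmark tuning

def outcome(units_on_real_safety):
    """Return (benchmark_score, true_safety) for one effort split."""
    units_on_gaming = EFFORT_BUDGET - units_on_real_safety
    true_safety = SAFETY_GAIN_PER_UNIT * units_on_real_safety
    # The benchmark rewards real safety, but rewards tuning to the test more.
    benchmark_score = true_safety + GAMING_GAIN_PER_UNIT * units_on_gaming
    return benchmark_score, true_safety

splits = range(EFFORT_BUDGET + 1)
# What a score-maximizing developer picks vs. what a safety-maximizer picks.
score_optimal = max(splits, key=lambda u: outcome(u)[0])
safety_optimal = max(splits, key=lambda u: outcome(u)[1])

print("score-maximizing split: ", score_optimal, outcome(score_optimal))
print("safety-maximizing split:", safety_optimal, outcome(safety_optimal))
```

With these assumed rates, the developer who maximizes the regulated score puts no effort into real safety and still posts the highest benchmark number.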

And capabilities are dual-use. "Can reason about biology" is both a bioweapon risk and a drug discovery tool. There's no scalpel — only a sledgehammer.

The Regulatory Apparatus Is Too Slow

A major AI bill takes 1–3 years from proposal to law. A major AI capability can emerge in weeks. Rushing the legislative process means either writing bad law quickly or concentrating power in unelected agencies. Neither is obviously good.

The people who deeply understand AI are mostly employed by the companies being regulated. Regulatory bodies can't compete on salary. The result is a persistent information asymmetry between regulators and the regulated.

And even if one country gets it right, AI is global. International agreements require shared definitions, shared measurement, shared enforcement, and shared speed. None exist yet. Unlike the fissile material at the center of nuclear nonproliferation, the "dangerous material" in AI is information, which is trivially easy to copy and distribute.

AI Changes the Policy Environment

AI-generated content floods public discourse. AI tools accelerate lobbying campaigns. AI automates policy analysis — and policy manipulation. The tool you're trying to regulate is actively reshaping the ground you're standing on. Cars don't make transportation policy harder to write. AI might genuinely make AI policy harder to write.

Even well-intentioned companies push for regulation they can comply with faster than competitors. The natural outcome is regulation that codifies current practice rather than pushing toward safety — protecting incumbents rather than the public.

What Might Actually Work

None of this means giving up. It means static, threshold-based, one-time legislation is the wrong tool.

  • Dynamic thresholds: If you use compute thresholds, they need to float. Peg reporting requirements to the current efficiency frontier — an index, updated annually, like inflation adjustments in tax law.
  • Capability-based regulation: Instead of regulating inputs (compute), regulate outputs (what the model can do). Define specific dangerous capabilities and require developers to evaluate for them.
  • Liability over prescription: Hold developers liable for harms rather than specifying what models can and can't do. The obligation ("don't cause harm") doesn't change even when capabilities do.
  • Red lines over bright lines: Define prohibited outcomes ("no model may enable autonomous bioweapon synthesis") and require developers to demonstrate compliance. Burden on developers, outcome-based.
  • Institutional infrastructure: Standing bodies with technical expertise and enforcement power that can act between legislative cycles. The FDA model — a permanent agency with domain expertise — is probably closer to right than periodic executive orders.
  • Making safety cheaper: Technical research that reduces the competitive penalty. If safety doesn't slow you down, the race dynamic weakens.
  • International agreements with verification: Regulation that applies to all competitors simultaneously.
  • Transparency: Making cutting corners visible.
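As a sketch of the dynamic-threshold item, a floating threshold could be implemented as an index that is re-derived each year from a measured efficiency gain. The base year, base threshold, and annual gain figures below are hypothetical placeholders, not real measurements, and `indexed_threshold` and `must_report` are illustrative names.

```python
BASE_YEAR = 2024
BASE_THRESHOLD_FLOP = 1e26

# Hypothetical measured efficiency gains: the factor by which the compute
# needed for a fixed capability fell during each year. Placeholder values.
MEASURED_GAIN = {2024: 4.0, 2025: 8.0, 2026: 5.0}

def indexed_threshold(year):
    """Reporting threshold for `year`, deflated by the cumulative gains
    measured in every year from BASE_YEAR up to (not including) `year`."""
    if year < BASE_YEAR:
        raise ValueError("year precedes the base year")
    threshold = BASE_THRESHOLD_FLOP
    for y in range(BASE_YEAR, year):
        threshold /= MEASURED_GAIN[y]  # raises KeyError if a year is unmeasured
    return threshold

def must_report(training_flop, year):
    """True if a run of `training_flop` meets that year's indexed threshold."""
    return training_flop >= indexed_threshold(year)

run_flop = 5e24
for year in (2024, 2026):
    status = "report" if must_report(run_flop, year) else "exempt"
    print(year, f"threshold={indexed_threshold(year):.2e}", status)
```

With these placeholder gains, the same 5e24 FLOP run is exempt in 2024 and reportable in 2026 without any new legislation, which is the point of pegging the line to an index.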

The Honest Assessment

Even with universal motivation, AI policy is a problem of regulating a moving target you can't measure, using tools that are too slow, with thresholds that erode exponentially, in a landscape where the target is proliferating to the edge and reshaping the policy environment as it goes.

Without universal motivation — which is the world we actually live in — add to that list: every actor has incentives to defect, the people making decisions benefit from the status quo, and the threat is abstract enough that "it probably won't be that bad" is always the easier position.

Policy alone is insufficient. Governance has to be paired with technical safety research that makes the regulatory problem more tractable, and it has to be designed for adaptation rather than permanence. The worst outcome is a false sense of security — confident regulation that was accurate for about six months.