Why AI Policy Is Hard Even When Everyone Agrees

Prior reading: Competitive Dynamics and Safety | P-Hacking and Benchmarks

The Premise

The competitive-dynamics post explains why motivation for AI safety is lacking. This post assumes the opposite: everyone is motivated. Every nation, every company, every researcher genuinely wants to regulate AI well. It's still incredibly hard. Here's why.

The Object You're Regulating Keeps Changing

Capability Jumps Are Unpredictable

Regulation assumes you can define what you're regulating. But AI capabilities change discontinuously. A model goes from "can't do X" to "can do X fluently" between training runs, sometimes at scale thresholds no one predicted. ...

February 18, 2026 · 11 min · Austin T. O'Quinn

Why AI Safety Is Hard to Sell: Competitive Dynamics and the Race to the Bottom

Prior reading: Game Theory for AI Safety | The AI Threat Landscape

The Core Problem

AI safety is short-term costly and long-term valuable. Every actor faces pressure to defect.

International Competition

If Country A regulates AI development and Country B doesn't, Country B gets capabilities first. The perceived cost of falling behind in AI is existential for national security and economic competitiveness. No country wants to be the one that regulated itself out of the race. ...
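The defection pressure has the shape of a one-shot prisoner's dilemma. Here is a minimal Python sketch with illustrative payoffs (the numbers are hypothetical, not from the post): racing dominates regulating for each country individually, even though mutual regulation is better for both.

```python
# Two countries each choose to regulate or keep developing.
ACTIONS = ("regulate", "develop")

# PAYOFF[(mine, theirs)] = my payoff. Numbers are illustrative only.
PAYOFF = {
    ("regulate", "regulate"): 3,  # coordinated safety: good for both
    ("regulate", "develop"):  0,  # regulated out of the race: worst case
    ("develop",  "regulate"): 4,  # capability lead: best case
    ("develop",  "develop"):  1,  # race to the bottom
}

def best_response(their_action: str) -> str:
    """My payoff-maximizing action, given the other country's action."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, their_action)])

# "develop" is the best response to either choice, i.e. a dominant
# strategy, so the equilibrium is (develop, develop) even though
# (regulate, regulate) pays more to both players.
assert all(best_response(a) == "develop" for a in ACTIONS)
```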

February 11, 2026 · 3 min · Austin T. O'Quinn

The AI Threat Landscape: What 'Safe' Means and What We're Afraid Of

Prior reading: Mesa-Optimization and Three Lenses | Game Theory for AI Safety

Part I: What Does "Safe" Even Mean?

"Make AI safe" is meaningless without specifying: safe for whom, against what threat, under what conditions?

Who Is the User?

- Public: Lowest common denominator. Must handle naive, careless, and adversarial users simultaneously.
- Internal / enterprise: Can assume some training, access controls, and monitoring.
- Knowledgeable human: Researchers, developers. Different failure modes matter.

Who Is the Adversary?

- No adversary: Accidental misuse, honest mistakes. The easiest case.
- Casual adversary: Jailbreaking for fun, social engineering. Medium difficulty.
- Sophisticated adversary: State actors, determined attackers with resources. The hard case.

What Are We Protecting?

- Users from the model: Preventing harmful outputs.
- The model from users: Preventing extraction, manipulation, prompt injection.
- Society from the system: Preventing large-scale harms (economic disruption, disinfo).
- The future from the present: Preventing lock-in, power concentration, existential risk.

Safety claims without a threat model are empty. A system "safe" for internal research may be wildly unsafe for public deployment. ...
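As a rough illustration of the taxonomy, here is a sketch (all names hypothetical, not from the post) that treats a threat model as a triple of user, adversary, and protection target; a safety claim then only means something relative to one such triple.

```python
from dataclasses import dataclass
from enum import Enum

class User(Enum):
    PUBLIC = "public"
    ENTERPRISE = "internal/enterprise"
    EXPERT = "knowledgeable human"

class Adversary(Enum):
    NONE = "no adversary"
    CASUAL = "casual adversary"
    SOPHISTICATED = "sophisticated adversary"

class Protecting(Enum):
    USERS_FROM_MODEL = "users from the model"
    MODEL_FROM_USERS = "the model from users"
    SOCIETY_FROM_SYSTEM = "society from the system"
    FUTURE_FROM_PRESENT = "the future from the present"

@dataclass(frozen=True)
class ThreatModel:
    user: User
    adversary: Adversary
    protecting: Protecting

# "Safe for internal research" pins down one cell of the grid...
internal_research = ThreatModel(User.EXPERT, Adversary.NONE,
                                Protecting.USERS_FROM_MODEL)
# ...and says nothing about a very different cell, e.g. public
# deployment against sophisticated adversaries.
public_deployment = ThreatModel(User.PUBLIC, Adversary.SOPHISTICATED,
                                Protecting.SOCIETY_FROM_SYSTEM)
assert internal_research != public_deployment
```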

January 21, 2026 · 4 min · Austin T. O'Quinn

What Is RL, Why Is It So Hard, and How Does It Go Wrong?

Prior reading: Gradient Descent and Backpropagation | Loss Functions and Spaces

The Setup

An agent takes actions in an environment, receives rewards, and learns a policy $\pi(a|s)$ that maximizes expected cumulative reward.

$$\max_\pi \mathbb{E}\left[\sum_{t=0}^T \gamma^t r_t\right]$$

Where Are the Neural Nets?

Neural networks serve as function approximators for things that are too complex to represent exactly:

- Policy network: $\pi_\theta(a|s)$ — maps states to action probabilities
- Value network: $V_\phi(s)$ — estimates expected future reward from a state
- Q-network: $Q_\psi(s,a)$ — estimates expected future reward from a state-action pair
- World model (optional): $p_\omega(s'|s,a)$ — predicts next state

Without neural nets, RL only works for tiny state spaces (tabular methods). Neural nets let it scale to images, language, and continuous control. ...
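For concreteness, here is a minimal sketch of two of these approximators, assuming PyTorch and a discrete action space (architecture and sizes are illustrative, not the post's implementation): a policy network producing a distribution over actions, a value network, and the discounted return $\sum_t \gamma^t r_t$ that the objective maximizes in expectation.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi_theta(a|s): maps a state vector to action probabilities."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # A Categorical over actions; sample() gives a, log_prob() trains it.
        return torch.distributions.Categorical(logits=self.net(state))

class ValueNet(nn.Module):
    """V_phi(s): estimates expected future reward from a state."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

def discounted_return(rewards: list[float], gamma: float = 0.99) -> float:
    """Computes sum_t gamma^t r_t for one episode's reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```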

February 26, 2025 · 3 min · Austin T. O'Quinn