The AI Threat Landscape: What 'Safe' Means and What We're Afraid Of

Prior reading: Mesa-Optimization and Three Lenses | Game Theory for AI Safety

Part I: What Does "Safe" Even Mean?

"Make AI safe" is meaningless without specifying: safe for whom, against what threat, under what conditions?

Who Is the User?

Public: Lowest common denominator. Must handle naive, careless, and adversarial users simultaneously.
Internal / enterprise: Can assume some training, access controls, and monitoring.
Knowledgeable human: Researchers, developers. Different failure modes matter.

Who Is the Adversary?

No adversary: Accidental misuse, honest mistakes. The easiest case.
Casual adversary: Jailbreaking for fun, social engineering. Medium difficulty.
Sophisticated adversary: State actors, determined attackers with resources. The hard case.

What Are We Protecting?

Users from the model: Preventing harmful outputs.
The model from users: Preventing extraction, manipulation, prompt injection.
Society from the system: Preventing large-scale harms (economic disruption, disinfo).
The future from the present: Preventing lock-in, power concentration, existential risk.

Safety claims without a threat model are empty. A system "safe" for internal research may be wildly unsafe for public deployment. ...

January 21, 2026 · 4 min · Austin T. O'Quinn

Layers of Safety: From Data to Deployment

Prior reading: What Are Formal Methods? | What Is RL?

Safety Is a Stack

No single technique makes AI safe. Safety comes from layers, each catching different failure modes.

The Layers

1. Clean Data

Garbage in, garbage out. Data quality determines the prior the model learns from. Biased, toxic, or incorrect data embeds those properties in the model.

2. Good Specifications and Benchmarks

What are we optimizing for? Goodhart's Law applies: once the benchmark becomes the target, it stops being a good benchmark. Benchmark gaming is optimization against your spec. ...

August 20, 2025 · 2 min · Austin T. O'Quinn