Decision Theory for AI Safety

Prior reading: Game Theory for AI Safety

Why Decision Theory Matters

Every AI system that takes actions is implicitly using a decision theory — a framework for choosing among options given beliefs and preferences. The choice of decision theory determines:

- Whether the AI cooperates or defects in strategic situations
- Whether it's manipulable or manipulation-resistant
- Whether it takes catastrophic gambles or plays it safe
- How it reasons about its own future behavior

Decision theory isn't just philosophy. It's the operating system of agency. ...

April 9, 2025 · 5 min · Austin T. O'Quinn

Game Theory for AI Safety

Why Game Theory Matters for AI Safety

Most AI safety problems are not single-agent optimization problems. They're multi-agent strategic interactions:

- Nations deciding whether to regulate (→ see competitive-dynamics post)
- Companies deciding how much to invest in safety
- Models interacting with users, other models, and themselves across time
- Researchers deciding what to publish

Game theory is the math of strategic interaction. If you don't know it, you can't reason precisely about any of these. ...

March 26, 2025 · 5 min · Austin T. O'Quinn

Why Sparsity, and How Do We Get It?

Prior reading: Gradient Descent and Backpropagation

What Is Sparsity?

A sparse representation has mostly zeros. Instead of every neuron firing for every input, only a small subset activates.

Why Sparsity Is Good

- Interpretability: If only 50 of 10,000 features fire, you have a chance of understanding what they represent
- Efficiency: Sparse computations are cheaper
- Generalization: Sparse models tend to overfit less — they can't memorize with fewer active parameters
- Disentanglement: Sparse features tend to correspond to more independent, meaningful concepts

How We Achieve Sparsity

Architectural: ...

March 12, 2025 · 1 min · Austin T. O'Quinn
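One architectural route to sparsity mentioned in the excerpt above can be sketched as a top-k activation: keep only the k largest activations and zero the rest. A minimal sketch — the function name and the top-k choice are illustrative, not necessarily the mechanism the post describes:

```python
import numpy as np

def topk_sparsify(activations, k):
    """Keep only the k largest activations; zero everything else."""
    out = np.zeros_like(activations)
    idx = np.argsort(activations)[-k:]  # indices of the k largest values
    out[idx] = activations[idx]
    return out

a = np.array([0.1, 2.0, -0.5, 3.0, 0.7])
print(topk_sparsify(a, 2))  # only 2.0 and 3.0 survive; the rest are zeroed
```

With k = 2 out of 5, 60% of the representation is exactly zero by construction, which is the interpretability and efficiency win the post lists.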

What Is RL, Why Is It So Hard, and How Does It Go Wrong?

Prior reading: Gradient Descent and Backpropagation | Loss Functions and Spaces

The Setup

An agent takes actions in an environment, receives rewards, and learns a policy $\pi(a|s)$ that maximizes expected cumulative reward.

$$\max_\pi \mathbb{E}\left[\sum_{t=0}^T \gamma^t r_t\right]$$

Where Are the Neural Nets?

Neural networks serve as function approximators for things that are too complex to represent exactly:

- Policy network: $\pi_\theta(a|s)$ — maps states to action probabilities
- Value network: $V_\phi(s)$ — estimates expected future reward from a state
- Q-network: $Q_\psi(s,a)$ — estimates expected future reward from a state-action pair
- World model (optional): $p_\omega(s'|s,a)$ — predicts next state

Without neural nets, RL only works for tiny state spaces (tabular methods). Neural nets let it scale to images, language, and continuous control. ...

February 26, 2025 · 3 min · Austin T. O'Quinn
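The quantity inside the RL objective above, $\sum_t \gamma^t r_t$, is just a discounted accumulation over a trajectory. A minimal sketch of that inner sum (the helper name is mine, not from the post):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum over t of gamma**t * r_t for one trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

The discount $\gamma < 1$ is what makes the infinite-horizon sum finite and what encodes a preference for earlier reward.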

Loss Functions, Decision Boundaries, Activation Spaces, and Why MSE

Prior reading: Gradient Descent and Backpropagation

Three Ways to Look at a Model

- Loss surface: The landscape over parameter space. What the optimizer sees.
- Decision boundary: The surface in input space that separates classes. What the user sees.
- Activation space: The internal geometry of learned representations. What the model "thinks."

These are different views of the same object, but they behave differently.

Which Are Data-Dependent?

- Loss surface: Entirely data-dependent. Change the data, change the landscape.
- Decision boundary: Data-dependent through training, but fixed at inference.
- Activation space: Shaped by data and architecture jointly. The architecture constrains which representations are possible; the data selects among them.

How They Relate

The loss function defines the objective. Gradient descent reshapes the decision boundary to minimize loss. The activation space is the intermediate computation that makes the decision boundary expressible. ...

February 12, 2025 · 3 min · Austin T. O'Quinn

Linear Algebra Proofs of Optimality for Gradient Descent

Prior reading: Gradient Descent and Backpropagation

When Gradient Descent Is Provably Optimal

For convex functions, gradient descent converges to the global minimum. For strongly convex functions, it converges exponentially fast. The proofs are clean linear algebra.

The Convex Case

A function $f$ is convex if for all $x, y$ and all $\lambda \in [0, 1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)$$

This means every local minimum is a global minimum. Gradient descent can't get stuck. ...

January 28, 2025 · 2 min · Austin T. O'Quinn
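The "can't get stuck" claim from the excerpt above is easy to check numerically: on a strongly convex function like $f(x) = x^2$, plain gradient descent drives the iterate toward the unique global minimum geometrically. A minimal sketch — the step size and iteration count are illustrative choices, not from the post:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly apply x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# f(x) = x**2 is strongly convex with gradient 2x; global minimum at x = 0.
# Each step multiplies x by (1 - 2*lr) = 0.8, so |x| shrinks exponentially.
x_final = gradient_descent(lambda x: 2 * x, x0=5.0, lr=0.1, steps=100)
print(abs(x_final) < 1e-6)  # True
```

The per-step contraction factor $(1 - 2\cdot\text{lr})$ is exactly the exponential rate the strongly convex proof delivers.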

Gradient Descent, Backpropagation, and the Misconceptions That Tripped Me Up

This post starts from the ordinary derivative and builds to gradient descent for neural networks. If you already know multivariable calculus, you can skip to Why the Gradient is Steepest. If you're here for the ML connection, skip to Applied to Machine Learning. But I'd encourage reading the whole thing — several of the "obvious" steps are where my own misconceptions lived.

Starting from One Dimension

The Derivative as a Rate

You have a function $f(x)$. The derivative $f'(x)$ tells you: if you nudge $x$ by a tiny amount $\Delta x$, how much does $f$ change? ...

January 15, 2025 · 36 min · Austin T. O'Quinn
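The "nudge $x$ and see how $f$ changes" picture in the excerpt above translates directly into a finite-difference approximation of the derivative. A minimal sketch — the symmetric difference quotient is my choice of approximation, not necessarily the post's:

```python
def numeric_derivative(f, x, dx=1e-6):
    """Approximate f'(x) by the symmetric difference quotient."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)

# For f(x) = x**3, the exact derivative is 3*x**2, so f'(2) = 12.
print(numeric_derivative(lambda x: x**3, 2.0))  # approximately 12.0
```

The symmetric version has error on the order of $\Delta x^2$, versus $\Delta x$ for the one-sided quotient, which is why it is the usual default for numerical checks.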