The Specification and Language Problem

Prior reading: What Are Formal Methods? | Model Checking and Formal Specification

The Problem
Formal methods can prove a system satisfies a specification. The hard part is writing the specification. "Be helpful and don't cause harm" is not a formal spec. Turning it into one requires resolving ambiguity, edge cases, and value judgments that humans can't even agree on in natural language.

Specification as Translation
Every spec is a translation from human intent to formal language. Translation is lossy. The gap between what we mean and what we write is the specification problem. ...

June 4, 2025 · 2 min · Austin T. O'Quinn

Model Checking and Formal Specification

Prior reading: What Are Formal Methods?

What Is Model Checking?
Given a model of a system and a property expressed in formal logic, model checking exhaustively verifies whether the property holds in all reachable states.

Temporal Logic Crash Course
LTL (Linear Temporal Logic): "Eventually the system reaches a safe state." $\Diamond \text{safe}$
CTL (Computation Tree Logic): "On all paths, the system never enters an unsafe state." $\forall\Box \neg \text{unsafe}$
These let you express liveness, safety, and fairness properties.

Classical Model Checking
Works brilliantly for finite-state systems: protocols, hardware, control software. Tools like SPIN, NuSMV, and PRISM are mature. ...
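The exhaustive-search idea can be sketched in a few lines. This is a hypothetical toy illustration, not how SPIN or NuSMV are implemented: a breadth-first exploration of a small finite-state transition system that checks the safety property "no reachable state is unsafe" (the CTL property $\forall\Box \neg \text{unsafe}$). The state names and transitions are made up.

```python
from collections import deque

def check_safety(initial, transitions, unsafe):
    """Exhaustively explore all reachable states; return (holds, witness).

    transitions: dict mapping state -> list of successor states.
    unsafe: set of states that violate the safety property.
    """
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if state in unsafe:
            return False, state  # counterexample: a reachable unsafe state
        for nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, None  # property holds in every reachable state

# Toy protocol: "crash" exists in the model but no path leads to it.
transitions = {
    "idle": ["busy"],
    "busy": ["idle", "done"],
    "done": [],
    "crash": [],
}
holds, witness = check_safety("idle", transitions, unsafe={"crash"})
```

Because the search visits every reachable state, a "holds" answer is a proof over the model, not a sample of test runs; the cost is that the state space must be finite and small enough to enumerate.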

May 21, 2025 · 1 min · Austin T. O'Quinn

Reachability Analysis for Neural Networks

Prior reading: What Are Formal Methods? | Linear Algebra Proofs of Optimality

The Core Question
Given a neural network $f$ and an input region $\mathcal{X}$, what is the set of all possible outputs $f(\mathcal{X})$? If we can bound this set, we can prove properties like "no input in this region produces a harmful output."

Why This Is Hard
Neural networks are piecewise-linear (ReLU) or worse (sigmoid, softmax). The number of linear regions grows exponentially with depth. Exact reachability is NP-hard. ...
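A standard sound-but-conservative way to bound $f(\mathcal{X})$ is interval bound propagation: push an axis-aligned box through each layer, over-approximating at every step. A minimal numpy sketch, assuming a tiny 2-2-1 ReLU network whose weights are hand-picked for illustration (not from any real model):

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b."""
    center = (lo + hi) / 2
    radius = (hi - lo) / 2
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius  # worst case over the box
    return new_center - new_radius, new_center + new_radius

def interval_relu(lo, hi):
    """ReLU is monotone, so it maps interval bounds to interval bounds exactly."""
    return np.maximum(lo, 0), np.maximum(hi, 0)

# Illustrative 2-2-1 network.
W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.5])

lo, hi = np.array([-0.1, -0.1]), np.array([0.1, 0.1])  # input region X
lo, hi = interval_relu(*interval_affine(lo, hi, W1, b1))
lo, hi = interval_affine(lo, hi, W2, b2)
# Every output f(x) for x in X is guaranteed to lie in [lo, hi].
```

The bound is sound (it never misses a reachable output) but loose: each layer forgets correlations between coordinates, which is exactly the conservatism-versus-scalability trade-off the post describes.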

May 7, 2025 · 1 min · Austin T. O'Quinn

What Are Formal Methods in AI Safety?

What Are Formal Methods?
Formal methods give mathematical guarantees about system behavior. Instead of testing examples and hoping they generalize, you prove properties hold for all inputs in a defined space.

How Other Safety Approaches Work
Empirical testing: Run inputs, check outputs. Only covers what you test.
Red-teaming: Adversarial search for failures. Better coverage, still incomplete.
RLHF / Constitutional AI: Shape behavior through training signal. No guarantees on out-of-distribution inputs.
Interpretability: Understand what the model learned. Doesn't directly constrain behavior.

What Formal Methods Add
Soundness: If the proof says "this property holds," it holds. Period.
Completeness trade-offs: Sound methods may be conservative; they might say "I can't prove it's safe" when it actually is.
Scalability: The core challenge. Formal methods work beautifully on small systems. Neural networks have millions of parameters.

The Gap
Why aren't we just doing formal verification on everything? Because: ...

April 23, 2025 · 1 min · Austin T. O'Quinn

Decision Theory for AI Safety

Prior reading: Game Theory for AI Safety

Why Decision Theory Matters
Every AI system that takes actions is implicitly using a decision theory: a framework for choosing among options given beliefs and preferences. The choice of decision theory determines:
Whether the AI cooperates or defects in strategic situations
Whether it's manipulable or manipulation-resistant
Whether it takes catastrophic gambles or plays it safe
How it reasons about its own future behavior
Decision theory isn't just philosophy. It's the operating system of agency. ...
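"Choosing among options given beliefs and preferences" is, in its simplest form, expected-utility maximization. A minimal sketch with made-up probabilities and utilities, contrasting a safe action against a catastrophic gamble:

```python
def expected_utility(action, beliefs, utility):
    """E[U | action] = sum over outcomes of P(outcome | action) * U(outcome)."""
    return sum(p * utility[out] for out, p in beliefs[action].items())

# Hypothetical beliefs P(outcome | action); utilities are illustrative.
beliefs = {
    "safe":   {"modest_win": 1.0},
    "gamble": {"big_win": 0.9, "catastrophe": 0.1},
}
utility = {"modest_win": 10.0, "big_win": 15.0, "catastrophe": -1000.0}

best = max(beliefs, key=lambda a: expected_utility(a, beliefs, utility))
# A 10% chance of catastrophe swamps the gamble's higher upside,
# so the maximizer picks the safe action.
```

Even this toy shows why the choice of framework matters: change how the agent evaluates low-probability, high-cost outcomes and the chosen action flips.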

April 9, 2025 · 5 min · Austin T. O'Quinn

Game Theory for AI Safety

Why Game Theory Matters for AI Safety
Most AI safety problems are not single-agent optimization problems. They're multi-agent strategic interactions:
Nations deciding whether to regulate (→ see competitive-dynamics post)
Companies deciding how much to invest in safety
Models interacting with users, other models, and themselves across time
Researchers deciding what to publish
Game theory is the math of strategic interaction. If you don't know it, you can't reason precisely about any of these. ...
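As a concrete instance of these strategic interactions, here is a minimal sketch that brute-force checks pure Nash equilibria in a two-player prisoner's dilemma. The payoffs are the standard textbook values, chosen for illustration:

```python
# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
actions = ["C", "D"]

def is_nash(a_row, a_col):
    """Nash: neither player can gain by unilaterally deviating."""
    row_best = all(payoffs[(a_row, a_col)][0] >= payoffs[(d, a_col)][0]
                   for d in actions)
    col_best = all(payoffs[(a_row, a_col)][1] >= payoffs[(a_row, d)][1]
                   for d in actions)
    return row_best and col_best

equilibria = [(r, c) for r in actions for c in actions if is_nash(r, c)]
# Mutual defection is the unique pure equilibrium even though (C, C)
# makes both players better off: the structure behind safety races.
```

The same brute-force check generalizes to any finite two-player game by swapping the payoff dictionary, which makes it a handy sanity-check tool when reasoning about small strategic models.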

March 26, 2025 · 5 min · Austin T. O'Quinn

Why Sparsity, and How Do We Get It?

Prior reading: Gradient Descent and Backpropagation

What Is Sparsity?
A sparse representation has mostly zeros. Instead of every neuron firing for every input, only a small subset activates.

Why Sparsity Is Good
Interpretability: If only 50 of 10,000 features fire, you have a chance of understanding what they represent.
Efficiency: Sparse computations are cheaper.
Generalization: Sparse models tend to overfit less; they can't memorize with fewer active parameters.
Disentanglement: Sparse features tend to correspond to more independent, meaningful concepts.

How We Achieve Sparsity
Architectural: ...
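One standard route to mostly-zero representations is an L1 penalty, whose proximal step is soft-thresholding: small coordinates are driven exactly to zero rather than merely shrunk. A minimal numpy sketch (the activation vector and threshold are made up for illustration):

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||x||_1: shrink toward zero,
    clipping entries with |x| <= lam exactly to 0."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

acts = np.array([0.03, -1.2, 0.01, 0.8, -0.02, 2.5])
sparse = soft_threshold(acts, lam=0.1)
density = np.count_nonzero(sparse) / sparse.size
# Small entries are exactly zero; large entries survive, slightly shrunk.
```

This is why L1 (unlike L2) produces true zeros: the penalty's kink at the origin makes "exactly zero" the optimal value for any coordinate whose signal is below the threshold.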

March 12, 2025 · 1 min · Austin T. O'Quinn

What Is RL, Why Is It So Hard, and How Does It Go Wrong?

Prior reading: Gradient Descent and Backpropagation | Loss Functions and Spaces

The Setup
An agent takes actions in an environment, receives rewards, and learns a policy $\pi(a|s)$ that maximizes expected cumulative reward.

$$\max_\pi \mathbb{E}\left[\sum_{t=0}^T \gamma^t r_t\right]$$

Where Are the Neural Nets?
Neural networks serve as function approximators for things that are too complex to represent exactly:
Policy network $\pi_\theta(a|s)$: maps states to action probabilities
Value network $V_\phi(s)$: estimates expected future reward from a state
Q-network $Q_\psi(s,a)$: estimates expected future reward from a state-action pair
World model (optional) $p_\omega(s'|s,a)$: predicts next state
Without neural nets, RL only works for tiny state spaces (tabular methods). Neural nets let it scale to images, language, and continuous control. ...
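The objective $\sum_{t} \gamma^t r_t$ is just a discounted sum, and computing it backwards over a trajectory is the basic primitive behind value estimation. A minimal sketch with a made-up reward sequence and discount factor:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t by folding from the back:
    G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]  # reward arrives only at the final step
g = discounted_return(rewards, gamma=0.9)
# g = 0 + 0.9 * (0 + 0.9 * 1) = 0.81
```

Delayed rewards are geometrically devalued, which is one reason credit assignment is hard: the learning signal for an early action is both discounted and buried under the noise of everything that happened afterwards.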

February 26, 2025 · 3 min · Austin T. O'Quinn

Loss Functions, Decision Boundaries, Activation Spaces, and Why MSE

Prior reading: Gradient Descent and Backpropagation

Three Ways to Look at a Model
Loss surface: The landscape over parameter space. What the optimizer sees.
Decision boundary: The surface in input space that separates classes. What the user sees.
Activation space: The internal geometry of learned representations. What the model "thinks."
These are different views of the same object, but they behave differently.

Which Are Data-Dependent?
Loss surface: Entirely data-dependent. Change the data, change the landscape.
Decision boundary: Data-dependent through training, but fixed at inference.
Activation space: Shaped by data and architecture jointly. The architecture constrains which representations are possible; the data selects among them.

How They Relate
The loss function defines the objective. Gradient descent reshapes the decision boundary to minimize loss. The activation space is the intermediate computation that makes the decision boundary expressible. ...
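The claim that the loss surface is entirely data-dependent is easy to see for a one-parameter least-squares model: swapping the dataset moves the minimum of the surface. A minimal sketch with two made-up datasets:

```python
def mse_loss(w, xs, ys):
    """MSE of the one-parameter model y_hat = w * x over a dataset."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Same model family, same loss, two different datasets.
data_a = ([1.0, 2.0], [2.0, 4.0])  # consistent with w = 2
data_b = ([1.0, 2.0], [3.0, 6.0])  # consistent with w = 3

# Scan the 1-D "loss surface" on a grid and find its minimizer.
grid = [i / 10 for i in range(0, 51)]
w_a = min(grid, key=lambda w: mse_loss(w, *data_a))
w_b = min(grid, key=lambda w: mse_loss(w, *data_b))
# Identical architecture and objective, different data:
# a different landscape with a different minimum.
```

The optimizer never sees "the data" directly; it only sees the landscape the data induces, which is why the post calls the loss surface what the optimizer sees.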

February 12, 2025 · 3 min · Austin T. O'Quinn

Linear Algebra Proofs of Optimality for Gradient Descent

Prior reading: Gradient Descent and Backpropagation

When Gradient Descent Is Provably Optimal
For convex functions, gradient descent converges to the global minimum. For strongly convex functions, it converges exponentially fast. The proofs are clean linear algebra.

The Convex Case
A function $f$ is convex if for all $x, y$ and all $\lambda \in [0, 1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)$$

This means every local minimum is a global minimum. Gradient descent can't get stuck. ...
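The strongly convex case is easy to demonstrate numerically: on $f(x) = \tfrac{1}{2} x^\top A x$ with $A$ positive definite, gradient descent with a small enough step size contracts geometrically toward the global minimum at the origin. A minimal numpy sketch (the matrix and step size are illustrative):

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 10.0]])  # positive definite: strongly convex
x = np.array([1.0, 1.0])
eta = 0.09  # step size; needs eta < 2 / L with L = 10 the largest eigenvalue

errors = []
for _ in range(200):
    grad = A @ x        # gradient of f(x) = 0.5 * x^T A x is A x
    x = x - eta * grad  # gradient descent step
    errors.append(float(np.linalg.norm(x)))
# Distance to the global minimum (the origin) shrinks every iteration.
```

Each coordinate contracts by a fixed factor $|1 - \eta \lambda_i|$ per step, which is the exponential ("linear") convergence rate the proofs establish; with a step size above $2/L$ the largest-eigenvalue direction would diverge instead.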

January 28, 2025 · 2 min · Austin T. O'Quinn