Probably Aligned

Research notes on AI safety from Austin T. O'Quinn, PhD student at The Ohio State University. Formal methods, AI systems, interpretability, policy, and philosophy.

What Happens to a Neural Network's Geometry When You Change How It Learns?

Or: the same architecture + the same data + different learning algorithms = radically different internal structure. A gap in the Platonic Representation Hypothesis: the Platonic Representation Hypothesis (Huh et al., ICML 2024) claims that different neural networks converge toward the same internal representation of reality. The authors tested this across dozens of architectures — CNNs, ViTs, language models — and found increasing alignment as models get bigger. It's a compelling result. But every single model they tested was trained with backpropagation. ...
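The kind of representational alignment at issue can be sketched with a mutual nearest-neighbor score (Huh et al. use a mutual k-NN alignment metric; the function below is a simplified toy version under that assumption, not their implementation):

```python
import numpy as np

def mutual_knn_alignment(X, Y, k=5):
    """Toy alignment score: average fraction of shared k-nearest
    neighbors between two representation matrices computed over the
    same n inputs. X: (n, d1), Y: (n, d2)."""
    def knn_indices(Z):
        # Pairwise Euclidean distances, self excluded.
        d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        return np.argsort(d, axis=1)[:, :k]
    nx, ny = knn_indices(X), knn_indices(Y)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nx, ny)]
    return float(np.mean(overlaps))
```

Because the score only looks at neighborhood structure, it is invariant to rotations and rescalings of either representation space — exactly the "superficial differences" a convergence claim needs to ignore.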

April 2, 2026 · 11 min · Austin T. O'Quinn

CKA Says Your Models Are the Same. They Aren't.

This post is about a metric that almost everyone in representation learning trusts — and a setting where it completely breaks. If you haven't heard of CKA, that's fine. I'll define it. If you have, you probably trust it more than you should. What is CKA and why should you care? When researchers want to know whether two neural networks have learned "the same thing," they need a way to compare their internal representations. You can't just compare the raw numbers — two networks can represent the same information in completely different coordinate systems. What you want is a metric that asks: do these two networks organize information the same way, ignoring superficial differences like rotation and scale? ...
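As a concrete sketch of the metric in question — assuming the standard linear CKA of Kornblith et al. (2019); this is a minimal NumPy version for illustration, not a vetted library implementation:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices over the same
    n inputs. X: (n, d1), Y: (n, d2). Returns a value in [0, 1]."""
    # Center each feature dimension.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # HSIC-based formulation: invariant to orthogonal transforms
    # and isotropic scaling of either representation.
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den
```

Note what the invariances buy you: rotating or rescaling one network's activations leaves the score untouched, which is precisely why CKA feels trustworthy — and why the failure mode the post describes is surprising.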

April 1, 2026 · 10 min · Austin T. O'Quinn

Perfect Shields Create Unsafe Policies

This post is about a paradox in safe reinforcement learning: the better your safety mechanism works during training, the less safe the trained agent might be without it. What's a safety shield? In reinforcement learning, an agent takes actions in an environment, receives rewards, and learns a policy — a mapping from situations to actions. The goal is to maximize cumulative reward. The problem is that during training (and sometimes after), the agent might do dangerous things. ...
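The mechanism can be sketched in a few lines (hypothetical names: `is_unsafe` and `fallback` stand in for whatever safety monitor and backup controller a given environment provides):

```python
def shielded_step(state, policy, is_unsafe, fallback):
    """One shielded action selection.

    The shield intercepts unsafe proposals before they reach the
    environment, so the agent never experiences their consequences --
    which is exactly why the learned policy may silently come to
    depend on the shield being there.
    """
    action = policy(state)
    if is_unsafe(state, action):
        action = fallback(state)
    return action
```

The paradox lives in the comment: a perfect shield means the training data contains zero examples of the unsafe action's cost, so nothing in the reward signal teaches the policy to avoid it on its own.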

April 1, 2026 · 10 min · Austin T. O'Quinn

Probabilistic Security: Great Against Accidents, Useless Against Attackers

Prior reading: Jailbreaking | Reachability Analysis The Setup Suppose a model has a catastrophic failure mode — some input that causes it to produce a truly dangerous output. And suppose the probability that a random prompt triggers this failure is $10^{-100}$. Is this safe? It Depends Entirely on the Threat Model Good-Faith User (Random Inputs) If your users are cooperative — they're trying to use the model correctly and might occasionally stumble into bad prompts by accident — then $10^{-100}$ is absurdly safe. No one will ever randomly type the one prompt in $10^{100}$ that breaks the model. The sun will burn out first. ...
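The random-input case is easy to quantify with a union bound. A back-of-the-envelope sketch (the query rate and timescale are illustrative assumptions, not measurements):

```python
# Probability that ANY random query hits a p = 1e-100 failure mode,
# even at an absurd query volume. For tiny p, the union bound
# P(any) <= n * p is essentially tight.
p = 1e-100                    # per-query failure probability
queries_per_second = 1e9      # assumption: a billion queries/sec
seconds = 1e10 * 3.15e7       # assumption: ~10 billion years
n = queries_per_second * seconds
p_any = n * p
print(p_any)                  # still astronomically small
```

An adversary, of course, does not sample randomly — they search, and the bound above says nothing about search.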

February 4, 2026 · 3 min · Austin T. O'Quinn

The AI Threat Landscape: What 'Safe' Means and What We're Afraid Of

Prior reading: Mesa-Optimization and Three Lenses | Game Theory for AI Safety

Part I: What Does "Safe" Even Mean? "Make AI safe" is meaningless without specifying: safe for whom, against what threat, under what conditions?

Who Is the User?
- Public: Lowest common denominator. Must handle naive, careless, and adversarial users simultaneously.
- Internal / enterprise: Can assume some training, access controls, and monitoring.
- Knowledgeable human: Researchers, developers. Different failure modes matter.

Who Is the Adversary?
- No adversary: Accidental misuse, honest mistakes. The easiest case.
- Casual adversary: Jailbreaking for fun, social engineering. Medium difficulty.
- Sophisticated adversary: State actors, determined attackers with resources. The hard case.

What Are We Protecting?
- Users from the model: Preventing harmful outputs.
- The model from users: Preventing extraction, manipulation, prompt injection.
- Society from the system: Preventing large-scale harms (economic disruption, disinfo).
- The future from the present: Preventing lock-in, power concentration, existential risk.

Safety claims without a threat model are empty. A system "safe" for internal research may be wildly unsafe for public deployment. ...

January 21, 2026 · 4 min · Austin T. O'Quinn

Detectability of Testing

Prior reading: Probing | Reachability Analysis | Mesa-Optimization and Three Lenses The Problem If a model can detect when it's being tested, it can behave differently during evaluation than in deployment. This undermines every safety benchmark. Structural vs. Semantic Detection Structural detection: The test environment is literally different — different API calls, different system prompts, different input distributions. The model picks up on artifacts of the testing setup. Semantic detection: The model understands the meaning of being tested. It recognizes evaluation-style questions, red-team prompts, or benchmark formats from training data. ...

January 7, 2026 · 2 min · Austin T. O'Quinn

Stability of Safety

Prior reading: Gradient Descent and Backpropagation | What Are Formal Methods? | Reachability Analysis Safety as a Point in Parameter Space A model's behavior is a function of its parameters $\theta$. "Safe behavior" corresponds to a region $\mathcal{S}$ in parameter space. Training moves $\theta$ through this space. Gradient at a Point The gradient $\nabla_\theta \mathcal{L}$ tells us which direction training pushes the model. If this direction points out of $\mathcal{S}$, a single update can break safety. ...
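For a toy half-space safe region $\mathcal{S} = \{\theta : c \cdot \theta \le b\}$ — an illustrative stand-in, since real safe regions are nothing like linear — whether a single gradient step exits $\mathcal{S}$ reduces to one check:

```python
import numpy as np

def step_exits_safe_region(theta, grad, lr, c, b):
    """Does one SGD update theta - lr * grad leave the half-space
    {theta : c @ theta <= b}? A toy model of safety stability:
    safety breaks when the descent direction points across the
    boundary and the step is large enough to cross it."""
    theta_next = theta - lr * grad
    return bool(c @ theta_next > b)
```

The point of the toy: near the boundary of $\mathcal{S}$, it is the direction of $\nabla_\theta \mathcal{L}$ relative to the boundary, not the loss value, that determines whether the next update is still safe.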

December 17, 2025 · 2 min · Austin T. O'Quinn

Jailbreaking: Transference, Universality, and Why Defenses May Be Impossible

Prior reading: Safety as Capability Elicitation | Reachability Analysis | Platonic Forms What Is a Jailbreak? A jailbreak is an input that causes a safety-trained model to produce an output its safety training was designed to prevent. The refusal boundary is a decision surface in input space — a jailbreak is a point on the wrong side of it that the model fails to classify correctly. More precisely: if a model has been trained to refuse requests in some set $\mathcal{D}_{\text{dangerous}}$, a jailbreak is an input $x$ such that $x$ is semantically in $\mathcal{D}_{\text{dangerous}}$ but the model's refusal classifier maps it outside. ...

December 3, 2025 · 7 min · Austin T. O'Quinn

Safety Training as Capability Elicitation

Prior reading: When Safety Training Backfires | Probing The Paradox To refuse a dangerous request, a model must first understand what's being asked well enough to recognize it as dangerous. Training a model to filter bioweapon synthesis queries requires the model to sharpen its internal representation of bioweapon synthesis — not blur it. The safety mechanism is a drug-sniffing dog: you have to teach it what drugs smell like. The Mechanism Consider what refusal training actually does in representation space. Before safety fine-tuning, a model may have a vague, diffuse representation of some dangerous domain — enough to generate mediocre outputs if prompted, but not deeply structured. ...

November 12, 2025 · 4 min · Austin T. O'Quinn

When Safety Training Backfires

Prior reading: A Survey of Alignment Techniques | Probing The Setup Reinforcement learning from human feedback (RLHF) works by rewarding responses humans prefer and penalizing ones they don't. Usually this improves behavior. But sometimes suppressing a true but inappropriate response teaches the model the wrong lesson. The Problem: Value Attribution Is Hard When a human rates a response poorly, the training signal says "this was bad." It doesn't say why it was bad. Consider sensitive but objective topics — crime statistics disaggregated by race, for instance. A response citing real data might be rated poorly because: ...

October 29, 2025 · 4 min · Austin T. O'Quinn