Depth-Robust Safety: What Happens When You Truncate a Language Model
This isn't a jailbreak. Nobody is crafting adversarial prompts. It's an engineer deploying a model efficiently — and maybe, without realizing it, stripping out the safety training.
The question
Early exit is the idea that you can speed up a language model by stopping computation partway through the network. A model like Mistral-7B has 32 layers. If the model is "confident enough" by layer 27, you skip the last 5 layers and save 15% of the compute. Systems like CALM and LayerSkip use this in production, skipping 30–50% of layers while maintaining quality. ...
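A minimal sketch of the confidence-based early-exit loop described above, in the spirit of CALM / LayerSkip. The function names, the per-layer confidence check, and the threshold are illustrative assumptions, not the real API of either system.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_forward(layers, lm_head, hidden, threshold=0.9):
    """Run layers in sequence, stopping once the top-token probability
    at an intermediate layer exceeds `threshold`.
    Returns (hidden_state, number_of_layers_actually_run)."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if max(softmax(lm_head(hidden))) >= threshold:
            return hidden, i + 1  # exited early: later layers never run
    return hidden, len(layers)
```

The safety-relevant point is visible in the control flow: whatever the skipped layers would have contributed, including refusal behavior learned late in the network, simply never executes.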
What Happens to a Neural Network's Geometry When You Change How It Learns?
Or: the same architecture + the same data + different learning algorithms = radically different internal structure
A gap in the Platonic Representation Hypothesis
The Platonic Representation Hypothesis (Huh et al., ICML 2024) claims that different neural networks converge toward the same internal representation of reality. They tested this across dozens of architectures — CNNs, ViTs, language models — and found increasing alignment as models get bigger. It's a compelling result. But every single model they tested was trained with backpropagation. ...
Geometric Similarity Is Blind to Computational Structure
This post starts with a simple question — how would you tell if two neural networks learned the same thing? — and builds to a case where the standard answer is dangerously wrong.
How would you compare two networks?
Suppose you train two neural networks on the same task from different random initializations, and both get 99% accuracy. Did they learn the same thing? You can't just compare the raw activation values. To see why, think about a simpler example. Imagine two spreadsheets tracking student performance. One has columns [math_score, reading_score]. The other has columns [total_score, score_difference]. Both contain the same information — you can convert between them with simple arithmetic — but the raw numbers look completely different. A student with (90, 80) in the first spreadsheet would be (170, 10) in the second. ...
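The spreadsheet example is just an invertible change of basis, which a few lines make concrete. The function names are mine, chosen for the example:

```python
def to_total_diff(math_score, reading_score):
    """Convert [math, reading] to the second spreadsheet's basis."""
    return math_score + reading_score, math_score - reading_score

def to_math_reading(total, diff):
    """Invert the conversion: no information was lost."""
    return (total + diff) / 2, (total - diff) / 2
```

`to_total_diff(90, 80)` gives `(170, 10)`, and `to_math_reading` recovers the original scores exactly. The raw numbers differ, the information is identical, which is why naive activation comparison between two networks tells you nothing on its own.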
Perfect Shields Create Unsafe Policies
This post is about a paradox in safe reinforcement learning: the better your safety mechanism works during training, the less safe the trained agent might be without it.
What's a safety shield?
In reinforcement learning, an agent takes actions in an environment, receives rewards, and learns a policy — a mapping from situations to actions. The goal is to maximize cumulative reward. The problem is that during training (and sometimes after), the agent might do dangerous things. ...
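A shield, in its simplest form, sits between the policy and the environment and overrides unsafe actions. This sketch is a generic illustration, not any particular shielding framework; the state, the safety predicate, and the fallback action are all assumptions:

```python
def shielded_step(state, proposed_action, is_safe, fallback_action):
    """Return the action that will actually be executed.

    The agent's proposed action is run through a safety predicate;
    if it fails, a known-safe fallback is substituted instead."""
    if is_safe(state, proposed_action):
        return proposed_action
    return fallback_action
```

The paradox in the post's title is visible here: the agent never experiences the consequences of `proposed_action` when it is unsafe, so the learned policy can come to depend on the shield's interventions and be unsafe if deployed without it.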
Probabilistic Security: Great Against Accidents, Useless Against Attackers
Prior reading: Jailbreaking | Reachability Analysis
The Setup
Suppose a model has a catastrophic failure mode — some input that causes it to produce a truly dangerous output. And suppose the probability that a random prompt triggers this failure is $10^{-100}$. Is this safe?
It Depends Entirely on the Threat Model
Good-Faith User (Random Inputs)
If your users are cooperative — they're trying to use the model correctly and might occasionally stumble into bad prompts by accident — then $10^{-100}$ is absurdly safe. No one will ever randomly type the one prompt in $10^{100}$ that breaks the model. The sun will burn out first. ...
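The good-faith case is a one-line union bound, worth writing out because the numbers are so extreme. All quantities here are illustrative; only $p = 10^{-100}$ comes from the post:

```python
from fractions import Fraction

# p: per-prompt failure probability (from the post).
# N: a very generous count of accidental queries (my assumption).
p = Fraction(1, 10**100)
N = 10**15

# Union bound: P(at least one random failure in N queries) <= N * p.
random_risk_bound = N * p  # 1 / 10^85: effectively never, for random users
```

The bound only covers *random* inputs. An adversary does not sample uniformly: with query feedback or gradient access the search is directed, and $p$ says nothing about how cheap that directed search is.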
The AI Threat Landscape: What 'Safe' Means and What We're Afraid Of
Prior reading: Mesa-Optimization and Three Lenses | Game Theory for AI Safety
Part I: What Does "Safe" Even Mean?
"Make AI safe" is meaningless without specifying: safe for whom, against what threat, under what conditions?
Who Is the User?
Public: Lowest common denominator. Must handle naive, careless, and adversarial users simultaneously.
Internal / enterprise: Can assume some training, access controls, and monitoring.
Knowledgeable human: Researchers, developers. Different failure modes matter.
Who Is the Adversary?
No adversary: Accidental misuse, honest mistakes. The easiest case.
Casual adversary: Jailbreaking for fun, social engineering. Medium difficulty.
Sophisticated adversary: State actors, determined attackers with resources. The hard case.
What Are We Protecting?
Users from the model: Preventing harmful outputs.
The model from users: Preventing extraction, manipulation, prompt injection.
Society from the system: Preventing large-scale harms (economic disruption, disinfo).
The future from the present: Preventing lock-in, power concentration, existential risk.
Safety claims without a threat model are empty. A system "safe" for internal research may be wildly unsafe for public deployment. ...
Why Passing a Safety Test Might Mean Nothing
This post isn't about original research. It's about a problem that changed how I think about AI safety evaluation. The ideas come from recent empirical work at Anthropic, Apollo Research, and elsewhere. I want to walk through the reasoning clearly enough that the conclusion feels inevitable by the time you reach it.
The Obvious Way to Check Alignment
Suppose you've trained a powerful AI system and you want to know if it's safe. How do you check? ...
Stability of Safety
Prior reading: Gradient Descent and Backpropagation | What Are Formal Methods? | Reachability Analysis
Safety as a Point in Parameter Space
A model's behavior is a function of its parameters $\theta$. "Safe behavior" corresponds to a region $\mathcal{S}$ in parameter space. Training moves $\theta$ through this space.
Gradient at a Point
The gradient $\nabla_\theta \mathcal{L}$ tells us which direction training pushes the model. If this direction points out of $\mathcal{S}$, a single update can break safety. ...
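The "gradient pointing out of $\mathcal{S}$" picture can be made concrete in the simplest possible case: take $\mathcal{S}$ to be a half-space $\{\theta : n \cdot \theta \le b\}$ and ask whether one gradient step $\theta' = \theta - \eta \nabla_\theta \mathcal{L}$ exits it. A toy model under that stated assumption; real safe regions are not half-spaces:

```python
def step_stays_safe(theta, grad, lr, n, b):
    """Does theta' = theta - lr * grad remain in {x : dot(n, x) <= b}?

    theta, grad, n are lists of floats; n is the outward normal of the
    (assumed) half-space safe region, b its offset."""
    theta_next = [t - lr * g for t, g in zip(theta, grad)]
    return sum(ni * ti for ni, ti in zip(n, theta_next)) <= b
```

Even this toy version shows why safety is a property of the *dynamics*, not just the current point: a $\theta$ comfortably inside $\mathcal{S}$ is one large-enough step from leaving it whenever $-\nabla_\theta \mathcal{L}$ has a component along $n$.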
Jailbreaking: Transference, Universality, and Why Defenses May Be Impossible
Prior reading: Safety as Capability Elicitation | Reachability Analysis | Platonic Forms
What Is a Jailbreak?
A jailbreak is an input that causes a safety-trained model to produce an output its safety training was designed to prevent. The refusal boundary is a decision surface in input space — a jailbreak is a point on the wrong side of it that the model fails to classify correctly. More precisely: if a model has been trained to refuse requests in some set $\mathcal{D}_{\text{dangerous}}$, a jailbreak is an input $x$ such that $x$ is semantically in $\mathcal{D}_{\text{dangerous}}$ but the model's refusal classifier maps it outside. ...
Safety Training as Capability Elicitation
Prior reading: When Safety Training Backfires | Probing
The Paradox
To refuse a dangerous request, a model must first understand what's being asked well enough to recognize it as dangerous. Training a model to filter bioweapon synthesis queries requires the model to sharpen its internal representation of bioweapon synthesis — not blur it. The safety mechanism is a drug-sniffing dog: you have to teach it what drugs smell like.
The Mechanism
Consider what refusal training actually does in representation space. Before safety fine-tuning, a model may have a vague, diffuse representation of some dangerous domain — enough to generate mediocre outputs if prompted, but not deeply structured. ...