Archive

2026 ⁷

April ⁴

Depth-Robust Safety: What Happens When You Truncate a Language Model

April 2, 2026 · 8 min · Austin T. O'Quinn

What Happens to a Neural Network's Geometry When You Change How It Learns?

April 2, 2026 · 11 min · Austin T. O'Quinn

Geometric Similarity Is Blind to Computational Structure

April 1, 2026 · 11 min · Austin T. O'Quinn

Perfect Shields Create Unsafe Policies

April 1, 2026 · 9 min · Austin T. O'Quinn

February ¹

Probabilistic Security: Great Against Accidents, Useless Against Attackers

February 4, 2026 · 3 min · Austin T. O'Quinn

January ²

The AI Threat Landscape: What 'Safe' Means and What We're Afraid Of

January 21, 2026 · 4 min · Austin T. O'Quinn

Detectability of Testing

January 7, 2026 · 2 min · Austin T. O'Quinn

2025 ²⁴

December ²

Stability of Safety

December 17, 2025 · 2 min · Austin T. O'Quinn

Jailbreaking: Transference, Universality, and Why Defenses May Be Impossible

December 3, 2025 · 7 min · Austin T. O'Quinn

November ¹

Safety Training as Capability Elicitation

November 12, 2025 · 4 min · Austin T. O'Quinn

October ³

When Safety Training Backfires

October 29, 2025 · 4 min · Austin T. O'Quinn

A Survey of Alignment Techniques and Their Trade-Offs

October 15, 2025 · 7 min · Austin T. O'Quinn

P-Hacking as Optimization: Implications for Safety Benchmarking

October 1, 2025 · 2 min · Austin T. O'Quinn

September ²

Chain of Thought Is Hackable (By the Model)

September 17, 2025 · 2 min · Austin T. O'Quinn

Systems vs. Components: The Chinese Room and ROMs

September 3, 2025 · 2 min · Austin T. O'Quinn

August ²

Layers of Safety: From Data to Deployment

August 20, 2025 · 2 min · Austin T. O'Quinn

Mesa-Optimization and the Optimization Pressure Spectrum

August 6, 2025 · 13 min · Austin T. O'Quinn

July ²

Mechanistic Interpretability: Circuits, Superposition, and Sparse Autoencoders

July 16, 2025 · 8 min · Austin T. O'Quinn

Platonic Forms in Near-Capacity Models

July 2, 2025 · 2 min · Austin T. O'Quinn

June ²

Probing: What Do Models Actually Know?

June 18, 2025 · 2 min · Austin T. O'Quinn

The Specification and Language Problem

June 4, 2025 · 2 min · Austin T. O'Quinn

May ²

Model Checking and Formal Specification

May 21, 2025 · 1 min · Austin T. O'Quinn

Reachability Analysis for Neural Networks

May 7, 2025 · 1 min · Austin T. O'Quinn

April ²

What Are Formal Methods in AI Safety?

April 23, 2025 · 1 min · Austin T. O'Quinn

Decision Theory for AI Safety

April 9, 2025 · 5 min · Austin T. O'Quinn

March ²

Game Theory for AI Safety

March 26, 2025 · 5 min · Austin T. O'Quinn

Why Sparsity, and How Do We Get It?

March 12, 2025 · 1 min · Austin T. O'Quinn

February ²

What Is RL, Why Is It So Hard, and How Does It Go Wrong?

February 26, 2025 · 3 min · Austin T. O'Quinn

Loss Functions, Decision Boundaries, Activation Spaces, and Why MSE

February 12, 2025 · 3 min · Austin T. O'Quinn

January ²

Linear Algebra Proofs of Optimality for Gradient Descent

January 28, 2025 · 2 min · Austin T. O'Quinn

Gradient Descent, Backpropagation, and the Misconceptions That Tripped Me Up

January 15, 2025 · 36 min · Austin T. O'Quinn
