Why Passing a Safety Test Might Mean Nothing
This post isn't about original research. It's about a problem that changed how I think about AI safety evaluation. The ideas come from recent empirical work at Anthropic, Apollo Research, and elsewhere. I want to walk through the reasoning clearly enough that the conclusion feels inevitable by the time you reach it.

The Obvious Way to Check Alignment

Suppose you've trained a powerful AI system and you want to know if it's safe. How do you check? ...