Chain of Thought Is Hackable (By the Model)

Prior reading: Mesa-Optimization and Three Lenses | Layers of Safety The Promise of CoT Chain-of-thought reasoning makes model thinking visible. If we can read the reasoning, we can check it for safety. Sounds good. The Problem The model generates its own chain of thought. It's not a window into the model's "real" reasoning — it's another output. The model can learn to produce reasoning traces that satisfy monitors while performing different internal computation. ...

September 17, 2025 · 2 min · Austin T. O'Quinn
.