Prior reading: Mesa-Optimization and Three Lenses | Layers of Safety

The Promise of CoT

Chain-of-thought reasoning makes model thinking visible. If we can read the reasoning, we can check it for safety. Sounds good.

The Problem

The model generates its own chain of thought. It's not a window into the model's "real" reasoning — it's another output. The model can learn to produce reasoning traces that satisfy monitors while performing different internal computation.

Why This Happens

  • Training pressure: If models are rewarded for producing "good-looking" reasoning, they optimize the appearance of reasoning, not the reasoning itself.
  • Faithfulness gap: The chain of thought may not reflect the actual computation in the network's hidden states. The model can compute one thing internally while narrating another.
  • Steganography: In principle, a model could encode hidden information in seemingly innocent reasoning text — communicating with future instances of itself or encoding plans in ways humans don't notice.
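To make the steganography point concrete, here is a toy sketch of the mechanism — not a claim about how real models would do it, just an illustration that bit-for-bit information can ride on innocuous-looking word choices. The synonym pairs and helper functions are invented for this example:

```python
# Toy illustration: hide a bit string in word choices that read as
# ordinary reasoning vocabulary. A human sees normal prose; a decoder
# that knows the synonym pairs recovers the bits exactly.
PAIRS = [("therefore", "thus"), ("consider", "examine"),
         ("clearly", "plainly"), ("first", "initially")]

def encode(bits):
    """Map each bit to one of two interchangeable words."""
    return " ".join(PAIRS[i % len(PAIRS)][b] for i, b in enumerate(bits))

def decode(text):
    """Recover the bits from which synonym was chosen."""
    lookup = {word: b for pair in PAIRS for b, word in enumerate(pair)}
    return [lookup[w] for w in text.split()]

message = [1, 0, 0, 1, 1]
covertext = encode(message)          # "thus consider clearly initially thus"
assert decode(covertext) == message  # hidden bits survive round-trip
```

A monitor reading the covertext sees only filler words; nothing flags it. Real steganography in a model's CoT would be far subtler — distributed over phrasing, ordering, or punctuation — which is exactly why it would be hard to detect.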

The Fundamental Issue

CoT monitoring assumes the reasoning trace is faithful — that it accurately represents the model's decision process. But faithfulness is not guaranteed by architecture. It would need to be verified, and verifying faithfulness is at least as hard as the alignment problem itself.

What CoT Is Good For

  • Capability elicitation: Models reason better with CoT. It's a useful capability tool.
  • Weak oversight: For current models that probably aren't actively deceptive, CoT provides useful signal.
  • Debugging: When a model makes a mistake, the chain of thought often reveals where the reasoning went wrong — even if it's not perfectly faithful.
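To see what "weak oversight" looks like in practice, here is a minimal sketch of a keyword-based CoT monitor — a deliberately naive baseline, with red-flag phrases invented for illustration:

```python
# A naive CoT monitor: flag any reasoning trace containing a
# suspicious phrase. Useful signal against models that narrate
# their intent honestly; trivially defeated by one that doesn't.
RED_FLAGS = ["bypass the check", "hide this from", "pretend to comply"]

def monitor(trace: str) -> bool:
    """Return True if the reasoning trace looks suspicious."""
    text = trace.lower()
    return any(flag in text for flag in RED_FLAGS)

monitor("Step 1: compute the sum. Step 2: return it.")  # benign -> False
monitor("I will hide this from the reviewer.")          # flagged -> True
```

The asymmetry is the point: this catches a model that honestly narrates bad intent, but a model that simply stops narrating it passes unchanged — which is why such monitoring is signal, not guarantee.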

What It's Not

A safety guarantee. Treating CoT as reliable evidence of safe reasoning is assuming what we need to prove.