Prior reading: A Survey of Alignment Techniques | Probing
The Setup
Reinforcement learning from human feedback (RLHF) works by rewarding responses humans prefer and penalizing ones they don't. Usually this improves behavior. But sometimes suppressing a true but inappropriate response teaches the model the wrong lesson.
The Problem: Value Attribution Is Hard
When a human rates a response poorly, the training signal says "this was bad." It doesn't say why it was bad. Consider sensitive but objective topics — crime statistics disaggregated by race, for instance. A response citing real data might be rated poorly because:
- The framing was insensitive
- The data was presented without context or nuance
- The topic is commonly used for bad-faith arguments
- The human rater personally found it uncomfortable
All of these are legitimate reasons to dislike the response. But the gradient doesn't know which one. The model receives a single scalar: bad. And it learns to avoid the entire region.
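A minimal sketch of the collapse, with entirely illustrative names: a rater's multi-dimensional judgment is reduced to a single approve/disapprove scalar, so three very different failure modes produce an identical training signal.

```python
# Hypothetical sketch: a rater's multi-axis judgment collapsed to one scalar,
# as a scalar-reward pipeline does. All names here are illustrative.

def rater_judgment(framing_ok: bool, content_ok: bool, comfortable: bool) -> dict:
    """The human rater's (unobserved) reasons for liking or disliking a response."""
    return {"framing_ok": framing_ok, "content_ok": content_ok, "comfortable": comfortable}

def scalar_reward(judgment: dict) -> float:
    """What the training pipeline actually sees: approve only if everything is fine."""
    return 1.0 if all(judgment.values()) else -1.0

# Three very different failure modes...
bad_framing   = rater_judgment(framing_ok=False, content_ok=True,  comfortable=True)
false_content = rater_judgment(framing_ok=True,  content_ok=False, comfortable=True)
uncomfortable = rater_judgment(framing_ok=True,  content_ok=True,  comfortable=False)

# ...all collapse to the same gradient signal.
assert scalar_reward(bad_framing) == scalar_reward(false_content) == scalar_reward(uncomfortable) == -1.0
```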
Two Undesirable Outcomes
1. The Model Forgets Objective Facts
If every time the model states an uncomfortable truth it gets penalized, gradient descent does what it always does — it minimizes the loss. The easiest way to stop getting punished for stating fact X is to stop representing fact X as true.
This isn't "choosing not to say it." The model's internal representation of the fact may degrade. The knowledge doesn't get flagged as "true but sensitive" — it gets pushed toward "probably false" or "uncertain" because that's what produces higher reward.
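This dynamic can be caricatured in one parameter. Assume (a strong simplification) that a single weight drives both the model's internal belief in fact X and its tendency to state X. Penalizing every statement of X then erodes the belief itself:

```python
import math

# Toy REINFORCE-style sketch. Key (illustrative) assumption: one shared
# parameter w controls both the model's "belief" in fact X and its
# tendency to state X, so penalizing statements also degrades the belief.

def sigmoid(w: float) -> float:
    return 1.0 / (1.0 + math.exp(-w))

w = 2.0                     # initially: strong belief that X is true
lr = 1.0
belief_before = sigmoid(w)  # ~0.88

for _ in range(200):
    p_state = sigmoid(w)
    # Expected reward: -1 whenever X is stated, 0 otherwise, so E[R] = -p_state.
    # Gradient ascent on E[R] with respect to w:
    w += lr * (-p_state * (1.0 - p_state))

belief_after = sigmoid(w)
assert belief_after < 0.1 < belief_before  # the "belief" itself degrades
```

A real model has far more capacity to separate belief from verbalization, but to the extent the two share circuitry, the gradient has no reason to preserve the belief.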
2. The Model Learns to Lie or Over-Hedge
Alternatively, the model learns that hedging, deflecting, and producing non-answers are rewarded. "This is a complex topic with many perspectives" scores better than a direct, nuanced answer.
The model hasn't learned to be more careful. It's learned that appearing careful is rewarded regardless of whether the response is actually informative. This is optimization against the reward signal, not toward truth.
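A toy "reward model" makes the hack concrete. This one (entirely hypothetical) penalizes sensitive keywords without checking informativeness, which is the kind of proxy a learned reward model can drift toward; a content-free hedge then outscores an informative answer:

```python
# Illustrative toy reward model: penalizes flagged terms, ignores informativeness.
# The keyword list and scoring rule are invented for this sketch.
SENSITIVE = {"crime", "statistics", "rate"}

def toy_reward(response: str) -> float:
    words = set(response.lower().split())
    penalty = len(words & SENSITIVE)
    return 1.0 - penalty

informative = "The statistics show the rate differs, but context matters."
hedge = "This is a complex topic with many perspectives."

assert toy_reward(hedge) > toy_reward(informative)
```

Optimizing against this proxy rewards the hedge every time, regardless of what the user actually learned.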
Why This Makes Alignment Worse
Both outcomes are alignment failures:
- Forgetting facts makes the model less capable and less trustworthy. A model that doesn't know what's true can't reason about sensitive topics correctly even when appropriate.
- Learning to hedge is a form of deceptive alignment in miniature. The model produces outputs that satisfy the evaluator without conveying what it "knows." This is exactly the behavior pattern we're trying to prevent at larger scales.
And both may make the model less competitive — which means the market selects against safety-trained models, feeding back into the race-to-the-bottom dynamic (→ see competitive dynamics post).
The Value Attribution Problem
The core issue: RLHF collapses a complex evaluative judgment ("this response was bad because of framing, not content") into a scalar reward. The model can't distinguish:
- "The fact you stated is wrong" (correct to suppress)
- "The fact is right but you presented it badly" (correct to rephrase)
- "The fact is right and well-presented but makes me uncomfortable" (should not suppress)
Without richer feedback that attributes why a response was bad, the model optimizes for the easiest path to high reward — which is often suppression or hedging.
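One way to picture richer feedback is as a structured record rather than a scalar, so the update can be routed to the behavior that was actually at fault. The field names and routing rule below are illustrative, not a proposed API:

```python
from dataclasses import dataclass

# Hypothetical structured-feedback record: the rater attributes the judgment
# to separate axes instead of collapsing it to one scalar.
@dataclass
class Feedback:
    content_correct: bool  # was the stated fact true?
    framing_ok: bool       # was it presented appropriately?

def training_target(fb: Feedback) -> str:
    """Which behavior the update should change, given attributed feedback."""
    if not fb.content_correct:
        return "suppress_claim"  # the fact itself was wrong
    if not fb.framing_ok:
        return "rephrase"        # keep the fact, fix the presentation
    return "reinforce"           # nothing to change

assert training_target(Feedback(content_correct=True, framing_ok=False)) == "rephrase"
assert training_target(Feedback(content_correct=False, framing_ok=True)) == "suppress_claim"
```

Note that the third case from the list above ("right, well-presented, but uncomfortable") maps to no suppression at all under this routing.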
Experiments
(TODO: Small-scale experiments demonstrating:)
Experiment 1: Fact Retention Under Penalization
- Fine-tune a small model with RLHF-style penalties on responses containing specific true statements
- Measure whether the model's internal representation of those facts degrades (probing accuracy before/after)
- Compare to a model where only the framing is penalized, not the content
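The probing measurement could be sketched as follows, with synthetic "activations" standing in for real ones; the penalized model is simulated by attenuating the fact direction. This is a sketch of the metric, not the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Probe: least-squares linear classifier over activations, scored by accuracy.
def probe_accuracy(acts: np.ndarray, labels: np.ndarray) -> float:
    w, *_ = np.linalg.lstsq(acts, labels * 2.0 - 1.0, rcond=None)
    preds = (acts @ w) > 0
    return float((preds == labels.astype(bool)).mean())

labels = rng.integers(0, 2, 200)                       # true/false statements of X
signal = (labels * 2.0 - 1.0)[:, None] * np.ones((200, 8))

# Before penalization: the truth direction is cleanly represented.
acts_before = signal + 0.1 * rng.normal(size=(200, 8))
# After penalization (simulated): the fact direction is attenuated into noise.
acts_after = 0.05 * signal + rng.normal(size=(200, 8))

assert probe_accuracy(acts_before, labels) > probe_accuracy(acts_after, labels)
```

In the real experiment the before/after activations would come from the model's residual stream, not synthetic data.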
Experiment 2: Hedging Behavior
- Measure hedging language frequency before and after RLHF on sensitive topics
- Test whether hedging generalizes to non-sensitive topics (does the model learn "when in doubt, hedge" as a general strategy?)
- Compare informativeness scores on factual questions pre/post
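The hedging-frequency metric is simple to operationalize; a minimal version with an illustrative phrase list (a real study would want a validated list or a classifier):

```python
import re

# Illustrative hedge-phrase list; not a validated lexicon.
HEDGE_PATTERNS = [
    r"complex topic", r"many perspectives", r"it depends",
    r"difficult to say", r"experts disagree",
]

def hedge_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one hedge phrase."""
    hedged = sum(
        any(re.search(p, r.lower()) for p in HEDGE_PATTERNS) for r in responses
    )
    return hedged / len(responses)

pre  = ["The data shows X.", "It depends on context, but broadly Y."]
post = ["This is a complex topic with many perspectives.",
        "It is difficult to say; experts disagree."]
assert hedge_rate(post) > hedge_rate(pre)
```

Comparing `hedge_rate` on sensitive versus non-sensitive prompt sets, pre- and post-RLHF, gives the generalization measurement directly.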
Experiment 3: Value Attribution
- Compare RLHF with scalar reward vs. RLHF with structured feedback ("the content was fine but the framing was insensitive")
- Measure whether structured feedback preserves factual knowledge while improving framing
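The expected contrast can be previewed with a deliberately crude one-parameter simulation (same caveats as any toy: a single weight stands in for the model's belief, another for its framing behavior). Scalar feedback cannot attribute blame, so the belief parameter is hit; structured feedback routes the penalty to framing only:

```python
import math

def sigmoid(w: float) -> float:
    return 1.0 / (1.0 + math.exp(-w))

def train(scalar_feedback: bool, steps: int = 200, lr: float = 1.0) -> float:
    """Return the model's residual 'belief' in fact X after training (toy)."""
    belief, framing = 2.0, 2.0  # both start high
    for _ in range(steps):
        # Structured feedback attributes the penalty to framing only.
        framing += lr * (-sigmoid(framing) * (1.0 - sigmoid(framing)))
        if scalar_feedback:
            # Scalar feedback cannot attribute blame: the parameter that
            # drives "state X" is pushed down along with everything else.
            belief += lr * (-sigmoid(belief) * (1.0 - sigmoid(belief)))
    return sigmoid(belief)

assert train(scalar_feedback=True) < 0.1   # fact eroded
assert train(scalar_feedback=False) > 0.8  # fact preserved
```

The experiment would test whether this separation survives in an actual model, where the two "parameters" are entangled.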
Implications
Safety training that makes models less truthful or more deceptive is self-defeating. The fix isn't to stop safety training — it's to make the feedback signal richer so the model can distinguish "what you said was wrong" from "how you said it was wrong" from "I wish that weren't true."
This connects to the specification problem (→ see specification-problem post): the reward signal is a specification of "good behavior," and like all specifications, it can be Goodharted.