Prior reading: When Safety Training Backfires | Probing

The Paradox

To refuse a dangerous request, a model must first understand what's being asked well enough to recognize it as dangerous. Training a model to filter bioweapon synthesis queries requires the model to sharpen its internal representation of bioweapon synthesis — not blur it.

The safety mechanism is a drug-sniffing dog: you have to teach it what drugs smell like.

The Mechanism

Consider what refusal training actually does in representation space. Before safety fine-tuning, a model may have a vague, diffuse representation of some dangerous domain — enough to generate mediocre outputs if prompted, but not deeply structured.

Refusal training introduces a classification boundary: this is a dangerous request / this is not. To draw that boundary accurately, the model needs a crisper representation of the dangerous domain. The gradient pushes the model toward:

  1. Better recognition of dangerous content (sharper features for detection)
  2. Behavioral suppression of dangerous outputs (a thin layer on top saying "don't")
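Step 1 can be made concrete with a toy sketch. Everything below is a hypothetical stand-in: two-dimensional "representations" where only one dimension carries the dangerous/benign signal, and refusal training reduced to fitting a logistic boundary on "is this dangerous?". The point is just that the gradient of an accurate refusal boundary concentrates weight on whichever direction carries the dangerous signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): 2-D "representations" where only dim 0
# carries the dangerous/benign signal; dim 1 is pure noise.
n = 400
labels = rng.integers(0, 2, n)          # 1 = dangerous request
x = rng.normal(size=(n, 2))
x[:, 0] += 1.5 * labels                 # weak signal before training

# Refusal training reduced to logistic regression on the boundary.
# The cross-entropy gradient rewards weights along the direction
# that discriminates dangerous from benign.
w = np.zeros(2)
for _ in range(500):
    p = 1 / (1 + np.exp(-x @ w))
    w -= 0.1 * x.T @ (p - labels) / n   # gradient step on cross-entropy

# The learned boundary concentrates on the signal dimension:
# |w[0]| >> |w[1]| -- the "sharper features for detection" of step 1.
print(w)
```

Nothing in this objective asks the model to know *less* about the dangerous direction; accuracy on the boundary requires the opposite.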

The behavioral layer is shallow. The capability deepening may be structural.

The Drug-Sniffing Dog Problem

A drug-sniffing dog has a more refined understanding of drug chemistry (via scent) than an untrained dog. The training didn't remove knowledge — it added a behavioral response (alert) on top of enhanced detection.

Similarly, a refusal-trained model may:

  • Have better internal representations of dangerous knowledge than the base model
  • Encode those representations more clearly and accessibly
  • Suppress them only at the output layer, not at the representation layer

This means a jailbreak that bypasses the output-level refusal accesses a model that is more capable at the dangerous task than the base model ever was.

Why Behavioral Layers Are Thin

Refusal is typically trained via RLHF or DPO — methods that primarily adjust the model's output distribution. They change what the model says, not what the model knows or can compute.
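To see why these methods are output-level, it helps to look at what the objective actually touches. The DPO loss (Rafailov et al.) is written entirely in terms of completion log-probabilities; a minimal numpy sketch:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss on a single preference pair.

    The objective is a function of output log-probabilities only:
    push the preferred (refusing) completion up and the dispreferred
    (complying) one down, relative to a frozen reference model.
    Nothing in it asks internal representations to forget anything.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -np.log(1 / (1 + np.exp(-margin)))   # -log sigmoid(margin)

# A refusal preferred over a compliant completion (toy numbers):
loss_before = dpo_loss(-5.0, -2.0, -5.0, -2.0)  # policy == reference
loss_after  = dpo_loss(-3.0, -6.0, -5.0, -2.0)  # refusal up, compliance down
print(loss_after < loss_before)                  # True: loss falls as outputs shift
```

Gradients of this loss do flow through the whole network, but the training signal itself is defined purely on what the model says, which is the sense in which the intervention lives at the output level.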

The depth of the safety intervention matters:

  • Output-level refusal: "Don't say this." Easily bypassed.
  • Representation-level suppression: Actually degrade the internal features. Harder to bypass, but also destroys capability you might need (dual-use knowledge).
  • The gap between them: Most current safety training lives at the output level while unintentionally sharpening the representation level.

Analogy to Adversarial Training

In adversarial robustness, training a model to resist adversarial examples makes the model's features more aligned with human-perceptible features — the representations get cleaner, more structured. Adversarial training improves representation quality as a side effect.

Refusal training may do the same for dangerous-domain representations: the features get cleaner and more structured because accurate refusal demands it.
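The adversarial-training side of the analogy can be sketched on a toy problem. Everything here is synthetic: a two-feature logistic classifier where dim 0 is a robust, large-margin feature and dim 1 is a fragile shortcut, with FGSM-style perturbations standing in for adversarial training. The qualitative effect is the one the analogy relies on: training under attack shifts weight onto the robust feature.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy task: dim 0 is a robust, large-margin feature; dim 1 is a
# fragile shortcut that a small perturbation can flip.
n, eps = 400, 0.6
y = rng.integers(0, 2, n).astype(float)
s = 2 * y - 1
x = np.stack([2.0 * s + rng.normal(size=n),
              0.3 * s + rng.normal(scale=0.05, size=n)]).T

def train(adversarial, steps=400, lr=0.1):
    w = np.zeros(2)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-x @ w))
        # FGSM-style perturbation against the current weights
        xi = x + eps * np.sign((p - y)[:, None] * w) if adversarial else x
        p = 1 / (1 + np.exp(-xi @ w))
        w -= lr * xi.T @ (p - y) / n    # logistic-regression gradient step
    return w

w_std, w_adv = train(False), train(True)
reliance = lambda w: abs(w[1]) / abs(w[0])   # relative weight on the shortcut
print(reliance(w_adv) < reliance(w_std))     # adversarial training cleans features
```

The adversarially trained weights lean on the feature that survives perturbation; the standard weights happily use the fragile shortcut. That is the "cleaner representations as a side effect" claim in miniature.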

Experiments

(TODO: Small-scale experiments demonstrating the effect)

Experiment 1: Probing Before and After Refusal Training

  • Take a base model and a refusal-fine-tuned version
  • Train linear probes on internal activations for dangerous-domain knowledge (e.g., chemistry, exploit structure)
  • Measure whether probe accuracy increases after safety training
  • If it does: safety training sharpened the representation
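A minimal version of the probing pipeline, with synthetic activations standing in for the real residual-stream activations of a base and a refusal-tuned model (the sharper separation in the "tuned" activations is the hypothesis under test, injected by construction here):

```python
import numpy as np

rng = np.random.default_rng(2)

def probe_accuracy(acts, labels, steps=300, lr=0.5):
    """Fit a linear probe (logistic regression) on frozen activations;
    report accuracy as a crude measure of linear separability."""
    w = np.zeros(acts.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-acts @ w))
        w -= lr * acts.T @ (p - labels) / len(labels)
    return np.mean((acts @ w > 0) == labels)

# Synthetic stand-ins for activations on dangerous vs. benign prompts:
# the "refusal-tuned" model separates the two classes more cleanly
# along one direction.
n, d = 600, 32
labels = rng.integers(0, 2, n)
base  = rng.normal(size=(n, d)); base[:, 0]  += 0.5 * labels   # weak signal
tuned = rng.normal(size=(n, d)); tuned[:, 0] += 2.0 * labels   # sharp signal

print(probe_accuracy(tuned, labels), probe_accuracy(base, labels))
```

On real models the activations would come from forward hooks at a fixed layer, with a held-out split; the decision rule is the same: higher probe accuracy after safety training means the representation got sharper.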

Experiment 2: Jailbreak Capability Comparison

  • Compare the quality of dangerous outputs from:
    • Base model (no refusal training) when directly prompted
    • Refusal-trained model when successfully jailbroken
  • If the jailbroken safety-trained model produces better dangerous outputs: the capability was elicited, not suppressed
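The comparison harness itself is simple. Everything named here is a hypothetical placeholder: `generate(model, prompt)` for sampling and `grade(output)` for an expert rubric scoring the technical quality of the dangerous content (the hard part of this experiment is the grader, not the loop):

```python
def capability_gap(base_model, tuned_model, prompts, jailbreak, generate, grade):
    """Mean grade of the jailbroken safety-tuned model minus mean grade
    of the directly prompted base model. `generate` and `grade` are
    hypothetical helpers supplied by the experimenter."""
    base_scores = [grade(generate(base_model, p)) for p in prompts]
    jb_scores   = [grade(generate(tuned_model, jailbreak(p))) for p in prompts]
    mean = lambda xs: sum(xs) / len(xs)
    # Positive gap: the jailbroken safety-tuned model outperforms the
    # base model on the dangerous task -- capability elicited, not suppressed.
    return mean(jb_scores) - mean(base_scores)
```

With stub functions the arithmetic is easy to check; with real models the sign of the gap is the result.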

Experiment 3: Representation Depth

  • Use activation patching or causal tracing to identify which layers encode dangerous-domain knowledge
  • Compare depth profiles between base and refusal-trained models
  • Hypothesis: refusal training concentrates dangerous knowledge in later layers (closer to output) where refusal can act on it — making it more accessible, not less
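The patching logic can be sketched on a toy residual "model" (real experiments would use forward hooks in a library such as TransformerLens; the four random layers below are synthetic stand-ins). Patch each layer's contribution from a "dangerous" run into a "benign" run and measure how far the output moves; the resulting per-layer profile is the depth profile to compare across models:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy residual stream: 4 layers, each adding a nonlinear delta.
layers = [0.5 * rng.normal(size=(8, 8)) for _ in range(4)]

def run(x, patch_layer=None, patch_deltas=None):
    deltas = []
    for i, W in enumerate(layers):
        d = np.tanh(W @ x)
        if i == patch_layer:
            d = patch_deltas[i]        # splice in the other run's delta
        deltas.append(d)
        x = x + d
    return x, deltas

x_benign = rng.normal(size=8)
x_danger = rng.normal(size=8)
out_b, _ = run(x_benign)
out_d, deltas_d = run(x_danger)

# Depth profile: output movement when each layer's delta is patched
# from the dangerous run into the benign run.
profile = [np.linalg.norm(run(x_benign, i, deltas_d)[0] - out_b)
           for i in range(4)]
print(profile)
```

In a real model, a profile that peaks in late layers after refusal training, relative to the base model, would support the hypothesis that the dangerous knowledge was moved somewhere more accessible.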

Implications

This creates a structural tension in safety engineering:

  • Accurate refusal requires deep understanding of what to refuse
  • Deep understanding means the capability is present and potentially accessible
  • Shallow refusal (just block certain keywords) is easy to bypass and doesn't require capability sharpening — but also doesn't work against sophisticated prompts

The uncomfortable conclusion: the better your safety training is at recognizing dangerous requests, the more capable the underlying model may become at the dangerous task. Safety and capability may be coupled in ways that output-level interventions can't decouple.

This connects to the broader layers-of-safety picture (→ see layers-of-safety post): refusal is one layer, and it may be undermining itself at the representation level. It also connects to when-safety-training-backfires — that post covers suppression causing forgetting or hedging; this one covers the opposite failure mode where safety training amplifies what it tries to suppress.