Prior reading: Gradient Descent and Backpropagation | What Are Formal Methods? | Reachability Analysis

Safety as a Point in Parameter Space

A model's behavior is a function of its parameters $\theta$. "Safe behavior" corresponds to a region $\mathcal{S}$ in parameter space. Training moves $\theta$ through this space.

Gradient at a Point

The gradient $\nabla_\theta \mathcal{L}$ tells us which direction training pushes the model. If this direction points out of $\mathcal{S}$, a single update can break safety.
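A minimal sketch of this failure mode, assuming a toy setup: the safe region $\mathcal{S}$ is a ball of radius $R$ around a reference point, and the loss is a simple quadratic whose minimum lies outside $\mathcal{S}$ (all names and values here are illustrative, not from the text above):

```python
import numpy as np

# Assumed toy safe region S: a ball of radius R around theta_ref.
R = 1.0
theta_ref = np.zeros(2)

def in_safe_region(theta):
    """Membership test for the toy safe region S."""
    return np.linalg.norm(theta - theta_ref) <= R

def grad_loss(theta):
    """Gradient of a toy quadratic loss pulling theta toward (2, 0),
    a target that lies *outside* S."""
    return theta - np.array([2.0, 0.0])

theta = np.array([0.9, 0.0])          # safe, but close to the boundary
eta = 0.5
theta_next = theta - eta * grad_loss(theta)  # one gradient-descent update

print(in_safe_region(theta))        # True
print(in_safe_region(theta_next))   # False: a single update exited S
```

The gradient here points out of $\mathcal{S}$ (toward the loss minimum), so one ordinary-looking update crosses the boundary.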

Curvature of the Region

The Hessian $H = \nabla^2_\theta \mathcal{L}$ describes how the loss surface curves around the current parameters. That curvature determines how training interacts with the safety boundary:

  • Flat regions: Large step sizes are safe. Small gradients won't push you out.
  • Sharp regions: Even small steps can cross a safety boundary.
  • Saddle points: Safety may be stable in some directions, unstable in others.
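The three cases above can be distinguished from the Hessian's eigenvalues. A sketch under assumed conventions (the tolerance `flat_tol` and the category names are illustrative choices, not standard definitions):

```python
import numpy as np

def classify_curvature(H, flat_tol=1e-3):
    """Classify local curvature from the eigenvalues of a symmetric Hessian H.

    Assumed convention: eigenvalues near zero -> flat; mixed signs -> saddle;
    otherwise (large same-sign eigenvalues) -> sharp.
    """
    eigvals = np.linalg.eigvalsh(H)  # symmetric H has real eigenvalues
    if np.all(np.abs(eigvals) < flat_tol):
        return "flat"
    if np.any(eigvals < -flat_tol) and np.any(eigvals > flat_tol):
        return "saddle"
    return "sharp"

print(classify_curvature(np.diag([1e-5, 1e-6])))   # flat
print(classify_curvature(np.diag([50.0, 40.0])))   # sharp
print(classify_curvature(np.diag([10.0, -10.0])))  # saddle
```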

Step Size Implications

If you're in a safe region with high curvature near the boundary, your learning rate $\eta$ must be small enough that the update step $\|\eta \nabla_\theta \mathcal{L}\|$ doesn't overshoot the boundary.

$$\eta < \frac{d(\theta, \partial\mathcal{S})}{\|\nabla_\theta \mathcal{L}\|}$$

This is only a rough first-order bound; the real geometry is more complex, since the boundary $\partial\mathcal{S}$ is generally curved and the gradient changes along the step.
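The bound can be computed directly. A sketch assuming $d(\theta, \partial\mathcal{S})$ is known (in practice this distance is exactly what's hard to obtain):

```python
import numpy as np

def max_safe_lr(dist_to_boundary, grad):
    """Largest eta such that ||eta * grad|| < dist_to_boundary,
    per the first-order bound above."""
    gnorm = np.linalg.norm(grad)
    if gnorm == 0.0:
        return float("inf")  # zero gradient: no movement, any eta is safe
    return dist_to_boundary / gnorm

grad = np.array([3.0, 4.0])     # ||grad|| = 5
print(max_safe_lr(0.5, grad))   # 0.1
```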

Live Learning Makes This Worse

Deployed models that continue learning from user interaction face:

  • Non-stationary data distributions
  • Adversarial inputs that push parameters toward the safety boundary
  • No opportunity to pause and verify
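A toy simulation of the combined effect, reusing the ball-shaped region assumption from earlier (the step size is an exact binary fraction so the arithmetic is deterministic): no single update is large, but a stream of adversarially aligned gradients accumulates drift across the boundary, and with no pause for verification nothing catches it.

```python
import numpy as np

R = 1.0                  # assumed toy safe region: ball of radius R
theta = np.zeros(2)
eta = 0.0625             # small per-step learning rate (1/16, exact in binary)

for t in range(30):
    adversarial_grad = np.array([-1.0, 0.0])  # always points out of S
    theta = theta - eta * adversarial_grad    # ordinary SGD update
    if np.linalg.norm(theta) > R:
        print(f"left safe region at step {t}")  # prints at step 16
        break
```

Each step moves $\theta$ by only $0.0625$, well under any per-step bound near the center, yet the cumulative drift still exits $\mathcal{S}$.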

Implications

Safety isn't just about reaching a safe point — it's about staying there under perturbation. Formal methods need to account for the dynamics of training, not just the snapshot.

There's a related but distinct problem: even if safety is stable during training, the training regime itself may prevent safety from being learned. See Perfect Shields Create Unsafe Policies — a runtime safety shield can sever the feedback loop that the policy needs to internalize safe behavior.