Prior reading: Gradient Descent and Backpropagation | What Are Formal Methods? | Reachability Analysis
Safety as a Point in Parameter Space
A model's behavior is a function of its parameters $\theta$. "Safe behavior" corresponds to a region $\mathcal{S}$ in parameter space. Training moves $\theta$ through this space.
Gradient at a Point
The gradient $\nabla_\theta \mathcal{L}$ tells us which direction training pushes the model. If this direction points out of $\mathcal{S}$, a single update can break safety.
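This can be made concrete with a toy sketch (all names and the disc-shaped safe region are illustrative assumptions, not a real safety spec): a 2D parameter vector, a loss whose gradient points toward the boundary of $\mathcal{S}$, and a single gradient step that exits the region.

```python
import numpy as np

# Hypothetical setup: safe region S is the unit disc ||theta|| <= 1.
def loss_grad(theta):
    # Gradient of L(theta) = -theta[0], so gradient descent pushes
    # theta[0] upward -- directly toward the boundary of S.
    return np.array([-1.0, 0.0])

def safety_margin(theta):
    # Signed distance to the boundary of S; positive means inside.
    return 1.0 - np.linalg.norm(theta)

theta = np.array([0.9, 0.0])    # safe, but close to the boundary
eta = 0.2                       # learning rate

step = -eta * loss_grad(theta)  # gradient descent update
print(safety_margin(theta))         # ~0.1: inside S
print(safety_margin(theta + step))  # negative: one update has left S
```

The point of the sketch is that nothing in the update rule looks at the safety margin: gradient descent only sees the loss.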
Curvature of the Region
The Hessian $H = \nabla^2_\theta \mathcal{L}$ describes how the loss surface curves around the current point. This determines how risky a given step size is:
- Flat regions: Large step sizes are safe. Small gradients won't push you out.
- Sharp regions: Even small steps can cross a safety boundary.
- Saddle points: Safety may be stable in some directions, unstable in others.
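The three cases above can be read off the Hessian's eigenvalues. A minimal sketch, assuming a symmetric Hessian and arbitrary illustrative thresholds (`flat_tol` and the sharpness cutoff are hypothetical, not principled constants):

```python
import numpy as np

def classify_curvature(H, flat_tol=1e-3):
    # Eigenvalues of a symmetric Hessian characterize local curvature.
    eig = np.linalg.eigvalsh(H)
    if np.all(np.abs(eig) < flat_tol):
        return "flat"    # small curvature everywhere: larger steps stay local
    if np.min(eig) < 0 < np.max(eig):
        return "saddle"  # stable in some directions, unstable in others
    if np.max(np.abs(eig)) > 1.0:
        return "sharp"   # high curvature: small steps move far in loss
    return "moderate"

print(classify_curvature(np.diag([1e-4, 1e-4])))  # flat
print(classify_curvature(np.diag([50.0, 2.0])))   # sharp
print(classify_curvature(np.diag([3.0, -3.0])))   # saddle
```

For real networks the full Hessian is intractable; in practice one would estimate its extreme eigenvalues (e.g. by power iteration on Hessian-vector products) rather than materialize $H$.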
Step Size Implications
If you're in a safe region with high curvature near the boundary, your learning rate must be small enough that the step $\|\eta \nabla_\theta \mathcal{L}\|$ doesn't overshoot the boundary:
$$\eta < \frac{d(\theta, \partial\mathcal{S})}{\|\nabla_\theta \mathcal{L}\|}$$
where $d(\theta, \partial\mathcal{S})$ is the distance from $\theta$ to the boundary of the safe region.
This is a rough bound — the real geometry is more complex.
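The bound can be sketched numerically. As before, the disc-shaped safe region and the specific gradient are illustrative assumptions; the sketch only shows that an $\eta$ below the bound cannot cross the boundary in a single step:

```python
import numpy as np

def distance_to_boundary(theta, radius=1.0):
    # d(theta, dS) for the toy safe region S = {theta : ||theta|| <= radius}
    return radius - np.linalg.norm(theta)

def max_safe_eta(theta, grad):
    # The rough bound from the text: eta < d(theta, dS) / ||grad||
    return distance_to_boundary(theta) / np.linalg.norm(grad)

theta = np.array([0.8, 0.0])
grad = np.array([-2.0, 0.0])  # descent direction points toward the boundary

eta_max = max_safe_eta(theta, grad)
print(eta_max)  # ~0.1

# One update at a compliant learning rate stays inside S:
eta = 0.5 * eta_max
theta_next = theta - eta * grad
print(distance_to_boundary(theta_next) > 0)  # True
```

Note the bound is per-step: it says nothing about where a full trajectory of updates ends up, which is exactly why the real geometry is more complex.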
Live Learning Makes This Worse
Deployed models that continue learning from user interaction face:
- Non-stationary data distributions
- Adversarial inputs whose gradients push parameters toward the safety boundary
- No opportunity to pause and verify
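These three failure modes compound. A toy sketch (the online learner, the adversarial gradient, and the disc-shaped safe region are all illustrative assumptions): each user interaction triggers one live update, the adversary's inputs consistently point toward the boundary, and with no verification pause the margin erodes monotonically until it is gone.

```python
import numpy as np

def safety_margin(theta, radius=1.0):
    # Signed distance to the boundary of the toy safe region.
    return radius - np.linalg.norm(theta)

theta = np.array([0.0, 0.0])
eta = 0.25
adversarial_grad = np.array([-1.0, 0.0])  # each step moves theta[0] upward

steps = 0
while safety_margin(theta) > 0:
    theta = theta - eta * adversarial_grad  # live update, no checkpoint
    steps += 1

print(steps)  # 4: a margin of 1.0 eroded at 0.25 per interaction
```

In an offline regime the margin would be checked between training runs; here the first opportunity to notice the drift is after safety is already gone.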
Implications
Safety isn't just about reaching a safe point — it's about staying there under perturbation. Formal methods need to account for the dynamics of training, not just the snapshot.
There's a related but distinct problem: even if safety is stable during training, the training regime itself may prevent safety from being learned. See Perfect Shields Create Unsafe Policies — a runtime safety shield can sever the feedback loop that the policy needs to internalize safe behavior.