Stability of Safety
Prior reading: Gradient Descent and Backpropagation | What Are Formal Methods? | Reachability Analysis Safety as a Point in Parameter Space A model's behavior is a function of its parameters $\theta$. "Safe behavior" corresponds to a region $\mathcal{S}$ in parameter space. Training moves $\theta$ through this space. Gradient at a Point The gradient $\nabla_\theta \mathcal{L}$ tells us which direction training pushes the model. If this direction points out of $\mathcal{S}$, a single update can break safety. ...