Prior reading: Gradient Descent and Backpropagation

When Gradient Descent Is Provably Optimal

For convex functions, gradient descent converges to the global minimum. For strongly convex functions, the error shrinks geometrically each step (what optimizers call linear convergence). The proofs are clean linear algebra.

The Convex Case

A function $f$ is convex if for all $x, y$ and all $\lambda \in [0, 1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)$$

This means every local minimum is a global minimum. Gradient descent can't get stuck.
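The defining inequality is easy to spot-check numerically. The sketch below samples random points and mixing weights for the convex function $f(x) = x^2$ (my choice of test function, not one from the text) and confirms Jensen's inequality at every sample:

```python
import numpy as np

# Spot-check the convexity inequality f(λx + (1-λ)y) <= λf(x) + (1-λ)f(y)
# for f(x) = x^2, a simple convex example function.
f = lambda x: x ** 2

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2)
    lam = rng.uniform()          # λ drawn from [0, 1]
    # Small epsilon absorbs floating-point rounding.
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-12
print("convexity inequality holds at all sampled points")
```

A random check like this is not a proof, but it is a quick sanity test when you suspect a loss term is (or is not) convex.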

Convergence Rate

For an $L$-smooth convex function with step size $\eta = 1/L$:

$$f(\theta_T) - f(\theta^*) \leq \frac{L \|\theta_0 - \theta^*\|^2}{2T}$$

This is $O(1/T)$ convergence — slow but guaranteed.
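The bound can be verified empirically on a convex quadratic, where $L$ is the largest Hessian eigenvalue and the minimizer is known in closed form. The matrix below is an arbitrary PSD example of my own, not one from the text:

```python
import numpy as np

# Gradient descent on f(θ) = ½ θᵀAθ with step η = 1/L, checking the
# O(1/T) guarantee  f(θ_T) - f(θ*) <= L ||θ0 - θ*||² / (2T)  at every T.
A = np.diag([4.0, 1.0, 0.25])   # PSD Hessian; L = largest eigenvalue
L = 4.0
theta_star = np.zeros(3)        # minimizer of this quadratic
f = lambda th: 0.5 * th @ A @ th
f_star = 0.0

theta0 = np.array([1.0, -2.0, 3.0])
theta = theta0.copy()
for T in range(1, 201):
    theta = theta - (1.0 / L) * (A @ theta)   # ∇f(θ) = Aθ, step η = 1/L
    bound = L * np.sum((theta0 - theta_star) ** 2) / (2 * T)
    assert f(theta) - f_star <= bound + 1e-12
print("O(1/T) bound holds for every T up to 200")
```

Note the bound decays like $1/T$ while the actual suboptimality on a quadratic decays much faster; the theorem covers the worst case over all $L$-smooth convex functions.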

Strong Convexity and Linear Convergence

If $f$ is also $\mu$-strongly convex (when twice differentiable, its Hessian satisfies $\nabla^2 f \succeq \mu I$), the error contracts by a constant factor per step, giving exponential decay in $T$:

$$\|\theta_T - \theta^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^T \|\theta_0 - \theta^*\|^2$$

The condition number $\kappa = L/\mu$ controls the rate: roughly $\kappa \log(1/\epsilon)$ iterations suffice to reach accuracy $\epsilon$.
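The contraction bound can also be checked directly on a strongly convex quadratic, where $\mu$ and $L$ are the smallest and largest Hessian eigenvalues. Again the matrix is an illustrative example of my own:

```python
import numpy as np

# Strongly convex quadratic f(θ) = ½ θᵀAθ with μ = λ_min(A), L = λ_max(A).
# Check ||θ_T - θ*||² <= (1 - μ/L)^T ||θ0 - θ*||² at every step.
A = np.diag([10.0, 2.0, 1.0])   # example Hessian; κ = L/μ = 10
mu, L = 1.0, 10.0
theta_star = np.zeros(3)

theta0 = np.ones(3)
theta = theta0.copy()
for T in range(1, 101):
    theta = theta - (1.0 / L) * (A @ theta)           # step η = 1/L
    lhs = np.sum((theta - theta_star) ** 2)
    rhs = (1 - mu / L) ** T * np.sum((theta0 - theta_star) ** 2)
    assert lhs <= rhs + 1e-12
print("linear-rate bound holds for every T up to 100")
```

On a diagonal quadratic each coordinate contracts independently by $(1 - a_i/L)$; the slowest one, governed by $\mu$, dominates the rate, which is exactly what the $(1 - \mu/L)^T$ factor captures.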

Why This Matters (and Why It Breaks)

Neural network loss surfaces are not convex. These guarantees don't apply directly. But they build intuition for:

  • Why conditioning matters (ill-conditioned = slow convergence)
  • Why preconditioning helps (Adam's per-parameter step sizes act as a rough diagonal preconditioner)
  • What we lose when we leave the convex world

The Gap

The gap between "provably optimal for convex functions" and "works mysteriously well for neural networks" is one of the biggest open questions in deep learning theory.