Prior reading: Gradient Descent and Backpropagation
Three Ways to Look at a Model
- Loss surface: The landscape over parameter space. What the optimizer sees.
- Decision boundary: The surface in input space that separates classes. What the user sees.
- Activation space: The internal geometry of learned representations. What the model "thinks."
These are different views of the same object, but they behave differently.
Which Are Data-Dependent?
- Loss surface: Entirely data-dependent. Change the data, change the landscape.
- Decision boundary: Data-dependent through training, but fixed at inference.
- Activation space: Shaped by data and architecture jointly. The architecture constrains which representations are possible; the data selects among them.
How They Relate
The loss function defines the objective. Gradient descent reshapes the decision boundary to minimize loss. The activation space is the intermediate computation that makes the decision boundary expressible.
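The three views can be sketched on a tiny hand-built model. Everything below is illustrative: the two-parameter network, the data, and the parameter values are hypothetical, not the result of training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A tiny 1-D "network": hidden activation h = tanh(w1*x), output p = sigmoid(w2*h).
w1, w2 = 2.0, 3.0  # illustrative values, not trained

def predict(x):
    h = math.tanh(w1 * x)   # activation space: the internal representation of x
    return sigmoid(w2 * h)  # output probability

# View 1: the loss surface is loss as a function of the parameters, data held fixed.
data = [(-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1)]

def loss(a, b):
    total = 0.0
    for x, y in data:
        p = sigmoid(b * math.tanh(a * x))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))  # cross-entropy
    return total / len(data)

# View 2: the decision boundary is the x where predict(x) crosses 0.5 (here x = 0).
assert predict(-0.1) < 0.5 < predict(0.1)

# View 3: activation space is where each input lands internally.
activations = [math.tanh(w1 * x) for x, _ in data]
```

Moving through parameter space (view 1) simultaneously moves the boundary (view 2) and the internal geometry (view 3): one object, three views.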
Why Mini-Batch Works
Mini-batch SGD estimates the full gradient from a subset. Why is this okay?
- The expected gradient over mini-batches equals the full gradient (unbiased)
- The variance adds noise that acts as regularization
- Smaller batches = more noise = flatter minima = better generalization (an empirical tendency, not a guarantee)
- Larger batches = less noise = sharper minima = more accurate steps and faster per-step convergence, but potentially worse generalization
The mini-batch size trades per-step compute against the variance of the gradient estimate; since the estimate stays unbiased, this is a compute-variance trade-off, not a bias-variance one.
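The unbiasedness claim is easy to check numerically. A sketch on synthetic data with a hypothetical one-parameter linear model; the point is only that the mini-batch gradient averages out to the full-batch gradient.

```python
import random

random.seed(0)
# Synthetic regression data for a one-parameter model y_hat = w * x.
xs = [random.uniform(-1.0, 1.0) for _ in range(100)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
pairs = list(zip(xs, ys))
w = 1.0  # current parameter value (deliberately off, so the gradient is nonzero)

def grad(batch):
    # Gradient of batch MSE w.r.t. w: (2/|B|) * sum((w*x - y) * x)
    return 2.0 / len(batch) * sum((w * x - y) * x for x, y in batch)

full = grad(pairs)  # full-batch gradient

# Average the mini-batch gradient over many random batches of size 10:
# the sample mean converges to the full gradient (unbiased estimator).
n_batches = 20000
est = sum(grad(random.sample(pairs, 10)) for _ in range(n_batches)) / n_batches
assert abs(est - full) < 0.02
```

Any single mini-batch gradient is noisy; it is only the expectation over batches that matches the full gradient.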
Deep Dive: Why MSE Is So Good
Mean squared error: $\mathcal{L} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$
It's the default loss for regression. Why?
Statistical: Under Gaussian noise, minimizing MSE is equivalent to maximum likelihood estimation. If your errors are normally distributed, the MSE minimizer is provably the maximum-likelihood fit.
Geometric: MSE minimizes the Euclidean distance between predictions and targets. Euclidean distance is the "natural" distance in flat space.
Optimization: MSE is smooth, differentiable everywhere, and convex in the predictions (and in the parameters of a linear model). The gradient is clean: $\nabla_{\hat{y}_i} \mathcal{L} = -\frac{2}{n}(y_i - \hat{y}_i)$. No discontinuities, no plateaus.
Intuition: MSE punishes large errors quadratically. An error of 10 is 100x worse than an error of 1. This makes it aggressive about reducing big mistakes — which is usually what you want.
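The "clean gradient" claim can be verified directly. A sketch that checks the analytic MSE gradient against a finite-difference approximation; the prediction and target values are invented for illustration.

```python
def mse(preds, targets):
    # Mean squared error over a batch of predictions.
    n = len(preds)
    return sum((t - p) ** 2 for p, t in zip(preds, targets)) / n

def mse_grad(preds, targets):
    # Analytic gradient w.r.t. each prediction: -(2/n) * (y_i - y_hat_i)
    n = len(preds)
    return [-2.0 / n * (t - p) for p, t in zip(preds, targets)]

preds, targets = [0.5, 2.0, -1.0], [1.0, 1.5, 0.0]
analytic = mse_grad(preds, targets)

# Finite-difference check: bump each prediction by eps and compare slopes.
eps = 1e-6
for i in range(len(preds)):
    bumped = list(preds)
    bumped[i] += eps
    numeric = (mse(bumped, targets) - mse(preds, targets)) / eps
    assert abs(numeric - analytic[i]) < 1e-4
```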
When MSE Isn't Good
- Outliers: That quadratic penalty means outliers dominate the loss. One bad data point can ruin the fit.
- Non-Gaussian noise: If errors are heavy-tailed, MSE overweights extreme values. Use MAE or Huber loss instead.
- Classification: MSE on class labels doesn't produce calibrated probabilities. Use cross-entropy.
- Structured outputs: When the output space has non-Euclidean geometry (rotations, distributions, sequences), Euclidean distance is the wrong metric.
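The outlier point is worth seeing in numbers. A sketch comparing MSE, MAE, and Huber loss on the same residuals before and after one outlier is added; the residual values are invented for illustration.

```python
def mse(residuals):
    return sum(r * r for r in residuals) / len(residuals)

def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)

def huber(residuals, delta=1.0):
    # Quadratic inside [-delta, delta], linear outside: bounded outlier influence.
    def h(r):
        a = abs(r)
        return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)
    return sum(h(r) for r in residuals) / len(residuals)

clean = [0.1, -0.2, 0.15, -0.05]
with_outlier = clean + [10.0]  # one bad point

# How much does the single outlier inflate each loss?
mse_ratio = mse(with_outlier) / mse(clean)
mae_ratio = mae(with_outlier) / mae(clean)
huber_ratio = huber(with_outlier) / huber(clean)

# The quadratic penalty makes MSE by far the most outlier-sensitive.
assert mse_ratio > huber_ratio > mae_ratio
```

On these numbers the MSE ratio is in the hundreds while the MAE ratio stays under twenty, which is exactly why one bad data point can ruin an MSE fit.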
The Disconnect
You can have low loss but bad decision boundaries (overfitting). You can have clean activation spaces but brittle decision boundaries (adversarial vulnerability). These views don't always agree — and the disagreements are where safety problems hide.