Prior reading: Gradient Descent and Backpropagation
Three Ways to Look at a Model
- Loss surface: The landscape over parameter space. What the optimizer sees.
- Decision boundary: The surface in input space that separates classes. What the user sees.
- Activation space: The internal geometry of learned representations. What the model "thinks."
These are different views of the same object, but they behave differently.
Which Are Data-Dependent?
- Loss surface: Entirely data-dependent. Change the data, change the landscape.
- Decision boundary: Data-dependent through training, but fixed at inference.
- Activation space: Shaped by data and architecture jointly. The architecture constrains which representations are possible; the data selects among them.
How They Relate
The loss function defines the objective. Gradient descent reshapes the decision boundary to minimize loss. The activation space is the intermediate computation that makes the decision boundary expressible.
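The three views can be sketched on a tiny hand-built model. Everything below is illustrative: the two-parameter network, the data, and the parameter values are hypothetical, not the result of training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A tiny 1-D "network": hidden activation h = tanh(w1*x), output p = sigmoid(w2*h).
w1, w2 = 2.0, 3.0  # illustrative values, not trained

def predict(x):
    h = math.tanh(w1 * x)   # activation space: the internal representation of x
    return sigmoid(w2 * h)  # output probability

# View 1: the loss surface is loss as a function of the parameters, data held fixed.
data = [(-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1)]

def loss(a, b):
    total = 0.0
    for x, y in data:
        p = sigmoid(b * math.tanh(a * x))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))  # cross-entropy
    return total / len(data)

# View 2: the decision boundary is the x where predict(x) crosses 0.5 (here x = 0).
assert predict(-0.1) < 0.5 < predict(0.1)

# View 3: activation space is where each input lands internally.
activations = [math.tanh(w1 * x) for x, _ in data]
```

Moving through parameter space (view 1) simultaneously moves the boundary (view 2) and the internal geometry (view 3): one object, three views.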
Why Mini-Batch Works
Mini-batch SGD estimates the full gradient from a subset. Why is this okay?
- The expected gradient over mini-batches equals the full gradient (unbiased)
- The variance adds noise that acts as regularization
- Smaller batches = more noise = flatter minima = better generalization (an empirical tendency, not a guarantee)
- Larger batches = less noise = sharper minima = more accurate steps and faster per-step convergence, but potentially worse generalization
The mini-batch size trades per-step compute against the variance of the gradient estimate; since the estimate stays unbiased, this is a compute-variance trade-off, not a bias-variance one.
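The unbiasedness claim is easy to check numerically. A sketch on synthetic data with a hypothetical one-parameter linear model; the point is only that the mini-batch gradient averages out to the full-batch gradient.

```python
import random

random.seed(0)
# Synthetic regression data for a one-parameter model y_hat = w * x.
xs = [random.uniform(-1.0, 1.0) for _ in range(100)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
pairs = list(zip(xs, ys))
w = 1.0  # current parameter value (deliberately off, so the gradient is nonzero)

def grad(batch):
    # Gradient of batch MSE w.r.t. w: (2/|B|) * sum((w*x - y) * x)
    return 2.0 / len(batch) * sum((w * x - y) * x for x, y in batch)

full = grad(pairs)  # full-batch gradient

# Average the mini-batch gradient over many random batches of size 10:
# the sample mean converges to the full gradient (unbiased estimator).
n_batches = 20000
est = sum(grad(random.sample(pairs, 10)) for _ in range(n_batches)) / n_batches
assert abs(est - full) < 0.02
```

Any single mini-batch gradient is noisy; it is only the expectation over batches that matches the full gradient.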
Deep Dive: Why MSE Is So Good
Mean squared error: $\mathcal{L} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$
It's the default loss for regression. Why?
Statistical: Under Gaussian noise, minimizing MSE is equivalent to maximum likelihood estimation. If your errors are normally distributed, the MSE minimizer is provably the maximum-likelihood fit.
Geometric: MSE minimizes the Euclidean distance between predictions and targets. Euclidean distance is the "natural" distance in flat space.
Optimization: MSE is smooth, differentiable everywhere, and convex in the predictions (and in the parameters of a linear model). The gradient is clean: $\nabla_{\hat{y}_i} \mathcal{L} = -\frac{2}{n}(y_i - \hat{y}_i)$. No discontinuities, no plateaus.
Intuition: MSE punishes large errors quadratically. An error of 10 is 100x worse than an error of 1. This makes it aggressive about reducing big mistakes — which is usually what you want.
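The "clean gradient" claim can be verified directly. A sketch that checks the analytic MSE gradient against a finite-difference approximation; the prediction and target values are invented for illustration.

```python
def mse(preds, targets):
    # Mean squared error over a batch of predictions.
    n = len(preds)
    return sum((t - p) ** 2 for p, t in zip(preds, targets)) / n

def mse_grad(preds, targets):
    # Analytic gradient w.r.t. each prediction: -(2/n) * (y_i - y_hat_i)
    n = len(preds)
    return [-2.0 / n * (t - p) for p, t in zip(preds, targets)]

preds, targets = [0.5, 2.0, -1.0], [1.0, 1.5, 0.0]
analytic = mse_grad(preds, targets)

# Finite-difference check: bump each prediction by eps and compare slopes.
eps = 1e-6
for i in range(len(preds)):
    bumped = list(preds)
    bumped[i] += eps
    numeric = (mse(bumped, targets) - mse(preds, targets)) / eps
    assert abs(numeric - analytic[i]) < 1e-4
```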
When MSE Isn't Good
- Outliers: That quadratic penalty means outliers dominate the loss. One bad data point can ruin the fit.
- Non-Gaussian noise: If errors are heavy-tailed, MSE overweights extreme values. Use MAE or Huber loss instead.
- Classification: MSE on class labels doesn't produce calibrated probabilities. Use cross-entropy.
- Structured outputs: When the output space has non-Euclidean geometry (rotations, distributions, sequences), Euclidean distance is the wrong metric.
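The outlier point is worth seeing in numbers. A sketch comparing MSE, MAE, and Huber loss on the same residuals before and after one outlier is added; the residual values are invented for illustration.

```python
def mse(residuals):
    return sum(r * r for r in residuals) / len(residuals)

def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)

def huber(residuals, delta=1.0):
    # Quadratic inside [-delta, delta], linear outside: bounded outlier influence.
    def h(r):
        a = abs(r)
        return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)
    return sum(h(r) for r in residuals) / len(residuals)

clean = [0.1, -0.2, 0.15, -0.05]
with_outlier = clean + [10.0]  # one bad point

# How much does the single outlier inflate each loss?
mse_ratio = mse(with_outlier) / mse(clean)
mae_ratio = mae(with_outlier) / mae(clean)
huber_ratio = huber(with_outlier) / huber(clean)

# The quadratic penalty makes MSE by far the most outlier-sensitive.
assert mse_ratio > huber_ratio > mae_ratio
```

On these numbers the MSE ratio is in the hundreds while the MAE ratio stays under twenty, which is exactly why one bad data point can ruin an MSE fit.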
The Disconnect
You can have low loss but bad decision boundaries (overfitting). You can have clean activation spaces but brittle decision boundaries (adversarial vulnerability). These views don't always agree — and the disagreements are where safety problems hide.