Prior reading: Gradient Descent and Backpropagation

What Is Sparsity?

A sparse representation has mostly zeros. Instead of every neuron firing for every input, only a small subset activates.
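To make "mostly zeros" concrete, here is a minimal sketch (the activation vector is a made-up toy example) that measures sparsity as the fraction of exactly-zero entries:

```python
import numpy as np

# Toy activation vector: only 2 of 8 "neurons" fire for this input.
acts = np.array([0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.7, 0.0])

# Sparsity = fraction of entries that are exactly zero.
sparsity = np.mean(acts == 0)
print(sparsity)  # 0.75
```

In a real network you would measure this over many inputs; a representation is sparse when this fraction stays high on average.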

Why Sparsity Is Good

  • Interpretability: If only 50 of 10,000 features fire for a given input, you have a realistic chance of understanding what each one represents
  • Efficiency: Sparse computations are cheaper — zero activations can be skipped entirely
  • Generalization: Sparse models tend to overfit less; with fewer parameters active per input, there is less capacity to memorize noise
  • Disentanglement: Sparse features tend to correspond to more independent, meaningful concepts

How We Achieve Sparsity

Architectural:

  • ReLU activations naturally produce zeros (negative inputs → 0)
  • Mixture of Experts: only a subset of parameters active per input
  • Dropout during training: random sparsity as regularization
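The ReLU point above can be seen directly: any negative pre-activation is clamped to exactly zero, so roughly half of randomly-initialized units are silent on a given input. A small sketch with toy values:

```python
import numpy as np

def relu(x):
    # Negative pre-activations map to exactly zero, not merely small values.
    return np.maximum(x, 0.0)

pre = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # toy pre-activations
post = relu(pre)
print(post)                 # [0.  0.  0.  0.5 2. ]
print(np.mean(post == 0))   # 0.6 — three of five units are silent
```

This is why ReLU networks produce genuinely sparse activation patterns, whereas smooth activations like sigmoid or tanh only produce small-but-nonzero values.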

Training-based:

  • L1 regularization: $\mathcal{L} + \lambda \|\theta\|_1$ directly penalizes non-zero weights
  • Sparse autoencoders: train to reconstruct with a sparsity penalty on the bottleneck
  • Pruning: train a dense model, then remove the smallest-magnitude weights
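One concrete way the L1 penalty above drives weights to exactly zero is through its proximal operator, soft thresholding: each optimization step shrinks every weight toward zero and zeroes out any weight smaller than the penalty strength. A minimal numpy sketch (the weight vector and λ are illustrative, not from any trained model):

```python
import numpy as np

def soft_threshold(w, lam):
    # Proximal operator of the L1 penalty lam * ||w||_1:
    # shrinks each weight by lam and sets |w| <= lam exactly to zero.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.9, -0.05, 0.02, -1.2, 0.003])
w_sparse = soft_threshold(w, lam=0.1)
print(w_sparse)  # [ 0.8  0.   0.  -1.1  0. ]
```

Note that plain gradient descent on an L2 penalty only makes weights small; it is the kink in the L1 penalty at zero that makes them exactly zero, which is what yields true sparsity.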

Emergent:

  • Large models often develop sparse activation patterns naturally
  • Superposition: models pack more features than dimensions by using sparse, overlapping codes

The Safety Connection

Sparse, interpretable features are easier to monitor, verify, and constrain. If you can identify which features correspond to dangerous capabilities, you can intervene. Dense, entangled representations make this nearly impossible.