Prior reading: Gradient Descent and Backpropagation
What Is Sparsity?
A sparse representation is one in which most values are exactly zero: instead of every neuron firing for every input, only a small subset activates.
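As a quick illustration, here is a minimal sketch (the vector and the `sparsity` helper are my own, purely for demonstration) that measures how sparse an activation vector is:

```python
# Hypothetical activation vector: most entries are exactly zero.
activations = [0.0, 0.0, 1.7, 0.0, 0.0, 0.0, 0.3, 0.0]

def sparsity(values, tol=0.0):
    """Fraction of entries that are (near-)zero."""
    zeros = sum(1 for v in values if abs(v) <= tol)
    return zeros / len(values)

print(sparsity(activations))  # 6 of 8 entries are zero -> 0.75
```

In a real network the vector would have thousands of dimensions, but the idea is the same: sparsity is just the fraction of inactive units for a given input.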
Why Sparsity Is Good
- Interpretability: If only 50 of 10,000 features fire, you have a chance of understanding what they represent
- Efficiency: Sparse computations are cheaper
- Generalization: Sparse models tend to overfit less; with fewer parameters active per input, they have less capacity to memorize
- Disentanglement: Sparse features tend to correspond to more independent, meaningful concepts
How We Achieve Sparsity
Architectural:
- ReLU activations naturally produce zeros (negative inputs → 0)
- Mixture of Experts: only a subset of parameters active per input
- Dropout: randomly zeroes activations during training, acting as a regularizer
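To make the ReLU point concrete, a toy sketch (values chosen arbitrarily) showing how negative pre-activations become exact zeros:

```python
def relu(x):
    # Negative pre-activations map to exactly zero, creating sparsity.
    return max(0.0, x)

pre_activations = [-1.2, 0.5, -0.3, 2.0, -0.7]
post = [relu(x) for x in pre_activations]
print(post)  # [0.0, 0.5, 0.0, 2.0, 0.0] -> 3 of 5 units inactive
```

This is why ReLU networks are sparse "for free": any unit whose weighted input is negative contributes nothing downstream for that input.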
Training-based:
- L1 regularization: $\mathcal{L} + \lambda \|\theta\|_1$ directly penalizes non-zero weights
- Sparse autoencoders: train to reconstruct with a sparsity penalty on the bottleneck
- Pruning: train dense, then remove small weights
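Two of the mechanisms above can be sketched in a few lines (a pure-Python toy with made-up weights and hyperparameters, not a real training loop):

```python
# Toy weight vector after (hypothetical) dense training.
weights = [0.01, -0.8, 0.002, 1.5, -0.04]

# L1 regularization: add lambda * sum(|w|) to the loss. The gradient of
# this term pushes small weights toward exactly zero during training.
lam = 0.1
l1_penalty = lam * sum(abs(w) for w in weights)  # ~0.2352 here

# Magnitude pruning: after dense training, zero out weights whose
# magnitude falls below a threshold, then (typically) fine-tune.
threshold = 0.05
pruned = [w if abs(w) >= threshold else 0.0 for w in weights]
print(pruned)  # [0.0, -0.8, 0.0, 1.5, 0.0]
```

A sparse autoencoder works on the same principle, but the penalty is applied to the bottleneck activations rather than the weights, so the learned code, not the network itself, becomes sparse.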
Emergent:
- Large models often develop sparse activation patterns naturally
- Superposition: models pack more features than dimensions by using sparse, overlapping codes
The Safety Connection
Sparse, interpretable features are easier to monitor, verify, and constrain. If you can identify which features correspond to dangerous capabilities, you can intervene. Dense, entangled representations make this nearly impossible.