Prior reading: Gradient Descent and Backpropagation

What Is Sparsity?

A sparse representation has mostly zeros. Instead of every neuron firing for every input, only a small subset activates.
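To make "mostly zeros" concrete, here is a minimal sketch (the activation vector is a made-up toy example) that measures sparsity as the fraction of exactly-zero entries:

```python
import numpy as np

# Toy activation vector: only 2 of 8 "neurons" fire for this input.
acts = np.array([0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.7, 0.0])

# Sparsity = fraction of entries that are exactly zero.
sparsity = np.mean(acts == 0)
print(sparsity)  # 0.75
```

In a real network you would measure this over many inputs; a representation is sparse when this fraction stays high on average.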

Why Sparsity Is Good

  • Interpretability: If only 50 of 10,000 features fire for a given input, you have a realistic chance of understanding what each one represents
  • Efficiency: Sparse computations are cheaper — zero activations can be skipped entirely
  • Generalization: Sparse models tend to overfit less; with fewer parameters active per input, there is less capacity to memorize noise
  • Disentanglement: Sparse features tend to correspond to more independent, meaningful concepts

How We Achieve Sparsity

Architectural:

  • ReLU activations naturally produce zeros (negative inputs → 0)
  • Mixture of Experts: only a subset of parameters active per input
  • Dropout during training: random sparsity as regularization
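The ReLU point above can be seen directly: any negative pre-activation is clamped to exactly zero, so roughly half of randomly-initialized units are silent on a given input. A small sketch with toy values:

```python
import numpy as np

def relu(x):
    # Negative pre-activations map to exactly zero, not merely small values.
    return np.maximum(x, 0.0)

pre = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # toy pre-activations
post = relu(pre)
print(post)                 # [0.  0.  0.  0.5 2. ]
print(np.mean(post == 0))   # 0.6 — three of five units are silent
```

This is why ReLU networks produce genuinely sparse activation patterns, whereas smooth activations like sigmoid or tanh only produce small-but-nonzero values.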

Training-based:

  • L1 regularization: $\mathcal{L} + \lambda \|\theta\|_1$ directly penalizes non-zero weights
  • Sparse autoencoders: train to reconstruct with a sparsity penalty on the bottleneck
  • Pruning: train a dense model, then remove the smallest-magnitude weights
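One concrete way the L1 penalty above drives weights to exactly zero is through its proximal operator, soft thresholding: each optimization step shrinks every weight toward zero and zeroes out any weight smaller than the penalty strength. A minimal numpy sketch (the weight vector and λ are illustrative, not from any trained model):

```python
import numpy as np

def soft_threshold(w, lam):
    # Proximal operator of the L1 penalty lam * ||w||_1:
    # shrinks each weight by lam and sets |w| <= lam exactly to zero.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.9, -0.05, 0.02, -1.2, 0.003])
w_sparse = soft_threshold(w, lam=0.1)
print(w_sparse)  # [ 0.8  0.   0.  -1.1  0. ]
```

Note that plain gradient descent on an L2 penalty only makes weights small; it is the kink in the L1 penalty at zero that makes them exactly zero, which is what yields true sparsity.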

Emergent:

  • Large models often develop sparse activation patterns naturally
  • Superposition: models pack more features than dimensions by using sparse, overlapping codes

The Safety Connection

Sparse, interpretable features are easier to monitor, verify, and constrain. If you can identify which features correspond to dangerous capabilities, you can intervene. Dense, entangled representations make this nearly impossible.