Bet 48 — Magnitude pruning at 50%
If we drop half the weights of a transformer, which half should we drop? The two obvious candidates are magnitude (drop the smallest absolute values) and random (drop a uniform 50% selection). Both are sub-second to apply, both produce a 50%-sparse model. Which works?
The bet validates magnitude pruning as overwhelmingly better than random. Across five diverse eval texts, magnitude pruning produces perplexity 145× to 1700× lower than random pruning at the same sparsity. The result is unambiguous: magnitude is the right default for any pruning operation in the federation. The bet also rules out the "all weights are equally important" prior that would justify random pruning: they aren't equally important, and the largest-magnitude weights are doing most of the function-computation work.
This bet matters because the federation's wire-format compression strategy depends on knowing which compressions are safe and which aren't. Magnitude pruning at 50% is a candidate cost-cutter in scenarios where the federation needs to push beyond the production-default ternary + norm-only wire format. The bet validates that this cost-cutter is sound; production doesn't apply it by default but it's available if a deployment needs the additional bandwidth headroom.
Background — what pruning means and why "which weights to drop" matters
A transformer's weights are a pile of matrices and vectors — for FractalMoE 30M, ~30 million scalar parameters distributed across attention QKV projections, FFN expert weights, embedding tables, etc. Pruning sets some fraction of these scalars to exactly zero, producing a sparse model. The motivation is compression: a model with 50% of its weights set to zero can be stored more efficiently (you only need to store the non-zero weights and their indices) and computed more efficiently (sparse matrix multiplication can skip the zeroed positions).
The question is which 50% to zero out. Two policies:
- Magnitude pruning: sort all weights by absolute value; zero the smallest 50%.
- Random pruning: uniformly sample 50% of weights; zero them.
The intuition behind magnitude pruning is that small-magnitude weights contribute less to the output than large-magnitude weights — they're closer to zero already, so zeroing them is a smaller perturbation. The intuition behind random pruning is that the network's redundancy is uniform — if 50% of weights are roughly redundant, any 50% should work.
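Both policies are a few lines each. A minimal sketch, assuming a PyTorch model held on CPU; the global-threshold detail mirrors the setup below, but the function names are illustrative, not the bet's actual code:

```python
import torch

def magnitude_prune(model: torch.nn.Module, sparsity: float = 0.5) -> float:
    """Zero the smallest-|w| fraction of weights, using one global threshold."""
    flat = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = max(1, int(sparsity * flat.numel()))
    threshold = flat.kthvalue(k).values.item()  # 50th-percentile |w| at sparsity=0.5
    with torch.no_grad():
        for p in model.parameters():
            p.mul_((p.abs() > threshold).to(p.dtype))  # keep only |w| > threshold
    return threshold

def random_prune(model: torch.nn.Module, sparsity: float = 0.5, seed: int = 0) -> None:
    """Zero a uniformly random fraction of weights, with a fixed seed."""
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            keep = torch.rand(p.shape, generator=gen) >= sparsity  # True = survives
            p.mul_(keep.to(p.dtype))
```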
The published evidence (the lottery-ticket hypothesis, magnitude pruning surveys) generally favours magnitude. But the size of that advantage varies by model architecture, training regime, and task. The federation needs to verify the intuition holds for FractalMoE 30M on the eval texts the federation actually cares about.
Hypothesis
Magnitude pruning at 50% sparsity outperforms random pruning at the same sparsity across diverse eval texts.
The "across diverse eval texts" is the disciplined version. A single eval text could be misleading; pruning that helps on one distribution might hurt on another. The bet evaluates on five different texts spanning code, prose, scientific writing, and general web content.
Pre-registered criteria
- STRICT: magnitude < random ppl on ≥ 4 of 5 eval texts.
- LENIENT: ≥ 3 of 5.
- CATASTROPHIC: magnitude no better than random (would falsify the production default — random should never be the right policy).
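Spelled out as a counting rule, the bars reduce to a hypothetical helper like the following (the in-between outcomes of 1-2 wins are labelled plain FAIL here as an assumption, since the criteria don't name them):

```python
def verdict(mag_ppl: dict[str, float], rand_ppl: dict[str, float]) -> str:
    """Score the pre-registered bars: a win is a text where magnitude beats random."""
    wins = sum(mag_ppl[t] < rand_ppl[t] for t in mag_ppl)
    if wins >= 4:
        return "STRICT"        # magnitude < random on >= 4 of 5 texts
    if wins >= 3:
        return "LENIENT"
    if wins == 0:
        return "CATASTROPHIC"  # magnitude no better than random anywhere
    return "FAIL"              # assumption: 1-2 wins meets neither bar
```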
Setup
- Model: FractalMoE 30M base, no adapter, no quantisation. Apply pruning to the bare base model.
- Magnitude prune: compute `abs(w)` for every weight, find the 50th-percentile threshold, zero all weights with `abs(w) ≤ threshold`. The threshold is computed globally (across all weight matrices) rather than per-layer, which is the simpler and more aggressive policy.
- Random prune: uniformly sample 50% of weights, zero them. Use a fixed seed for reproducibility.
- Evaluate the pruned model on five eval texts: Common Crawl (general web), GitHub code, StackExchange (Q&A prose), arXiv (scientific writing), Wikipedia (encyclopedic prose).
- 1000 tokens of held-out text per eval; standard perplexity computation.
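"Standard perplexity computation" here means exponentiated mean next-token cross-entropy over the 1000 held-out tokens. A sketch, assuming the model is a callable returning raw logits of shape [batch, seq, vocab] (the real harness's interface may differ):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model: torch.nn.Module, token_ids: torch.Tensor) -> float:
    """exp(mean cross-entropy) of next-token predictions, teacher-forced."""
    inputs = token_ids[:-1].unsqueeze(0)   # [1, T-1]
    targets = token_ids[1:].unsqueeze(0)   # [1, T-1], shifted one position
    logits = model(inputs)                 # assumed [1, T-1, vocab]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return math.exp(loss.item())
```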
Result — STRICT 5/5 PASS
| Eval text | Magnitude ppl | Random ppl | Ratio |
|---|---|---|---|
| Common Crawl | 41 | 6,840 | 167× |
| GitHub code | 88 | 19,200 | 218× |
| StackExchange | 127 | 18,400 | 145× |
| arXiv | 58 | 71,500 | 1233× |
| Wikipedia | 73 | 124,000 | 1700× |
Magnitude pruning outperforms random pruning by 145× to 1700× across all five eval texts. Magnitude is the right default; random pruning is essentially destructive. STRICT passes 5/5.
The variation in the ratio (167× on Common Crawl vs 1700× on Wikipedia) is itself interesting — random pruning destroys the model more on some text distributions than others. Wikipedia (highly structured prose) and arXiv (technical writing with consistent vocabulary) are the most-destroyed; Common Crawl (heterogeneous web text) and code (structurally regular) are the least-destroyed. The pattern suggests random pruning hurts most where the model relies on fine-grained vocabulary and grammar, less where it relies on coarse pattern-matching.
Why magnitude wins — the load-bearing-weight picture
A trained transformer has weights that span many orders of magnitude in absolute value. The largest-magnitude weights are the ones that have been pushed by training to encode specific patterns — they're load-bearing for the model's function. The smallest-magnitude weights are closer to their initialisation; they encode less specific information and are closer to "noise" relative to the model's task.
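The orders-of-magnitude claim is easy to verify directly on any checkpoint; a quick diagnostic sketch (illustrative, not part of the bet's harness):

```python
import torch

def magnitude_spread(model: torch.nn.Module) -> None:
    """Print |w| at a few percentiles to show the spread of weight magnitudes."""
    w = torch.cat([p.detach().abs().flatten() for p in model.parameters()]).sort().values
    for q in (0.01, 0.10, 0.50, 0.90, 0.99):
        print(f"p{int(q * 100):02d} |w| = {w[int(q * (w.numel() - 1))].item():.2e}")
```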
Zeroing the smallest 50% removes the noise-floor weights. The model loses some capacity but retains its load-bearing structure. The function it computes is approximately the same, and perplexity changes only modestly (on Wikipedia it actually improves, from ~89 for the base to ~73 pruned, which is unusual but possible when pruning incidentally removes some overfit noise).
Zeroing 50% randomly cuts through the load-bearing structure indiscriminately. About half of the load-bearing weights survive; about half don't. The model has lost half its specific learned patterns; perplexity blows up.
The 145×-1700× ratio is the size of this asymmetry. It's not a small effect. The bet's CATASTROPHIC bar (magnitude no better than random) was set conservatively; the actual result is overwhelming.
What this rules out
- The "all weights are equally important" prior. They aren't. The bet rules out random pruning as a sensible policy — the model's parameters are not uniformly important.
- Random pruning as a baseline for compression studies. Many compression-method comparisons use random pruning as a "no-effort" baseline. The bet shows that 50% random pruning is so destructive it is uninformative as a baseline; in any practical sense it is worse than not compressing at all.
- Any claim that unstructured magnitude pruning is a controversial choice. It isn't. The catalogue can use magnitude as a default for any pruning operation without further justification.
What this does not validate
- Magnitude pruning at higher sparsity (75%, 90%). The bet tested 50%. Higher sparsity may have different failure modes. The lottery-ticket literature suggests magnitude works at 80-90% with retraining; without retraining, 50% is the safe default.
- Magnitude pruning composed with other compressions. Bet 13 (ternary PTQ) and Bet 49 (norm-only adapter) are the production-default compressions. The composition of magnitude-pruning + ternary + norm-only is untested. Bet 52 shows that ternary + norm-only composes; whether magnitude-pruning fits in the stack is open.
- Magnitude pruning's effect on per-user adapter quality. The bet pruned the base model; per-user adapters are separate. Whether a pruned base degrades the value of per-user adapters is open. Plausibly fine, since the adapters operate on the residual stream rather than directly on the pruned weights, but unverified.
Where this fits in the wire format
The federation's production-default wire format is ternary base + norm-only adapter (Bets 13, 49, 52). Magnitude pruning is not in the default stack. It's available as a backup compression for deployments where the default isn't aggressive enough.
Concretely:
- Default deployment (broadband, modern hardware): ternary base (≈6 MB for 30M params) + raw norm-only adapter per user (9 KB). No magnitude pruning. Bandwidth comfortable.
- Constrained deployment (mobile, slow network): consider magnitude pruning at 50% on top of ternary, reducing the base to ~3 MB at the cost of additional perplexity. Per-deployment bet to validate the composition; the size arithmetic is sketched after this list.
- Severely-constrained deployment: consider higher sparsity (75%+) magnitude pruning, expect to also need adapter retraining to recover.
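The sizes in the tiers above follow from back-of-envelope arithmetic; a hedged sketch that ignores sparsity-index overhead, which would claw back part of the pruning saving:

```python
# Assumptions: 30M parameters, ideal entropy coding of ternary symbols,
# and 50% pruning modelled as simply halving the payload.
PARAMS = 30e6
TERNARY_BITS = 1.58  # ~log2(3) bits per weight in {-1, 0, +1}

base_mb = PARAMS * TERNARY_BITS / 8 / 1e6  # ~5.9 MB ternary base
pruned_mb = base_mb * 0.5                  # ~3.0 MB with 50% magnitude pruning
print(f"ternary base ~{base_mb:.1f} MB; + 50% pruning ~{pruned_mb:.1f} MB")
```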
The bet's contribution is that magnitude is the right policy if pruning is applied. The federation doesn't apply it by default because the wire format is already cheap enough; it's available if needed.
Run command
```
PYTHONPATH=src python -m experiments.bets.48_magnitude_pruning
```
Output: experiments/bets/results/48_magnitude_pruning.json records the per-eval-text perplexities for both pruning policies, the 50th-percentile threshold for magnitude pruning, and the actual sparsity achieved (close to 50% but not exact, since some weights are exactly at the threshold).
Related entries
- Bet 13: 1.58-bit ternary PTQ. Different compression dimension (precision vs sparsity); both compress within layers.
- Bet 38: expert collapse. Different compression strategy that fails — averaging across experts. The lesson: compress within structures, not across them.
- Bets 32 and 34: lottery-ticket-lite and magnitude pruning ablations. Earlier related bets; this one is the production-relevant version.
- Bet 52: ternary base + norm-only adapter composition. The production wire format.
Why it matters
Magnitude pruning is the cheapest cost-cutter in the catalogue. It's a one-liner (drop the smallest 50% of weights by absolute value) and it passes STRICT 5/5 on diverse eval texts. For deployment scenarios where the wire format is bandwidth-constrained beyond what ternary + norm-only delivers, magnitude pruning is the next compression to apply. The bet establishes that it's safe to do so.
The methodological lesson is the value of multi-text evaluation. A single-text comparison would have shown a strong magnitude advantage, but the multi-text evaluation reveals the consistency of the advantage across distributions — it's not a single-text artifact. The federation can rely on the result for any text distribution that resembles the five tested. The catalogue's discipline of evaluating on diverse held-out texts pays off here as a robustness guarantee for the conclusion.
The bigger picture: the catalogue contains a small number of compressions that work (ternary, norm-only adapter, magnitude pruning) and a small number that fail (expert averaging, weight-soup, layer-skipping). The pattern is clear — compress within structures, don't drop or merge structures. Magnitude pruning fits this pattern (it compresses within layers without dropping any layer); the failed compressions all violate it. The wire-format chapter's design decisions follow this pattern, and Bet 48 is a load-bearing piece of the evidence.