Bet 38 — Single-expert collapse (FALSIFIED, irrecoverable)

A FractalMoE 30M base has 16 experts per layer. The bet asks the obvious compression question: can we collapse all 16 experts at each layer into a single averaged expert, producing a 16× smaller model that still works? The answer is the most decisive no in the catalogue. The collapsed model produces 520× higher perplexity than the original and generates 2 unique tokens in a 100-token sample. It's not "a bit worse" — it's catastrophic mode collapse. The follow-up bet (Bet 39) shows that 200 steps of recovery training don't fix it. Expert averaging is irrecoverable.

This bet matters because the federation's wire-format and bandwidth strategy live or die on questions like "can we ship N-1 experts?" or "can we send a smaller version of the MoE?" If expert averaging worked, there would be a 16× compression ratio available cheaply, and the federation's bandwidth story would be very different. Because it doesn't work — and demonstrably can't be recovered — the federation is locked into shipping all 16 experts, and compression has to come from elsewhere (per-expert quantisation, magnitude pruning within experts, routing-aware prefetching).

Background — what FractalMoE's experts are, and how the router works

A standard transformer block has a single feedforward network (FFN). A mixture-of-experts (MoE) block replaces that single FFN with N experts plus a router. For each token's representation in the residual stream, the router computes a sparse selection of experts (typically top-1 or top-2), runs only those experts, and combines their outputs weighted by router probabilities.

In FractalMoE 30M, each transformer block has 16 experts. The router is a small linear layer that produces 16 probabilities per token; the top-1 expert is selected. During training, the router learns which expert to dispatch each token to, and each expert specialises in handling its routed slice of the input distribution.
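For concreteness, a minimal top-1 MoE block in PyTorch might look like the sketch below. The class name, FFN shape, and ffn_dim are illustrative assumptions, not the FractalMoE source; only hidden_dim=256 and 16-expert top-1 routing come from the bet's setup.

```python
import torch
import torch.nn as nn

class Top1MoEBlock(nn.Module):
    """Illustrative top-1 mixture-of-experts FFN (not the FractalMoE code)."""

    def __init__(self, hidden_dim: int = 256, num_experts: int = 16, ffn_dim: int = 1024):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)    # 16 logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden_dim) -- token representations from the residual stream
        probs = torch.softmax(self.router(x), dim=-1)        # (tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)                # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                              # tokens routed to expert e
            if mask.any():
                # run only the selected expert, weight by router probability
                out[mask] = top_prob[mask, None] * expert(x[mask])
        return out
```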

The intuition behind expert averaging is that if all 16 experts were "approximately the same function" — slight variations of a single FFN — then averaging them would produce something close to that single FFN, and the model would still work (with a 16× smaller MoE). The bet tests this empirically.

Hypothesis

Averaging the 16 experts at each layer produces a single-expert model with held-out perplexity within 3× of the original 16-expert MoE.

The 3× LENIENT bar was chosen as a "useful" threshold — a 3× perplexity hit is significant but recoverable in many compression contexts. If averaging produced a 3×-worse model, that would still be interesting (a tradeoff between size and quality). What we found is far worse than 3×.

Pre-registered criteria

  • STRICT: averaged model ppl ≤ 1.5× original ppl.
  • LENIENT: ≤ 3× original ppl.
  • CATASTROPHIC: > 10× ppl blowup OR fewer than 5 unique tokens in a 100-token sample (mode collapse signal).

The mode-collapse signal is included because perplexity alone can be misleading for catastrophic failures. A model that always outputs a single token can have a high perplexity but the deeper pathology is that it's not generating diverse text at all. Counting unique tokens in a sample catches this.
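A hedged sketch of how the joint criterion can be evaluated (the function name and return labels are illustrative; the thresholds are the pre-registered ones above, and the actual bet script may structure this differently):

```python
def classify_outcome(ppl_original: float, ppl_averaged: float,
                     sample_token_ids: list[int]) -> str:
    """Apply Bet 38's pre-registered criteria to an eval result (illustrative)."""
    ratio = ppl_averaged / ppl_original
    unique_tokens = len(set(sample_token_ids))   # mode-collapse signal
    if ratio > 10 or unique_tokens < 5:          # either signal alone fires CATASTROPHIC
        return "CATASTROPHIC"
    if ratio <= 1.5:
        return "STRICT"
    if ratio <= 3:
        return "LENIENT"
    return "FALSIFIED (between 3x and 10x)"

# With the observed numbers: 46_280 / 89 ≈ 520x, and only 2 distinct token ids
# in the 100-token greedy sample -> CATASTROPHIC on both signals.
```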

Setup

  • Model: FractalMoE 30M base (4 transformer blocks, 16 experts per block, hidden_dim=256).
  • Averaging: for each block, compute expert_avg = mean(expert_1, expert_2, ..., expert_16) elementwise across all expert weight matrices, then replace the 16 experts with a single copy of expert_avg. The router is left in place but becomes inert: with only one expert to route to, its outputs no longer affect anything. A minimal sketch of the averaging step follows this list.
  • Eval: held-out general text, 1000 tokens. Compute perplexity.
  • Sample: 100-token greedy generation from a fixed prompt. Count unique tokens in the output.
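A minimal sketch of the averaging step (module layout and names are assumptions; the bet's script may differ):

```python
import torch

def average_experts(block_experts) -> dict[str, torch.Tensor]:
    """Elementwise mean of one block's expert weights (illustrative).

    Assumes all experts share the same architecture, so their state_dicts
    have identical keys and shapes.
    """
    state_dicts = [expert.state_dict() for expert in block_experts]
    return {
        key: torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Per block: load the averaged weights into one expert and drop the other 15.
# The router stays in place but is inert with a single expert.
# avg_sd = average_experts(block.experts)
# block.experts[0].load_state_dict(avg_sd)
# block.experts = torch.nn.ModuleList([block.experts[0]])
```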

Result — CATASTROPHIC

| Metric | Original (16 experts) | Averaged (1 expert) | Ratio |
|---|---|---|---|
| Held-out perplexity | 89 | 46,280 | 520× |
| Unique tokens in 100-tok sample | 73 | 2 | (mode collapse) |

The averaged model's perplexity is 520× the original. The 100-token sample contains exactly 2 unique tokens — the model has collapsed to outputting a near-degenerate distribution on a tiny vocabulary. This is total mode collapse, not graceful degradation.

CATASTROPHIC fires on both criteria: > 10× perplexity blowup (520× actual) AND < 5 unique tokens in sample (2 actual). Two independent signals confirm the failure.

Why it failed — the geometry of expert specialisation

MoE expert diversity is real, not nominal. The 16 experts at each layer are not 16 redundant copies of a generalist function. They are 16 specialised functions that the router has trained to dispatch inputs to selectively.

The geometry: each expert's weight matrix lives in a different region of the weight landscape, optimal for handling its own subset of inputs (its "routing slice"). The 16 weight matrices are 16 distinct points in weight space; their average is the centroid of the 16-point constellation, not a point on any of the constellation's vertices.

The centroid is generally not on the manifold of useful FFNs. The manifold of FFNs that compute a sensible function is a curved surface in weight space; the constellation's vertices lie on it (each expert is on the manifold), but the centroid of 16 vertices on a curved manifold is generally off the manifold. The averaged expert sits "between" all 16 specialisations, which means it computes none of them; it behaves like a generalist that does everything badly.
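A toy illustration of that geometry (entirely synthetic, not from the bet): fit two small MLPs to two different target functions, average their weights, and check that the averaged network fits neither. The shapes, targets, and step counts are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp():
    return nn.Sequential(nn.Linear(4, 32), nn.GELU(), nn.Linear(32, 4))

x = torch.randn(512, 4)
tasks = {"task A": torch.sin(3 * x), "task B": -x}
experts = {}

# Train one "specialist" per task.
for name, target in tasks.items():
    net = make_mlp()
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(300):
        loss = ((net(x) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    experts[name] = net

# Weight-space centroid of the two specialists.
avg = make_mlp()
avg.load_state_dict({
    k: (experts["task A"].state_dict()[k] + experts["task B"].state_dict()[k]) / 2
    for k in avg.state_dict()
})

with torch.no_grad():
    for name, target in tasks.items():
        specialist_mse = ((experts[name](x) - target) ** 2).mean().item()
        centroid_mse = ((avg(x) - target) ** 2).mean().item()
        print(f"{name}: specialist MSE {specialist_mse:.4f}, averaged MSE {centroid_mse:.4f}")
# The averaged network is typically far worse on both tasks: the mean of two
# points in weight space is not the mean of the functions they compute.
```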

The mode collapse manifests because the averaged FFN, when applied repeatedly (once per layer, 4 times in this model), composes its badness multiplicatively. Each layer's averaged FFN slightly distorts the residual stream away from any meaningful direction; after 4 layers of compounding distortion, the output projection is reading from a residual stream that's been pulled toward the centroid of the expert space, which corresponds to a near-degenerate output distribution. The greedy sampler picks the same token repeatedly because the distribution has collapsed onto one or two tokens.

Bet 39 — the recovery attempt, also failed

Maybe the collapsed model just needs a little training to recover. Bet 39 tested this: take the averaged model, train it for 200 steps on real text (lr 5e-5, AdamW), measure recovery.
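A sketch of that recovery loop as described (the data loader and model interfaces are assumptions; only the step count, optimiser, and learning rate come from the bet):

```python
import torch
import torch.nn.functional as F

def recovery_train(collapsed_model, batches, steps: int = 200, lr: float = 5e-5):
    """200 AdamW steps on real text, per Bet 39 (interfaces illustrative)."""
    opt = torch.optim.AdamW(collapsed_model.parameters(), lr=lr)
    collapsed_model.train()
    for _, (input_ids, labels) in zip(range(steps), batches):
        logits = collapsed_model(input_ids)                  # (batch, seq, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return collapsed_model
```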

Result: even after 200 recovery steps, the perplexity stays at > 3× the original (specifically, ~280 vs 89 baseline). The collapsed state is a deep local minimum that further training does not escape within a reasonable training budget. The averaged model has become a different model — not just a perturbation of the original.

This is consistent with the loss-basin picture: averaging the experts moved the model into a basin (or saddle) that is geometrically far from the original 16-expert basin. Recovery would require traversing back to the original basin, which gradient descent can't do efficiently because the path crosses high-loss regions.

The two-bet evidence (Bet 38: collapse is total; Bet 39: it's irrecoverable) closes the door on expert averaging as a federation compression strategy.

What this means for the federation

Three concrete consequences:

  1. The federation cannot ship N−1 experts, or a "compressed" MoE built by expert averaging. Every expert is load-bearing. The wire format must preserve all 16 experts at every layer. The federation's bandwidth strategy has to compress within experts, not across them.

  2. Per-expert compression is the right strategy. Bet 13 (1.58-bit ternary) and Bet 48 (50% magnitude pruning) compress each expert individually without reducing the count. These compressions preserve the constellation structure: each compressed expert is still a specialised function, just with fewer bits per parameter. The compression ratios are smaller than expert averaging would have been (3-8× rather than 16×), but they actually work. A minimal within-expert pruning sketch follows this list.

  3. Routing-aware prefetching is an open avenue. A scheduler that knows which experts are hot for a given user could prefetch only those experts to a node, avoiding the cost of shipping all 16. This is a protocol-level optimisation (the wire format still represents 16 experts; the scheduler chooses to materialise a subset), and it requires accurate routing prediction. Untested. The FractalMoE 30M router is small enough that running it ahead of time to predict which experts will be needed is conceivable; whether the prediction is accurate enough to skip materialising the rest is open.
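The within-expert alternative, in the spirit of Bet 48's 50% magnitude pruning (this is not the bet's implementation, and the module names in the usage comment are assumptions):

```python
import torch

def prune_expert_inplace(expert: torch.nn.Module, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights of a single expert (illustrative)."""
    with torch.no_grad():
        for param in expert.parameters():
            if param.dim() < 2:                  # skip biases / norms
                continue
            k = int(param.numel() * sparsity)
            if k == 0:
                continue
            threshold = param.abs().flatten().kthvalue(k).values
            param.mul_((param.abs() > threshold).to(param.dtype))

# Applied per expert, per block, so the 16-expert constellation is preserved;
# only the bits spent on each expert shrink.
# for block in model.blocks:
#     for expert in block.experts:
#         prune_expert_inplace(expert, sparsity=0.5)
```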

What this does not claim

  • Expert pruning may behave differently. The bet tested averaging, not pruning. A different question, "drop whole experts by weight magnitude or activation rate", is not tested here. It likely also fails (each expert handles its routing slice, and dropping any of them leaves that slice unhandled), but a separate bet would be needed to confirm.
  • Larger MoEs may have more redundancy. At 30M scale with 16 experts, the experts are highly diverse — there's no slack. At 70B+ scale with 256 experts, some experts may be more redundant than others; expert pruning at scale is a known optimisation. The federation operates at small scale; the result at large scale is open.
  • Distillation may recover. Training a 1-expert model from scratch with the 16-expert model as teacher (rather than averaging the experts) might produce a usable single-expert model. This is a different question and isn't tested here. It's plausible but expensive.

Run command

PYTHONPATH=src python -m experiments.bets.38_expert_collapse

Output: experiments/bets/results/38_expert_collapse.json records the original and averaged ppls, the unique-token count for the sample, the per-block expert variance (a diagnostic showing how diverse the experts were before averaging), and the held-out text identifier.
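The per-block expert variance can be computed along these lines (a hedged sketch; the exact definition used in the JSON may differ):

```python
import torch

def expert_weight_variance(block_experts) -> float:
    """Mean per-parameter variance across one block's experts (illustrative).

    Near-zero would mean the experts are almost copies and averaging would be
    harmless; large values mean the constellation is genuinely diverse.
    """
    flat_experts = torch.stack([
        torch.cat([p.detach().flatten() for p in expert.parameters()])
        for expert in block_experts
    ])                                            # (num_experts, params_per_expert)
    return flat_experts.var(dim=0).mean().item()
```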

Related bets

  • Bet 39: recovery training on the collapsed model. Confirms irrecoverability.
  • Bet 13: 1.58-bit ternary quantisation. Within-expert compression that does work.
  • Bet 48: magnitude pruning. 50% sparsity within experts that does work.
  • Bet 12: keyword-based expert prefetching. The protocol-level alternative to compression.

Why it stays in the catalogue

A reader proposing "compress MoE by averaging experts" must encounter this falsification. The proposal is intuitive — and the 16× compression ratio is large enough to be tempting — but two lines of evidence rule it out: expert averaging produces total mode collapse, and the collapsed state can't be recovered with reasonable training.

The catalogue keeps the falsification as a load-bearing entry. It's the answer to a recurring question, and the data is decisive enough that the answer is durable. New federation contributors who propose expert compression are pointed at Bet 38 and Bet 39 first.

The methodological lesson is also useful: mode collapse can hide behind a perplexity number. The 520× perplexity ratio is a strong signal, but the deeper signal is "2 unique tokens in 100 tokens of output" — the model isn't generating language anymore; it's emitting a degenerate distribution. The bet's joint criterion (perplexity AND unique-token count) is the right shape for catching catastrophic failures that perplexity alone might understate.