Bet 40 — Skip-each-layer ablation (FALSIFIED)
For each transformer layer in turn, replace it with the identity function (skip it entirely) and measure perplexity on a held-out text. The bet's hypothesis is that at least one of the four layers in FractalMoE 30M is "approximately redundant" — its contribution is small enough that the model survives skipping it. The result falsifies the hypothesis: every layer, when skipped, produces a perplexity blowup of at least 11×. There is no skippable layer at this scale.
This bet matters because the federation's bandwidth strategy and fault-tolerance design depend on whether layer dropping is an option. If one layer were skippable, the federation could route around a missing or slow worker by simply omitting that layer's work. Because no layer is skippable, the federation must guarantee every layer's computation completes — fault-tolerance becomes a recovery problem (re-route the layer's work to a healthy worker) rather than a graceful-degradation problem (skip it and accept slightly lower quality).
Background — what layer-skipping would buy
The federation's pipeline-of-workers model is sensitive to per-worker reliability. If a worker dies during inference, the layer it was assigned goes uncomputed — the residual stream has to be passed through an alternative worker for that layer, or the inference has to abort. The cost of recovery is proportional to how strict the layer-completion requirement is.
If layers were skippable, recovery would be trivial: a worker dies, skip the layer, accept some perplexity hit, continue inference. The federation's tolerance for worker failures would be high. Throttle-aware scheduling (Bet 45) would be unnecessary if you could just skip a slow worker's layer.
The ML literature has examples of layer-skipping working at large scale. Early-exit transformers skip later layers when an early-exit classifier is confident. LayerDrop trains transformers to be robust to random layer skipping at inference time. Both demonstrate that layer-skipping is a real optimisation at certain scales and training regimes.
The bet asks: does the federation's regime — a small (30M) base, no LayerDrop training — admit any skippable layers?
Hypothesis
For at least one layer index ∈ {0, 1, 2, 3}, skipping that layer (replacing it with identity) increases perplexity by at most 3× on a held-out text.
The 3× LENIENT bar is the threshold for a layer being "soft-skippable": a measurable but tolerable hit. The 1.5× STRICT bar is for a layer being effectively redundant.
Pre-registered criteria
- STRICT: at least one layer skippable with ≤ 1.5× ppl hit.
- LENIENT: ≤ 3× hit on at least one layer.
- CATASTROPHIC: every layer skipped produces ≥ 10× hit (would falsify any "drop a layer" optimisation at this scale).
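As a small illustration, the three bars reduce to thresholds on the best (lowest) per-layer perplexity ratio — a minimal sketch, with the function name and dict shape assumed here rather than taken from the experiment code:

```python
def classify_outcome(ratios):
    """Map per-layer perplexity ratios (skipped / baseline) onto the pre-registered bars."""
    best = min(ratios.values())            # ratio for the most skippable layer
    if best <= 1.5:
        return "STRICT"                    # some layer is effectively redundant
    if best <= 3.0:
        return "LENIENT"                   # some layer is soft-skippable
    if best >= 10.0:
        return "CATASTROPHIC"              # every skipped layer costs >= 10x
    return "falsified, non-catastrophic"   # no layer skippable, but some hit under 10x
```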
Setup
- Model: FractalMoE 30M, 4 transformer blocks (block indices 0-3).
- For each block index k ∈ {0, 1, 2, 3}: replace block k's forward function with `lambda x: x` (identity), keep all other blocks intact, measure perplexity on a 1000-token held-out text.
- Baseline: no blocks skipped, same eval text.
This is the simplest possible layer-skip experiment. It doesn't try LayerDrop-style training-time augmentation, doesn't try to recover from the skip with fine-tuning, doesn't try early-exit with calibrated confidence. The bet just asks whether any single layer's contribution is small enough to drop on its own.
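A minimal sketch of that loop, assuming a PyTorch-style model that exposes its transformer blocks as `model.blocks` and returns logits directly from `model(...)`; the helper names are illustrative, not the actual `experiments.bets.40_skip_layer` code:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, tokens):
    """Mean next-token cross-entropy on the held-out token sequence, exponentiated."""
    logits = model(tokens[:-1].unsqueeze(0)).squeeze(0)   # [T-1, vocab]
    loss = F.cross_entropy(logits, tokens[1:], reduction="mean")
    return math.exp(loss.item())

def skip_each_layer(model, tokens):
    """Replace each block with the identity in turn and record the perplexity ratio."""
    baseline = perplexity(model, tokens)
    ratios = {}
    for k, block in enumerate(model.blocks):
        original_forward = block.forward
        block.forward = lambda x, *args, **kwargs: x       # identity: stream passes through untouched
        ratios[k] = perplexity(model, tokens) / baseline
        block.forward = original_forward                   # restore before ablating the next block
    return baseline, ratios
```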
Result — CATASTROPHIC
| Skipped block | Perplexity | Ratio vs baseline |
|---|---|---|
| (none, baseline) | 89 | 1.0× |
| Block 0 | 7,120 | 80× |
| Block 1 | 2,134 | 24× |
| Block 2 | 1,156 | 13× |
| Block 3 | 980 | 11× |
Every layer produces ≥ 10× perplexity blowup when skipped. The smallest hit (block 3, the last block) is 11×; the largest (block 0, the first block) is 80×. There is no layer whose contribution is small enough to drop.
The pattern is monotonic: earlier layers are more critical than later layers. Block 0 (the first layer, immediately after embeddings) is the most damaging to skip — its output sets up the representation for all subsequent blocks, and skipping it means the rest of the model is computing on raw token embeddings that haven't been mixed by attention. Block 3 (the last layer) is the least damaging to skip, but even its 11× hit is far beyond the LENIENT bar.
Why it failed — the parameter floor argument
A 30M-parameter model is operating close to the parameter floor for the language-modelling task. The model has just enough capacity to fit the language distribution; there's no slack. Each of the 4 blocks contributes a non-trivial transformation that all subsequent blocks depend on.
The geometry: the residual stream's representation evolves through the blocks. Block 0 introduces attention-mixed information from neighbouring tokens. Block 1 builds on that to introduce higher-order interactions. Blocks 2 and 3 refine further. Removing any block leaves a hole in the chain — the next block expects an input distribution that the missing block was supposed to produce, and the input it actually receives is different in non-trivial ways.
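In residual-stream terms (notation ours, not the catalogue's: f_k is block k's learned update), each block adds its update to the running stream, and skipping a block hands the next block an unmodified stream it was never trained to receive:

```latex
x_{k+1} = x_k + f_k(x_k) \quad \text{(block $k$ present)}
\qquad
x_{k+1} = x_k \quad \text{(block $k$ skipped)}
```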
At larger scale (70B+ parameters), there's substantial parametric slack. Some layers are in fact more redundant than others; LayerDrop and early-exit exploit this. But that's a scale-dependent property: the redundancy emerges as the model has more parameters than it needs for the task. At 30M, there's no redundancy to exploit.
The first-layer-most-critical pattern is also consistent with this picture. The earliest layers do the heaviest lifting in terms of converting raw embeddings into a richer representation. Later layers refine; they have something to refine. Skipping block 0 leaves block 1 with raw embeddings as input, which is not what block 1 was trained to handle.
What this means for the federation
Three concrete consequences:
- The federation cannot drop a layer when a worker fails. The pipeline-recovery protocol must guarantee every layer's computation completes. If a worker dies mid-inference, the federation either re-routes that layer's work to a healthy worker (the chosen approach in Bet 41-45; a minimal sketch follows this list) or aborts the inference. Skipping the failed layer is not an option.
- Layer-level fault tolerance has to be designed for recovery, not graceful degradation. Bet 45 (throttle-invariance) confirms the scheduler tolerates slow workers without dropping their layers. Bet 44 (Byzantine aggregation) confirms a worker that returns wrong results gets its layer re-computed. Both are recovery mechanisms, not skipping mechanisms.
- Layer-level compression is harder than expected. Quantising within a layer (Bet 13: 1.58-bit ternary) works. Skipping a layer entirely doesn't. The wire-format defaults reflect this — compress within layers, don't drop them. The federation's bandwidth budget assumes every layer's weights are shipped.
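A minimal sketch of the re-route-on-failure rule from the first consequence above, assuming hypothetical `WorkerPool` and worker RPC helpers — an illustration of recovery-not-skip, not the Bet 41-45 protocol code:

```python
class WorkerFailure(Exception):
    """A worker died or timed out while computing its layer."""

class InferenceAborted(Exception):
    """No healthy worker could complete a required layer."""

def run_layer(pool, layer_idx, activations):
    """Every layer must complete: re-route to a healthy worker, never skip (Bet 40)."""
    for worker in pool.workers_for(layer_idx):        # primary first, then fallbacks
        try:
            return worker.forward_layer(layer_idx, activations)
        except WorkerFailure:
            continue                                  # recovery path: try the next healthy worker
    # No graceful degradation: at 30M scale, skipping the layer is not an option.
    raise InferenceAborted(f"no healthy worker could run layer {layer_idx}")
```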
What this leaves open
- Layer-skipping at 1B+ scale. At larger scales, layer-skipping becomes a known optimisation. A 70B model has more parametric slack than a 30M model; some layers may be skippable. The federation's deployment scale is base 30M + per-user adapters; the at-scale layer-skipping question is open but not load-bearing for current deployment.
- LayerDrop-style training. If the base model were trained with LayerDrop augmentation (random layer dropping at training time), the resulting model might be more robust to skipping at inference. This is a substantial training-pipeline change and isn't part of the federation's current base; it's a candidate for future federation versions if layer-skipping becomes important (the augmentation is sketched after this list).
- Early-exit with confidence calibration. A model that skips later layers when an early-exit classifier is confident about the prediction could save inference cost. This is a separate question (it's about adaptive computation depth, not fault tolerance) and the federation hasn't tested it. Plausibly works at scale; uncertain at 30M.
- Fault-tolerant architectures. A model architecture designed with skippable paths (e.g. parallel branches that vote on the residual update) could be inherently fault-tolerant. This is a research direction, not a current federation feature.
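For the LayerDrop item above, the augmentation itself is small — a sketch of random block dropping in the training forward pass, in the spirit of Fan et al.'s LayerDrop; the `blocks` attribute and drop rate are assumptions, not current federation training code:

```python
import torch

def forward_with_layerdrop(model, x, p_drop=0.1, training=True):
    """Randomly skip whole blocks during training so the model learns to tolerate missing layers."""
    for block in model.blocks:
        if training and torch.rand(()).item() < p_drop:
            continue                 # drop this block's residual update for this forward pass
        x = block(x)
    return x
```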
Run command
`PYTHONPATH=src python -m experiments.bets.40_skip_layer`
Output: `experiments/bets/results/40_skip_layer.json` records the per-block perplexity when that block is skipped, the baseline perplexity, the eval text identifier, and the per-token loss profile (which token positions degrade most when each block is skipped — useful for understanding which blocks handle which kinds of context).
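An illustrative way to read that artefact — the key names below are our guess at the layout from the description above, not a dump of the real file:

```python
import json

with open("experiments/bets/results/40_skip_layer.json") as f:
    result = json.load(f)

# Hypothetical layout, inferred from the prose description (not verified against the artefact):
# result["baseline_ppl"]                -> perplexity with no blocks skipped
# result["eval_text"]                   -> identifier of the held-out text
# result["skipped"]["0"]["ppl"]         -> perplexity with block 0 replaced by the identity
# result["skipped"]["0"]["per_token"]   -> per-token loss profile for that ablation
```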
Related entries
- Bet 13: 1.58-bit ternary quantisation. Within-layer compression that works.
- Bet 38: expert collapse. Within-MoE compression that fails. The shared lesson: at this scale, every learned component is load-bearing.
- Bet 41-45: pipeline recovery and throttle-invariance. The fault-tolerance design that follows from this bet's falsification.
- Bet 48: magnitude pruning within layers. The complementary compression strategy.
Why it stays in the catalogue
A reader proposing "drop a layer to save bandwidth" or "skip a layer when a worker fails" must encounter this falsification first. The proposal is intuitive — large transformers do have skippable layers, and the ML literature has working examples — but at the federation's target scale (30M base) the proposal doesn't hold. Linking the bet from the catalogue closes the door at the right scale.
The methodological lesson is the scale-dependence of ML optimisations. Many published results have implicit scale preconditions (some only work at large models; some only work at small models). The catalogue's empirical discipline is to validate every optimisation at the federation's deployment scale, not to assume scale-invariance. Bet 40's clean failure at 30M, combined with the literature's evidence at 70B+, is a useful data point for the catalogue's broader thesis: federation primitives need to be tested where the federation actually operates.