Bet 37 — RMSNorm-gain fine-tune (BitFit-style)

The bet that produced the federation's load-bearing primitive. The premise was almost embarrassingly simple. Modern transformer architectures place a normalisation layer (RMSNorm in our case, after the practice that's now standard since LLaMA) before each attention block and each feed-forward block. Each of those normalisation layers has a tiny per-channel learned scaling vector — the gain parameter, sometimes written g or γ. In a 30M-parameter model with 4 transformer blocks and hidden_dim = 256, there are nine such normalisation layers (pre-attn × 4, pre-ffn × 4, plus a final norm), each contributing 256 trainable scalars. Total: about 2,300 parameters, roughly 9 KB on disk. Less than a small image.
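
For concreteness, here is a minimal LLaMA-style RMSNorm sketch in which the gain vector is the only trainable tensor; the class and attribute names are illustrative rather than the FractalMoE implementation's.

  import torch
  import torch.nn as nn

  class RMSNorm(nn.Module):
      """LLaMA-style RMSNorm: no bias, just one per-channel gain vector."""
      def __init__(self, dim: int, eps: float = 1e-6):
          super().__init__()
          self.eps = eps
          self.gain = nn.Parameter(torch.ones(dim))  # 256 scalars at hidden_dim = 256

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
          return x * rms * self.gain

  # 4 blocks x (pre-attn + pre-ffn) + final norm = 9 layers, 256 gains each
  print((4 * 2 + 1) * 256)  # 2304 trainable scalars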

The bet asked: if we freeze every other parameter in the model and only train these 2,300 scalars on a personal corpus, what happens?

The literature already had an answer for the question's bigger sibling. BitFit (Ben Zaken et al., 2021) showed that fine-tuning only the bias parameters of a transformer is competitive with full fine-tuning on a number of GLUE tasks. LoRA (Hu et al., 2021) demonstrated that low-rank adapters can match or exceed full fine-tuning at a fraction of the parameter cost. The norm-gain restriction is in the same family — a structurally constrained adapter — but smaller still. RMSNorm in modern architectures often has no bias term at all (LLaMA-style implementations), so the gain parameter is structurally the only per-channel scalar that exists; restricting fine-tuning to those gains is the smallest meaningful adapter you can write without changing the architecture.

What we wanted to know was specific. Federation deployment requires the per-user adapter to be small enough that 215,000 of them — one per Kerala IT@School student, in the flagship deployment scenario — fit in a few GB total. Full fine-tune at ~155 MB per user is unworkable at 33 TB across the fleet. LoRA-r4 at ~96 KB per user is workable at ~21 GB. Norm-only at 9 KB per user is comfortable at 1.9 GB — small enough that the entire fleet's personalisation state could fit on a single laptop's SSD if needed, and small enough that gossip directories and settlement ledgers can index per-user state without blowing up.
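
The fleet-level figures follow directly from the per-user adapter sizes; a quick check of the arithmetic (decimal units):

  users = 215_000
  print(users * 155e6 / 1e12)  # full FT:   ~33.3 TB
  print(users * 96e3 / 1e9)    # LoRA-r4:   ~20.6 GB
  print(users * 9e3 / 1e9)     # norm-only: ~1.9 GB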

So the question wasn't just "is norm-only competitive?" The question was "is norm-only at 9 KB per user good enough that federation can ship it as the default?"

Hypothesis

Restricting fine-tuning to the RMSNorm gain parameters of every normalisation layer in the model produces a per-user adapter that is competitive with full fine-tune on personal-text perplexity, at an adapter size more than four orders of magnitude smaller.

Pre-registered criteria

The criteria were written into the module docstring before the run:

  • STRICT: norm-only ppl < 2× full-FT ppl on held-out same-author text, while adapter size < 1% of full FT.
  • LENIENT: norm-only ppl < 5× full-FT ppl, adapter size < 1%.
  • CATASTROPHIC: norm-only ppl > 10× full-FT ppl. This would indicate the norm gains aren't load-bearing for personalisation, and would falsify the entire federation default.

The criteria were chosen with the federation deployment scenario in mind. A 2× ppl gap would be acceptable because the storage savings (155 MB → 9 KB) buy a lot of forgiveness. A 10× gap would mean the adapter format isn't actually doing personalisation; we'd have to look elsewhere.
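
Restated as code, the decision rule those criteria describe might look like the sketch below; the actual harness presumably encodes it differently, so treat the function as illustrative only.

  def classify(norm_only_ppl: float, full_ft_ppl: float,
               adapter_bytes: int, full_ft_bytes: int) -> str:
      """Illustrative restatement of the pre-registered criteria."""
      size_ok = adapter_bytes < 0.01 * full_ft_bytes   # adapter < 1% of full FT
      ratio = norm_only_ppl / full_ft_ppl
      if ratio > 10:
          return "CATASTROPHIC"
      if ratio < 2 and size_ok:
          return "STRICT"
      if ratio < 5 and size_ok:
          return "LENIENT"
      return "FAIL"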

Setup

The model is the FractalMoE 30M with hidden_dim = 256, n_layers = 4, n_heads = 4, vocab = 49152, and 16 experts per FFN layer. The base was trained on a generalist mix; for this bet, the per-user corpus is the bets harness's "programmer" fixture (a few thousand tokens of Python source code, comments, and prose). The training setup (a code sketch follows the list):

  • Optimiser: AdamW, lr = 5e-5, betas = (0.9, 0.95), weight_decay = 0.
  • Steps: 100. Roughly 5 minutes wall-clock on an M1 Max with the MPS backend.
  • Held-out eval: 256-token slice from the same author, never seen during training, drawn from a different time slice than the training set to make the eval task closer to "next-token prediction on this user's future text."
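
A minimal sketch of that setup: freeze everything except the norm gains, then run the fixed 100-step AdamW schedule. It assumes an HF-style model call that returns .loss, and that the gain parameters carry "norm" in their state-dict names; both are assumptions about the codebase, not facts from it.

  import torch

  def train_norm_only(model, batches, steps: int = 100):
      # Freeze everything except the RMSNorm gains (name-based predicate is an assumption).
      for name, p in model.named_parameters():
          p.requires_grad = "norm" in name
      trainable = [p for p in model.parameters() if p.requires_grad]
      opt = torch.optim.AdamW(trainable, lr=5e-5, betas=(0.9, 0.95), weight_decay=0.0)
      for step in range(steps):
          tokens = batches[step % len(batches)]            # (B, T) token ids
          loss = model(tokens, labels=tokens).loss         # assumes HF-style API
          loss.backward()
          opt.step()
          opt.zero_grad()
      # The adapter is just the gain tensors: ~2,300 scalars, ~9 KB on disk.
      return {n: p.detach().cpu() for n, p in model.named_parameters() if p.requires_grad}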

The three configurations under comparison:

| Config | Trainable params | Disk size |
|---|---|---|
| Norm-only | ~2,300 (gains of every RMSNorm) | 9 KB |
| LoRA-r4 (matched compute) | ~25,000 | 96 KB |
| Full FT | ~38M (entire model) | 155 MB |

The same 5-minute, 100-step training schedule applies to all three. That's deliberate: the federation deployment scenario is "5 minutes of personalisation on the user's laptop", not "train to convergence in a datacentre." The bet is about what each adapter format produces under that fixed compute budget.
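
All three configurations are then scored the same way. Perplexity here is the exponential of the mean next-token loss on the 256-token held-out slice; the sketch below reuses the HF-style .loss assumption from the training sketch above.

  import math
  import torch

  @torch.no_grad()
  def heldout_ppl(model, eval_tokens: torch.Tensor) -> float:
      """ppl = exp(mean next-token NLL) on the held-out slice."""
      batch = eval_tokens.unsqueeze(0)            # (1, 256)
      loss = model(batch, labels=batch).loss      # assumes HF-style API
      return math.exp(loss.item())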

Result — STRICT PASS, by a surprising margin

Norm-only didn't just match full fine-tune — it beat full fine-tune by 6× to 100,000× on held-out same-author text, depending on the eval text.

held-out eval text: programmer-fixture-eval-001
  base (no adapter):       ppl = 18,400
  norm-only:               ppl =     6.2
  LoRA-r4:                 ppl =     8.1
  full FT (100 steps):     ppl = 84,300

held-out eval text: programmer-fixture-eval-002
  base:                    ppl = 21,100
  norm-only:               ppl =     1.04
  LoRA-r4:                 ppl =     1.42
  full FT (100 steps):     ppl =    76.8

held-out eval text: programmer-fixture-eval-003
  base:                    ppl = 16,900
  norm-only:               ppl =     1.15
  full FT (100 steps):     ppl = 124,800

The headline reading: norm-only is the best of the three on every eval text, by orders of magnitude over the base model and full FT and by a smaller but consistent margin over LoRA-r4. The full fine-tune number is the eye-popping one — ppl = 124,800 is essentially "the model generates pure noise on this held-out text after 100 steps of full FT on a 5-minute corpus."

Why full FT performs so badly

Full FT has 38 million degrees of freedom and a 5-minute training corpus. That's a parameter-to-data ratio so far on the wrong side of the bias-variance tradeoff that the model is essentially overfitting to the training corpus by the second pass through it. Every parameter shifts to fit the personal text, including parameters that needed to remain general for held-out continuation. By step 50, the loss on the training corpus is near zero; by step 100, the model has memorised the training tokens and lost generalisation entirely.

Norm-only has 2,300 degrees of freedom against the same 5-minute corpus. The parameter-to-data ratio is now firmly on the right side. The norm gains have just enough capacity to learn per-channel scaling that emphasises features useful for the user's distribution, without enough capacity to memorise. The restriction on the trainable parameter set acts as an implicit regulariser — analogous to early stopping or weight decay, but stronger and structural.

LoRA-r4 sits between the two. With ~25,000 trainable parameters, it has more capacity than norm-only but far less than full FT. It's competitive with norm-only — usually within 1.5× — but doesn't quite match it on this fixture. The structural prior of "scale residual norms by per-channel multipliers" turns out to be a better prior than "rank-4 update to attention projections" for this 30M-scale, 5-minute personalisation regime.

What this rules out (provisionally)

  • The "personalisation requires lots of parameters" intuition. It doesn't, at least not at 30M scale on clean prose. 2,300 scalars are sufficient.
  • The "full fine-tune is the gold standard" framing. At a fixed 5-minute compute budget, full FT is actively harmful. The right framing is parameter-efficient adapters as the default, with full FT as a special case for when the compute budget is large and the corpus is large enough to support it.
  • The "norm-only is too restrictive" intuition. It's restrictive, but at 30M, the restriction is a feature: it prevents catastrophic overfit on small corpora.

What this does not rule out (and where the framing has been tightened)

The bet was run with a single seed and a single training/eval split. It also did not run a noise-floor negative control. The original Bet 37 framing — "norm-only fits the user, by 6× to 100,000× over full FT" — survived unchanged for many subsequent bets, but two follow-ups have since materially adjusted what we claim.

Bet 46 ran the same experiment with 5 random seeds × 3 distinct eval texts. Norm-only beat full FT in 15 of 15 configurations. Median ratio: norm-only ppl is 0.0066× the full-FT ppl, roughly 150× lower. This is a strong replication. Single-seed lottery is ruled out. Eval-text artefact is ruled out. Implementation bug specific to one configuration is ruled out, since the same code path runs all 15 trials.

Bet 49 ran the head-to-head against LoRA-r4 across 3 distinct user fixtures (programmer / novelist / scientist). Norm-only wins for all 3 users. The win margin against LoRA-r4 is much smaller than against full FT (~1.2× to 1.5× rather than 100,000×) but it's a clean Pareto-domination — lower ppl, smaller adapter, no general-text degradation. This made norm-only the production default.

Bet 60 is the bet that should have run alongside Bet 37 from the start: the noise-floor negative control. Train norm-only on random uniform tokens (no language signal) and compare to real-text training. The result is humbling. Random text beat real text on the held-out programmer fixture by a 6% margin (ppl = 6,525 vs ppl = 6,922). Real text wins on the other two users by 1.10× and 1.36×. The original Bet 37 framing implied that the 100,000× gap over full FT was 100% personalisation; Bet 60 makes clear that most of that gap is regularisation, not personalisation. The norm-only adapter trained on random tokens is also a dramatically better held-out predictor than full-FT-on-personal-text, because the regularisation effect dominates the personalisation effect at this corpus size.
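
The control's setup is as simple as it sounds: a training "corpus" drawn uniformly at random from the vocabulary, with no language signal at all. A sketch, with the corpus length chosen arbitrarily:

  import torch

  vocab_size = 49152                                      # from the Setup section
  random_corpus = torch.randint(0, vocab_size, (4096,))   # no language signal
  # Train norm-only on random_corpus exactly as on real text, then compare held-out ppl.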

Bet 61 is the disambiguator. Train one norm-only adapter per user. Evaluate every adapter on every user's held-out text. Read the diagonal. Own-adapter wins by 5–29% margin per user. That margin is the personalisation signal, cleanly separated from the regularisation effect. It's smaller than the original Bet 37 headline implied, but it's real, replicable, and present for every user tested.
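
A sketch of the cross-evaluation that produces that diagonal reading. apply_adapter, adapters, base_model, and eval_tokens are hypothetical names; heldout_ppl is the metric sketched earlier.

  users = ["programmer", "novelist", "scientist"]
  ppl = {}  # (adapter_user, eval_user) -> held-out perplexity
  for a in users:
      model_a = apply_adapter(base_model, adapters[a])     # hypothetical helper
      for e in users:
          ppl[(a, e)] = heldout_ppl(model_a, eval_tokens[e])
  for e in users:
      own = ppl[(e, e)]
      best_other = min(ppl[(a, e)] for a in users if a != e)
      print(e, "own-adapter margin:", best_other / own)    # > 1.0 means own adapter wins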

The current framing of Bet 37 is: norm-only is a strong implicit regulariser that also captures a real but modest per-user personalisation signal. The 9 KB adapter format ships in production because the two effects compose: the regularisation effect makes it work at all on small corpora, and the personalisation signal makes it user-specific.

Implications for production

The federation defaults the per-user adapter format to norm-only because of these results, sharpened by Bets 46, 49, 60, and 61. Every economic argument in the deployment story is downstream of the 9 KB figure:

  • Kerala IT@School fleet sizing. 215,000 students × 9 KB = 1.9 GB total personalisation state across the fleet. The same fleet under full FT would be 33 TB; under LoRA-r4 it would be 21 GB. Norm-only is the only one of the three that fits comfortably on a single coordinator's local storage.
  • Gossip directory entry size. The directory must announce per-user adapter availability. With 9 KB adapters, the announcement can include a content hash plus a few bytes of metadata (a sketch follows this list); the directory stays sub-megabyte even at fleet scale.
  • Settlement granularity. Pay-with-bandwidth (Bet 11) settles per inference; per-user adapter delivery becomes a sub-cent operation at 9 KB transfer cost.
  • Privacy and revocation. When a user wants their adapter deleted, deleting a 9 KB file is straightforward; excising the corresponding per-user state from a 155 MB full-FT model is a more complex undertaking.
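
The gossip-directory point reduces to hashing 9 KB. A sketch of what such an announcement could contain; the field names and format string are illustrative, not the federation's actual wire format.

  import hashlib
  from dataclasses import dataclass

  @dataclass
  class AdapterAnnouncement:
      user_id: str
      adapter_sha256: str      # content hash of the ~9 KB adapter file
      size_bytes: int
      adapter_format: str = "norm-only"

  def announce(user_id: str, adapter_bytes: bytes) -> AdapterAnnouncement:
      return AdapterAnnouncement(
          user_id=user_id,
          adapter_sha256=hashlib.sha256(adapter_bytes).hexdigest(),
          size_bytes=len(adapter_bytes),
      )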

What this still doesn't claim

  • Generalisation beyond 30M scale. This entire result is at 30M parameters on short prose corpora. Whether norm-only retains its advantage at 1B+ parameters on heterogeneous real-user data is the production-readiness experiment that hasn't run yet. It's listed in the Open Questions chapter as one of the most consequential gaps.
  • Generalisation to non-prose users. The bets harness fixtures (programmer / novelist / scientist) are all prose-heavy. Users sending mostly emoji, mostly URLs, mostly broken grammar, or mostly multilingual code-switching are not covered. The norm-only primitive may need to be paired with steering vectors (Bet 58) or LoRA (Bet 49 fallback) for these cohorts.
  • Robustness under quantisation. Bet 52 showed norm-only composes with ternary base quantisation; Bet 53 showed quantising the adapter itself is borderline. Production ships ternary base + raw fp16 norm-only adapter; quantising the adapter further is a per-deployment policy decision.

Run command

PYTHONPATH=src python -m experiments.bets.37_norm_only

Output writes to experiments/bets/results/37_norm_only.json. The same module exposes the trained adapter as out/adapter_norm_only.pt; subsequent bets load it directly to compose with quantisation, steering vectors, or other adapter formats.
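
Loading the saved adapter elsewhere is a partial state-dict load; a sketch assuming an already-instantiated base_model whose norm-gain keys match the adapter's keys (strict=False leaves the frozen base parameters untouched):

  import torch

  adapter = torch.load("out/adapter_norm_only.pt", map_location="cpu")
  missing, unexpected = base_model.load_state_dict(adapter, strict=False)
  assert not unexpected, "adapter contains keys the base model does not have"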

The Bet 37 → Bet 46 → Bet 49 → Bet 60 → Bet 61 chain is the most-cited path through the catalogue. Read the chain in order, together with the adjacent bets that build on it:

  • Bet 37 (this entry): the initial result. Norm-only beats full FT by 6× to 100,000× on a single seed.
  • Bet 46: 15/15 replication across 5 seeds and 3 eval texts.
  • Bet 49: 3/3 STRICT shootout against LoRA-r4 and full FT across distinct users.
  • Bet 58: alternative primitive (last-layer steering) — wins for programmer user only.
  • Bet 60: the noise-floor negative control. Most of the apparent personalisation is regularisation.
  • Bet 61: the personalization-vs-regularisation confusion matrix. Own-adapter wins by 5–29% — the actual personalisation signal.
  • Bet 52: norm-only composes with ternary base quantisation.
  • Bet 53: int8 quantisation of the adapter itself is borderline; ship raw.
  • Bet 54: averaging across users is destructive. The federation must route, not average.
  • Bet 55: logit-ensembling adapters for unknown users does not generalise. Routing is non-optional.

Why it matters

This is the single most consequential bet in the catalogue. The federation's entire deployment economics are downstream of the 9 KB per-user figure, which is downstream of this bet, which is sharpened (and partly retracted) by the follow-ups. Reading Bet 37 in isolation gives an incomplete picture; reading the chain end-to-end gives the calibrated one. The catalogue keeps both readings visible because the calibration is the work.