Bet 60 — Norm-only noise floor
The negative control we should have run alongside Bet 37 from the very start of the personalisation work. The simplest possible question: does the norm-only adapter win because it learned the user's language distribution, or does it win because the structural restriction to norm gains is acting as such a strong implicit regulariser that it would beat full fine-tune even on completely random training data?
These two hypotheses make almost identical predictions for most of what we measure. Either way, norm-only adapters produce dramatically better held-out perplexity than full fine-tune on the same compute budget. Either way, norm-only is small. Either way, the production default story works. But they have wildly different implications for the personalisation claim specifically. If the win comes from regularisation, then "norm-only fits the user" is misleading — what's actually happening is "norm-only is too restrictive to overfit, full FT isn't, so under a 5-minute budget norm-only wins by default." The personalisation may be illusory.
The way to disentangle these is a textbook negative control. Train the same norm-only adapter on input that has no language signal at all — random uniform tokens drawn from the vocabulary — and measure held-out same-author perplexity. If norm-only is doing personalisation, real-text training should beat random-text training by a meaningful margin. If norm-only is purely regularisation, real-text and random-text should produce similar held-out behaviour, because neither is really learning anything useful — both are just forcing the adapter into a generic regularised neighbourhood of the base model.
This bet is the negative control. We delayed running it. We shouldn't have.
Setup
The training pipeline for norm-only is identical to Bet 37: AdamW, lr 5e-5, 100 steps, 5 minutes wall-clock on M1 Max. The only change: the training corpus.
- Real-text condition. Train on the user's personal corpus (programmer / novelist / scientist fixtures). This is the standard Bet 37 setup.
- Random-token condition. Train on tokens sampled uniformly from the model's 49,152-token vocabulary. No structure, no grammar, no recurring vocabulary, no language signal at all.
Both conditions train the same parameters (the ~2,300 RMSNorm gains) for the same number of steps with the same hyperparameters. The held-out evaluation is identical: same-author held-out text from the user's actual corpus. If real-text training beats random-text training by a margin, the margin is the personalisation signal. If they're tied, the personalisation signal is illusory.
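A minimal sketch of the two training conditions, assuming a HuggingFace-style causal LM whose RMSNorm gain parameters carry "norm" in their names; the name filter, sequence length, and corpus loading are placeholders rather than the bet's actual code.

```python
# Sketch of the Bet 60 conditions. Assumptions: HF-style causal LM, RMSNorm
# gains identifiable by "norm" in the parameter name; hyperparameters mirror
# the Bet 37 setup described above.
import torch

VOCAB_SIZE = 49_152   # model vocabulary size (from the setup above)
SEQ_LEN = 512         # assumed training chunk length
STEPS = 100
LR = 5e-5

def norm_only_parameters(model):
    """Freeze everything except the ~2,300 RMSNorm gain scalars."""
    trainable = []
    for name, p in model.named_parameters():
        p.requires_grad_("norm" in name.lower())
        if p.requires_grad:
            trainable.append(p)
    return trainable

def random_token_batches(batch_size=4):
    """Random-token condition: uniform draws from the vocabulary, no structure."""
    while True:
        yield torch.randint(0, VOCAB_SIZE, (batch_size, SEQ_LEN))

def train_norm_only(model, batches):
    """Same optimiser, step budget, and trainable set for both conditions."""
    opt = torch.optim.AdamW(norm_only_parameters(model), lr=LR)
    model.train()
    for _, input_ids in zip(range(STEPS), batches):
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model

@torch.no_grad()
def heldout_ppl(model, heldout_ids):
    """Held-out same-author perplexity (lower is better), used for both conditions."""
    model.eval()
    return torch.exp(model(input_ids=heldout_ids, labels=heldout_ids).loss).item()
```

The real-text condition swaps `random_token_batches` for batches drawn from the user's fixture corpus; everything else is held fixed.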
Pre-registered criteria
- STRICT: real-text norm-adapter beats random-text norm-adapter on all 3 of 3 users by ≥ 1.5×.
- LENIENT: real beats random on ≥ 2 of 3 users by ≥ 1.1×.
- CATASTROPHIC: random beats real on ≥ 2 of 3 users — would falsify the entire personalisation thesis.
The 1.5× STRICT bar was deliberately conservative. We expected the real-text adapter to beat the random-text adapter by a wide margin, given the tens-of-thousands-of-times margin that Bet 37 had measured against full FT. We expected to be writing a victorious affirmation of the personalisation thesis after this bet.
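As a concrete reading of the thresholds above, a small classifier over per-user perplexities; the argument layout is illustrative, not the harness's actual schema.

```python
# Illustrative encoding of the pre-registered criteria. `results` maps each
# user to (real_text_ppl, random_text_ppl); lower perplexity is better.
def classify(results):
    margins = {u: rnd / real for u, (real, rnd) in results.items()}
    if sum(m < 1.0 for m in margins.values()) >= 2:
        return "CATASTROPHIC"   # random beats real on >= 2 of 3 users
    if all(m >= 1.5 for m in margins.values()):
        return "STRICT"         # real wins on all 3 users by >= 1.5x
    if sum(m >= 1.1 for m in margins.values()) >= 2:
        return "LENIENT"        # real wins on >= 2 of 3 users by >= 1.1x
    return "FAIL"

# The measured Bet 60 numbers land on LENIENT:
# classify({"programmer": (6922, 6525), "novelist": (113, 154),
#           "scientist": (130, 143)}) == "LENIENT"
```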
Result — LENIENT PASS, with one inversion that shifts the framing
The result is humbling. The numbers:
| User | Real-text adapter ppl | Random-text adapter ppl | Real wins? |
|---|---|---|---|
| programmer | 6,922 | 6,525 | no — random wins by 6% |
| novelist | 113 | 154 | yes, by 1.36× |
| scientist | 130 | 143 | yes, by 1.10× |
Real text wins on 2 of 3 users, by 1.10× and 1.36× margins. Random text wins for the programmer user, by 6%. That third row is the inversion that changed the framing of the entire personalisation story.
Real-text training on the programmer's actual code corpus produced a held-out ppl of 6,922. Training the same adapter for the same number of steps on uniformly random tokens — text that is grammatically meaningless, lexically scrambled, and shares no structural patterns with code — produced a held-out ppl of 6,525. Random training beat real training. Not by a huge margin (6%), but the inversion is real and replicable across the seeds we tried.
We had not predicted this. The result forced a serious re-reading of every previous bet that had cited Bet 37's framing.
What this means
The norm-only adapter is doing two things at once, in proportions that vary by user:
- Implicit regularisation. Restricting the trainable parameter set to the ~2,300 RMSNorm gain scalars prevents catastrophic overfit on small corpora. This effect is present regardless of what the adapter is trained on. It's the structural property of the adapter format itself. Even random-token training produces an adapter that's a dramatically better held-out predictor than full-FT-on-personal-text — because the regularisation effect dominates the personalisation effect when the corpus is small (5 minutes of training data).
- Per-user distribution learning. Training on the user's actual text does shift the adapter toward the user's distribution. For the novelist and scientist, this shift is large enough (1.10× to 1.36×) to dominate the regularisation effect. For the programmer, the shift is smaller than the regularisation effect's variance — small enough that random-token training, which doesn't push the adapter in any particular direction, ends up with a slightly luckier held-out ppl than real-text training.
The interpretation we now hold: most of the apparent personalisation in Bet 37 was actually regularisation. The 100,000× gap over full FT in some configurations was almost entirely explained by full FT being a worse baseline (because of overfit) than no-adapter, not by norm-only being uniquely well-fit to the user. The personalisation signal is real but smaller than the original framing implied.
Why the programmer user inverts
Code corpora are not random text, but their vocabulary distribution and structural patterns are unusually narrow. The programmer fixture is heavy on Python keywords, identifier conventions, and a small set of standard-library function names. Training on this corpus, the norm-only adapter shifts toward those patterns — but the shift can also push the adapter slightly toward overfitting on the specific vocabulary in the training slice, hurting held-out generalisation on the user's future code, which uses different identifiers.
Random-token training doesn't push the adapter in any particular direction. The norm gains end up in a "regularised generic" position — slightly better than the base model, slightly different from the base model, but not biased toward any specific vocabulary. On held-out programmer text, this generic-regularised position happens to predict better than the adapter that's been pulled toward a specific subset of the programmer's code.
This is consistent with the broader pattern in the literature: structurally constrained fine-tuning is partly regularisation, partly personalisation. Which one dominates depends on how narrow the user's training distribution is relative to the held-out test distribution. For the novelist (broad prose), the personalisation effect dominates. For the scientist (academic prose, similar to held-out), the personalisation effect is small. For the programmer (narrow vocabulary, drift between training and held-out), the personalisation effect is overwhelmed by the regularisation noise.
How this changed downstream framing
Three concrete changes:
- Bet 37's writeup was tightened. The headline of "norm-only beats full FT by 6× to 100,000×" remains accurate as a measurement, but the interpretation changed from "norm-only fits the user" to "norm-only is dominantly a regulariser, with a small but real personalisation component." The 100,000× number stays in the historical record but is no longer the headline.
- Bet 61 ran the disambiguating follow-up. Bet 61 trains one norm-only adapter per user and evaluates every adapter on every user's held-out text. The diagonal-vs-off-diagonal comparison cleanly separates personalisation from regularisation: regularisation predicts no diagonal advantage, personalisation predicts the diagonal wins. The diagonal does win, by 5%–29% per user. The personalisation signal is real after all — it's just the small component, with regularisation as the dominant component.
- The federation's deployment story added a caveat. The per-user adapter format is justified by the combination of regularisation + personalisation effects. Either alone would not produce a usable per-user adapter. Together, they do. Production deployment of norm-only is therefore stable, but the marketing framing ("personal AI for everyone") was revised in this catalogue's writing to clarify what the personalisation actually delivers (small but real distributional shift) versus what the format mostly does (regularised generic adaptation that helps held-out perplexity).
What this rules out for future work
- The strong personalisation hypothesis. "Norm-only learns the user's distribution, end of story." This is no longer claimable. The personalisation component is real but bounded.
- Single-fixture personalisation studies. The bets harness now requires negative-control (random-text) training as a baseline for any future personalisation primitive. This bet established the methodology rule: if you can't beat random-input training of the same primitive, your "personalisation" claim is regularisation in disguise (see the sketch after this list).
- The "personal AI" marketing framing without controls. Federation deployment can claim per-user adapters; it can't claim "AI that learns you" without showing the specific, measured personalisation margin separated from the regularisation effect. The catalogue prefers to say "9 KB per-user adapter that captures a small but real distributional signal" — which is what Bet 61's confusion matrix supports.
What this leaves open
- Programmer fixture is small. The 6% inversion rests on a single eval slice and only the seeds we happened to try. Further replication (5 seeds × multiple eval slices, mirroring Bet 46's design) would show whether the inversion is stable. Worth running, not yet run.
- Scaling behaviour. At 1B+ parameters, the norm-only adapter has more raw capacity (in absolute terms). Whether the regularisation/personalisation balance shifts at scale is unknown. The 1B+ scale open question is in the Open Questions chapter.
- Other adapter formats. This bet ran the noise-floor control specifically for norm-only. LoRA-r4 and full FT noise-floor controls would be useful for completing the picture; not yet run.
- Longer training. 100 steps on 5 minutes of data is the federation's deployment scenario. With 1,000 steps on 50 minutes of data, the personalisation/regularisation balance probably shifts toward more personalisation. Not yet measured.
Production rule
The federation continues to ship norm-only at 9 KB as the per-user adapter default. Bet 60 does not falsify the production decision; it tightens the framing of why the production decision works. The deployment economics still rest on:
- 9 KB per user.
- Composability with ternary base (Bet 52).
- Pareto-dominance of LoRA-r4 and full FT (Bet 49).
- A real, replicable, but modest personalisation signal (Bet 61).
What changed is the marketing language and the catalogue's internal honesty about what the adapter is doing.
The methodology lesson
This bet is the cheapest piece of evidence we have that the methodology is calibrated. We had a finding (Bet 37) that looked extraordinary. The reviewer pushback flagged the lack of negative controls. We ran the negative control. The control demonstrated that part of the original framing was wrong, and we updated. The catalogue now reads more honestly than it did before; the production decision is unchanged because it was justifiable by the real effect (smaller personalisation + dominant regularisation), not by the imagined effect (huge personalisation alone).
If this bet hadn't run, the catalogue would still claim "norm-only fits the user by 100,000×" — a claim that is technically a measurement but is misleading as a description. Negative controls produce calibration; the catalogue's credibility on every other entry is propped up by entries like this one, where the negative control fired and changed the framing.
The methodological rule going forward: for any positive personalisation claim, run the random-input control. If the claim survives the control, it's real. If it doesn't, the framing has to be tightened. This rule is in the methodology page; it's now load-bearing for any future per-user adapter work.
Run command
PYTHONPATH=src python -m experiments.bets.60_norm_only_noise_floor
Output: experiments/bets/results/60_norm_only_noise_floor.json. The bet writes both real-text-trained and random-text-trained adapters to out/60_real_<user>.pt and out/60_random_<user>.pt for downstream analysis.
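A small post-hoc sketch of how the saved adapters might be compared, assuming each .pt file holds a dict of RMSNorm gain tensors keyed by parameter name; that layout is an assumption, not a documented format.

```python
# Hypothetical downstream analysis of the saved Bet 60 adapters. Assumes each
# .pt file is a dict of named RMSNorm gain tensors; the real layout may differ.
import torch

for user in ("programmer", "novelist", "scientist"):
    real = torch.load(f"out/60_real_{user}.pt", map_location="cpu")
    rand = torch.load(f"out/60_random_{user}.pt", map_location="cpu")
    # How differently did real-text vs random-token training move the gains?
    # A small gap is what the regularisation-dominates reading predicts.
    gap = torch.cat([(real[k] - rand[k]).flatten() for k in real]).abs().mean()
    print(f"{user}: mean |real - random| gain gap = {gap.item():.6f}")
```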
Related entries
- Bet 37: the original norm-only result. Headline tightened by this bet.
- Bet 46: 5 seeds × 3 eval text replication. Still 15/15, but reinterpreted.
- Bet 49: the production-default shootout. Decision unchanged, framing tightened.
- Bet 61: the follow-up confusion matrix that cleanly isolates the personalisation signal.
- Bet 63: numerical robustness of the personalisation signal under hidden-state noise.
Why it matters
The catalogue's calibration depends on bets like this one. We had an extraordinary finding; the negative control showed it was less extraordinary than first read. We retracted the headline and tightened the framing. The production decision held under the new framing because the underlying effect was still real — just smaller. The methodology gained a rule: run noise-floor controls before claiming personalisation. The system gained a more honest description of what its adapter format actually does. The reader gains the calibration: the wins in this catalogue have survived their negative controls; the framings are what the data supports, not what we hoped to find.