Bet 49 — Adapter shootout (norm vs LoRA vs full FT)
The head-to-head that turned norm-only from "interesting result in a single bet" into "production default for the federation's per-user adapter format." Three distinct user fixtures. Three adapter methods. One fixed compute budget (5 minutes, 100 steps, lr 5e-5, identical for all three methods). Read the result, pick a default, ship it.
This bet exists because previous results gave a partial picture. Bet 37 had shown that norm-only beats full FT by 6× to 100,000× on a single user fixture. Bet 46 had replicated that across 5 seeds × 3 eval texts (15/15 STRICT). Both bets were strong, but neither answered the production question directly: of the three serious candidate adapter formats, which one wins, on which users, by what margin, and with what tradeoff? That is what Bet 49 measured.
The three candidate formats
Three structurally different parameter-efficient fine-tuning approaches:
- Norm-only. Train only the gain parameters of every RMSNorm layer in the model. ~2,300 params at 30M scale, 9 KB on disk. Restrictive structural prior: per-channel scalar multiplier on residual norms. Cannot rotate the residual stream, can only rescale.
- LoRA-r4. Add a rank-4 update to each attention projection (Q, K, V, O) and FFN matrix. ~25,000 params at 30M scale, 96 KB on disk. Less restrictive: low-rank update can rotate the projections. The de facto standard parameter-efficient fine-tuning method since 2021.
- Full FT. Fine-tune every parameter. ~38M params at 30M scale, 155 MB on disk. No restriction. Maximum capacity, maximum risk of overfit on small corpora.
The three formats span 4 orders of magnitude in parameter count and 4 orders of magnitude in disk size. They are not alternative parameterisations of the same thing — they're qualitatively different choices about which parts of the model are allowed to update. The shootout asks which choice produces the best held-out perplexity at the same compute budget.
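To make the structural difference concrete, here is a minimal sketch (in PyTorch, which the harness's run command suggests but does not confirm) of how the two selection-based methods reduce to a trainable-parameter mask. The "norm"-in-name convention is an assumption about parameter naming, not something this bet documents; LoRA-r4 appears only in a comment because it adds new matrices rather than selecting existing ones.

```python
import torch.nn as nn

def select_trainable(model: nn.Module, method: str) -> int:
    """Mark which existing parameters train under each method.

    'full'      -> everything trains (~38M params at 30M scale).
    'norm_only' -> only RMSNorm gains train (~2,300 params).
    LoRA-r4 is different in kind: it freezes everything and ADDS new
    rank-4 A/B matrices beside each projection, so it cannot be
    expressed as a requires_grad mask over existing parameters.
    """
    for name, param in model.named_parameters():
        if method == "full":
            param.requires_grad = True
        elif method == "norm_only":
            param.requires_grad = "norm" in name  # assumed naming convention
        else:
            raise ValueError(f"unknown method: {method}")
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```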
The three user fixtures
The bets harness has standardised on three user fixtures designed to span different language distributions:
- Programmer — Python source code, technical documentation, code comments, API references. Heavy in punctuation, identifiers, and structured tokens. Vocabulary skewed toward CamelCase, snake_case, and library-specific names.
- Novelist — long-form fiction, descriptive prose, dialogue, character-internal narration. Heavy in adjectives, subordinate clauses, and figurative language. Vocabulary skewed toward emotional register and sensory description.
- Scientist — academic abstracts, methods sections, results paragraphs, citation-rich prose. Heavy in passive voice, hedged claims, and domain-specific terminology. Vocabulary skewed toward Latin-derived technical terms and statistical phrasing.
These fixtures were constructed (by hand, from public-domain corpora) precisely to test whether a personalisation primitive generalises across stylistically distinct users. If norm-only worked only for prose-style users, it would not be a defensible federation default; the federation's user base spans all three styles and many more.
Pre-registered criteria
The bet's STRICT criterion required norm-only to Pareto-dominate the other two methods, where Pareto-domination means: lower personal-text perplexity AND smaller adapter size AND no worse general-text degradation than either alternative.
- STRICT: norm-only Pareto-dominates LoRA-r4 and full FT for all 3 of 3 users.
- LENIENT: Pareto-dominates for 2 of 3 users.
- CATASTROPHIC: Pareto-dominates for 0 of 3 users (would falsify the production-default thesis).
Pareto-domination is a stricter test than just "lower ppl." A method could have lower ppl but worse general-text degradation, in which case it wouldn't dominate. We required all three criteria simultaneously.
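As a sketch, the pre-registered predicate might look like the following; the dataclass fields are illustrative names, not the harness's actual schema:

```python
from dataclasses import dataclass

@dataclass
class AdapterResult:
    personal_ppl: float  # held-out same-author perplexity (lower is better)
    general_ppl: float   # held-out general-corpus perplexity (lower is better)
    size_bytes: int      # adapter size on disk (smaller is better)

def pareto_dominates(a: AdapterResult, b: AdapterResult) -> bool:
    """The bet's criterion: `a` dominates `b` iff it has strictly lower
    personal-text ppl, strictly smaller adapter size, and no worse
    general-text ppl. STRICT requires norm-only to dominate both
    alternatives for all three users."""
    return (a.personal_ppl < b.personal_ppl
            and a.size_bytes < b.size_bytes
            and a.general_ppl <= b.general_ppl)
```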
Result — STRICT 3/3 PASS
The headline numbers, held-out same-author perplexity:
| User | Norm-only ppl | LoRA-r4 ppl | Full FT ppl |
|---|---|---|---|
| programmer | 6,922 | 8,140 | 84,000+ |
| novelist | 113 | 198 | 91,000+ |
| scientist | 130 | 245 | 400,000+ |
Norm-only wins on every user. The margin against LoRA-r4 is modest (1.18× to 1.88×); the margin against full FT is overwhelming (12× to over 3,000×).
The general-text Pareto check, measured on a held-out general corpus (no per-user signal):
| User adapter | Norm-only general ppl | LoRA-r4 general ppl | Full FT general ppl |
|---|---|---|---|
| programmer | 142 | 148 | 8,400+ |
| novelist | 138 | 145 | 12,100+ |
| scientist | 140 | 152 | 18,800+ |
Norm-only's general-text degradation is slightly lower than LoRA-r4's on all three users (by 4–9%) and dramatically better than full FT's. The Pareto-domination is clean: norm-only is at least as good as LoRA-r4 on personal text and on general text while being 10× smaller, and it beats full FT on both axes by orders of magnitude.
Why each method behaves the way it does
Full FT behaves the way the parameter-to-data ratio would predict. 38M parameters against a 5-minute corpus is a textbook overfitting setup. The model rapidly memorises the training corpus and loses generalisation; the held-out personal text isn't in the memorised corpus, so the model produces near-noise for it. The general-text catastrophe (ppl of 8,400+ to 18,800+, versus roughly 140–150 for the parameter-efficient methods) is the unmistakable signature of overfit. Full FT could be made to work with longer training, more data, better regularisation, or all three; at a fixed 5-minute budget it's strictly worse than the parameter-efficient alternatives.
LoRA-r4 behaves close to ideally for its capacity. ~25,000 parameters is enough to learn the per-user distribution shift without overfitting on the 5-minute corpus. Held-out personal-text ppl is in the right neighbourhood, general-text degradation is small, adapter size is reasonable. LoRA-r4 is a perfectly defensible production choice for many federation deployments — it's just slightly worse than norm-only on these fixtures, and 10× larger.
Norm-only behaves better than its raw parameter count would predict. With only 2,300 parameters, you'd expect it to underfit. Instead it generalises beautifully on the held-out same-author text, and its general-text degradation is the smallest of the three. The structural prior — "per-channel scaling of residual norms is the right adapter geometry for personalisation" — turns out to be unusually well-aligned with the actual structure of per-user distribution shifts. Why this works at all, mathematically, is partly an open question; the empirical answer is "it does."
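For concreteness, here is the standard RMSNorm computation with its trainable gain (a generic sketch, not the catalogue's actual module); the per-user adapter is just the trained gain vectors gathered from every norm layer:

```python
import torch

def rmsnorm(x: torch.Tensor, gain: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """y = gain * x / rms(x). `gain` has shape (d_model,) and is the ONLY
    tensor a norm-only adapter trains: a per-channel scalar multiplier.
    There are no cross-channel terms, so the residual stream can be
    stretched channel-by-channel but never rotated."""
    inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return x * inv_rms * gain
```

Concatenated across all norm layers, those gain vectors are the ~2,300 scalars (~9 KB) that ship per user.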
What this does not rule out
This shootout was at a fixed 5-minute, 100-step compute budget. The results probably look different at:
- Longer training. With many hours of training and orders of magnitude more data, full FT presumably becomes competitive again. The federation deployment scenario is "5 minutes on the user's laptop", which is what we tested. Datacentre-scale fine-tuning is a different regime.
- Larger base models. At 30M parameters, norm-only's 2,300-param adapter is ~0.008% of the model's parameter count. At 1B+ parameters, the same proportion gives ~80,000 params (~320 KB). Whether the structural prior still holds at that scale is an open question; this bet does not validate it.
- More heterogeneous users. The three fixtures (programmer / novelist / scientist) are stylistically distinct but all clean prose. Users sending mostly emoji, mostly URLs, mostly code-switched multilingual text, or mostly broken grammar may behave differently. This is what the Kerala IT@School pilot (Open Questions chapter) is designed to measure.
- Different adapter compositions. Bet 58 found that single-layer last-block steering vectors beat norm-only by 32% for the programmer user (while losing on the other two). The adapter design space is bigger than the three formats tested here. We picked norm-only as the production default because it's the best single-format choice across users; per-user adapter format selection is a possible future extension.
What the federation does with this result
Bet 49 settled the production per-user adapter format. The decision documented:
- Default per-user adapter format: norm-only fp16, ~9 KB. No further compression by default (Bet 53 showed compression is borderline).
- Composes with: ternary base quantisation (Bet 52, STRICT 3/3). The wire format is ternary base + raw norm-only adapter at ~6 MB shared + 9 KB per user.
- Routing required, averaging forbidden. Bets 54 and 55 showed averaging across users is destructive and ensembling for unknown users doesn't generalise. The federation must route each user to their specific adapter; there's no shortcut via merging.
- Glass-box compatible. The norm-only adapter participates in the mixture combiner (Bet 04) without breaking the per-token reconciliation property (Bet 18).
Every economic argument in the SharedLLM deployment story — Kerala IT@School fleet sizing, gossip directory entry size, settlement granularity, privacy and revocation — flows from this bet's outcome. If norm-only had lost the shootout, the federation would default to LoRA-r4 (workable, 10× larger) and the scaling math for community-owned deployment would be tighter.
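To see why the 9 KB number carries so much weight, here is a back-of-the-envelope storage sketch using this bet's sizes; the shared-base figure comes from Bet 52, and the fleet sizes are hypothetical, chosen only for illustration:

```python
SHARED_BASE_MB = 6.0  # ternary base, shared once across the fleet (Bet 52)
NORM_KB = 9.0         # norm-only per-user adapter (this bet's default)
LORA_KB = 96.0        # the LoRA-r4 fallback

def fleet_storage_mb(users: int, adapter_kb: float) -> float:
    """Total adapter-layer storage: one shared base plus one adapter per user."""
    return SHARED_BASE_MB + users * adapter_kb / 1024

for users in (1_000, 100_000):
    print(f"{users:>7,} users: "
          f"norm-only {fleet_storage_mb(users, NORM_KB):8.1f} MB, "
          f"LoRA-r4 {fleet_storage_mb(users, LORA_KB):8.1f} MB")
# ~15 MB vs ~100 MB at 1,000 users; ~885 MB vs ~9.2 GB at 100,000 users
```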
What this means for contributors
Anyone proposing a new adapter format for the federation needs to clear this shootout's bar:
- Run on the three fixtures (programmer / novelist / scientist).
- Hold out same-author text as the eval set.
- Measure both personal-text ppl AND general-text ppl.
- Compare against norm-only at the same compute budget.
If the new format Pareto-dominates norm-only (lower personal-text ppl, smaller or equal adapter size, no worse general-text degradation), it's a candidate replacement. The catalogue welcomes such submissions; passing this protocol is what admits a format into the production wire format.
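A minimal sketch of the perplexity measurement the protocol asks for, assuming a causal LM that maps token ids to next-token logits (the harness's real interface may differ). Run it once on held-out same-author text and once on the general corpus, then apply a Pareto check like the one sketched under the pre-registered criteria:

```python
import math
import torch

@torch.no_grad()
def perplexity(model, token_ids: torch.Tensor) -> float:
    """Perplexity of a (1, seq_len) tensor of token ids under a causal LM.
    Assumes `model(ids)` returns logits of shape (1, seq_len, vocab)."""
    logits = model(token_ids[:, :-1])          # predict each next token
    logprobs = torch.log_softmax(logits, dim=-1)
    targets = token_ids[:, 1:].unsqueeze(-1)   # shifted gold tokens
    nll = -logprobs.gather(-1, targets).mean() # mean next-token NLL
    return math.exp(nll.item())
```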
Run command
PYTHONPATH=src python -m experiments.bets.49_adapter_shootout
Output: experiments/bets/results/49_adapter_shootout.json, with three trained adapters in out/49_norm_<user>.pt, out/49_lora_<user>.pt, out/49_full_<user>.pt. The trained norm-only adapters from this bet are loaded by Bets 52, 53, 54, 55, 60, 61, 63 — this bet's outputs are reused throughout the catalogue.
Related entries
- Bet 37: the original norm-only result (single seed).
- Bet 46: 15/15 replication across 5 seeds × 3 eval texts.
- Bet 58: alternative primitive (last-layer steering); wins for programmer only.
- Bet 60: noise-floor negative control. Real beats random by a smaller margin than Bet 37's framing implied.
- Bet 61: confusion matrix. Own-adapter wins by 5–29% — the cleanest evidence of personalisation as a real signal.
- Bet 52: norm-only composes with ternary base quantisation.
- Bet 53: int8 of the adapter is borderline.
- Bets 54, 55: averaging and ensembling don't substitute for routing.
Why it matters
A federation can ship many things at the per-user layer. It can ship the entire fine-tuned model, or a low-rank update, or a tiny set of scalars, or a steering vector, or a soft prompt, or some combination. The bet exists because picking matters, and picking blindly is wasteful: adapters 10× larger than necessary hurt federation economics by 10×, on a number that scales with user count. By picking norm-only on three-user evidence, the federation accepts a 9 KB per-user cost as the deployment baseline. Every downstream economic decision (storage, routing, settlement) is predicated on that 9 KB number, and that number traces back here.