Personalization at 1B+ scale

The norm-only adapter wins at 30M parameters on short clean-prose corpora. The federation's deployment story assumes this primitive scales — that 1B+ models with real-user heterogeneous data still admit per-user norm-only adapters that beat LoRA-r4 and full fine-tune. The assumption is untested. The bet's harness operates at 30M because that's what fits in a single laptop's memory budget for rapid iteration; 1B+ requires either a real cluster or a substantial compute budget that the harness doesn't have. Until the experiment runs, every claim about per-user adapters at deployment scale is extrapolation.

This open question matters because the federation's deployment economics depend on the per-user adapter being small (KB-scale, not MB-scale) and effective at scale. If norm-only's competitive edge collapses at 1B+ — if LoRA-r4 or full FT becomes necessary at scale — the federation's storage cost story shifts from "9 KB per user" to "MB per user," and the deployment math (Kerala IT@School at 1.9 GB total adapter storage; hypothetical 10M-user federation at 90 GB) becomes proportionally larger. The bet that gates the deployment thesis is "norm-only at scale."
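The storage economics above are easy to make explicit. A minimal sketch, using only figures already quoted in the deployment story (9 KB per user; the 215k-device Kerala fleet implied by the 1.9 GB total; a hypothetical 10M-user federation; decimal units):

```python
# Fleet-level adapter storage for the norm-only format.
# All per-user sizes and fleet sizes are the ones quoted in the text.

KB, GB = 1e3, 1e9  # decimal units, as in the deployment figures

def fleet_storage_gb(users: int, adapter_bytes: float) -> float:
    """Total per-user adapter storage for a fleet, in GB."""
    return users * adapter_bytes / GB

kerala_gb = fleet_storage_gb(215_000, 9 * KB)         # ≈ 1.9 GB
federation_gb = fleet_storage_gb(10_000_000, 9 * KB)  # 90 GB
```

If norm-only fails at scale and the per-user adapter becomes MB-scale, the same arithmetic multiplies every fleet total by the per-user blow-up factor.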

What's been validated at 30M

The catalogue's positive results at 30M are strong:

  • Norm-only beats LoRA-r4 and full FT on personal-text perplexity (Bet 49, 3/3 STRICT pass across programmer/novelist/scientist users at the 5-minute training budget). Norm-only Pareto-dominates the richer adapters in this regime.
  • 5 seeds × 3 eval texts replicates 15/15 (Bet 46). The result isn't a single-seed accident; the personalisation effect is robust across seeds and eval texts.
  • Personalisation signal survives the noise-floor control (Bet 60). Most of the apparent benefit was regularisation rather than personalisation, but a real personalisation signal exists at the 5–29% margin level.
  • Confusion matrix isolates the signal (Bet 61). Own-adapter wins by 5–29% margin against other-user adapters and base. Personalisation is real; it's just smaller than the headline ppl drop suggested.
  • Numerical robustness holds (Bet 63). Hidden-state Gaussian noise at σ_rel = 1e-1 doesn't destabilise personalisation. 4 orders of magnitude of headroom against fp16 round-off.
  • Composition with ternary base (Bet 52). Norm-only adapter on top of ternary-quantised base recovers most of the per-user benefit. The production wire format works.

Each of these results is at 30M. The pattern is consistent and well-controlled; whether it is scale-invariant is the open question.
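For concreteness, the norm-only recipe these results validate can be sketched in a few lines of PyTorch. This is an illustration, not the harness's actual code: `make_norm_only` is a hypothetical helper, and the exact set of norm classes depends on the architecture (LayerNorm here; RMSNorm in most modern 1B-scale models):

```python
import torch.nn as nn

# RMSNorm ships only with newer PyTorch versions; include it when present.
NORM_TYPES = (nn.LayerNorm,) + ((nn.RMSNorm,) if hasattr(nn, "RMSNorm") else ())

def make_norm_only(model: nn.Module) -> list:
    """Freeze everything except per-channel norm parameters (gains, plus
    biases where the norm layer has them); return the trainable params."""
    trainable = []
    for module in model.modules():
        is_norm = isinstance(module, NORM_TYPES)
        # recurse=False visits each parameter exactly once, at its owner.
        for p in module.parameters(recurse=False):
            p.requires_grad = is_norm
            if is_norm:
                trainable.append(p)
    return trainable
```

An optimiser built over just these parameters (e.g. `torch.optim.SGD(make_norm_only(model), lr=...)`) trains the tiny per-user adapter while the base weights stay frozen and shared across users.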

What might break at 1B+

The norm-only primitive has a few mechanical scaling concerns:

  • Norm gains have fewer degrees of freedom relative to the model. At 30M parameters, the norm-only adapter has ~2,300 trainable parameters (~0.008% of total model parameters). At 1B parameters, the same proportion would be ~80,000 parameters (~320 KB on disk in fp32). The relative size stays small, but the absolute capacity might be too coarse for personalisation at frontier scale — or it might be sufficient, since the per-user signal is itself small. Untested.
  • Larger models have more redundancy in their forward pass. Modern 1B-scale models have substantial parametric slack — many channels in the residual stream that aren't doing critical work for any given input. Norm-only adjustments to per-channel scales might hit a ceiling earlier — there's just less for the gain parameters to push around if many channels are already near-optimal for general-purpose generation. Or the slack might be exactly what makes per-user fine-tuning easy at scale. Untested.
  • Real-user data is heterogeneous in ways the bets-harness fixtures aren't. Programmer / novelist / scientist are stylistically distinct, but they're all clean English prose, all from text corpora, all with consistent grammar and vocabulary. Real users send code mixed with emojis, broken-grammar chat, multilingual text, URLs, emoji-heavy messages, voice transcripts with disfluencies. Whether norm-only's personalisation primitive captures the heterogeneity of real-user distributions is unmeasured.
  • Quantisation interaction at scale. Bet 52 showed ternary base + norm-only adapter compose at 30M. At 1B, the quantisation noise might dominate the per-user signal more thoroughly. Bet 63's 4-orders-of-magnitude noise budget at 30M is encouraging but extrapolating to 1B is uncertain.
  • Training-budget dynamics. At 30M, 100 SGD steps in 5 minutes is enough to fit a per-user adapter. At 1B, per-step compute is ~30× larger, so the same 100 steps take ~2.5 hours on the same hardware; recovering the 5-minute budget requires hardware roughly 30× faster (a datacentre GPU rather than a laptop). The "interactive personalisation" UX (Bet 10's deployment narrative) needs revisiting at scale.
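The parameter-count concern in the first bullet is pure arithmetic: one gain per channel per norm layer. The shapes below are illustrative, chosen to reproduce the ~2,300 and ~80,000 figures quoted above rather than taken from any particular released model:

```python
def norm_adapter_params(d_model: int, n_layers: int,
                        norms_per_block: int = 2) -> int:
    """Per-channel norm gains: norms_per_block per transformer block,
    plus one final norm."""
    return d_model * (n_layers * norms_per_block + 1)

small = norm_adapter_params(d_model=256, n_layers=4)    # 2,304 ≈ the 30M figure
large = norm_adapter_params(d_model=2048, n_layers=19)  # 79,872 ≈ the 1B figure

# fp32 on disk: 4 bytes per gain.
small_kb = small * 4 / 1e3   # ≈ 9 KB
large_kb = large * 4 / 1e3   # ≈ 320 KB
```

The adapter grows with hidden width times depth, not with total parameter count, which is why it stays KB-scale even as the model crosses 1B.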

What might hold up at 1B+

There are also reasons to expect norm-only's edge to survive or even strengthen at scale:

  • The personalisation signal is small to begin with. Bet 61's 5–29% margin is the actual size of personalisation; this is small relative to the model's total capacity at any scale. A small signal is captured by a small adapter; norm-only's small parameter count is well-matched to the signal's small magnitude.
  • Larger models have richer pretrained representations. A 1B model's intermediate representations encode more linguistic structure than a 30M model's. Per-user gain adjustments to these richer representations may be more effective per parameter than at smaller scale.
  • Regularisation effects scale similarly. Bet 60's finding that most of the benefit is regularisation rather than personalisation is a property of small-data fine-tuning, not of model scale. The regularisation effect should hold or improve at scale.
  • Per-user data remains small at scale. A user has the same amount of personal text regardless of model scale. Larger model fine-tuning on small-data is exactly the regime where norm-only's parameter efficiency wins. LoRA-r4 and full FT both have more capacity to overfit on small data; norm-only's constraint to per-channel scalars is a feature, not a bug, in this regime.

The bet has to run to see which set of arguments wins.

What the experiment looks like

A concrete protocol for the 1B+ adapter shootout:

  1. Base model: a 1B+ open-weights model. Reasonable candidates: Llama 3.2 1B (Meta), Qwen2.5 1.5B (Alibaba), Gemma 3 1B (Google), Phi-3.5 mini 3.8B (Microsoft). Quantised to ternary or 4-bit for federation deployment per Bet 13's protocol.
  2. User fixtures: real personal corpora — emails, chat logs, code, notes — from at least 10 users, anonymised, with stable held-out splits. This is the data side of the experiment, and the hardest part to obtain. The Kerala IT@School pilot (separate open question) is a path; user-volunteer data with appropriate consent is another.
  3. Adapter training: norm-only (~320 KB per user in fp32), LoRA-r4 (~3 MB per user), and full FT (~1 GB per user), each under the same 5-minute training budget per Bet 49. Compare on held-out same-user perplexity.
  4. Confusion matrix: train one adapter per user, evaluate every adapter on every user (10×10 = 100 cells). Read the diagonal. This is the Bet 61 protocol at scale; it isolates personalisation from regularisation.
  5. Negative control: same as Bet 60 — train norm-only on random-token noise, compare to real-text. The "real text beats random by N%" margin is the actual personalisation signal.
  6. Composition with ternary: Bet 52 protocol at scale — quantise the 1B base to ternary, then train per-user adapters on top, measure recovery.
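Steps 3–4 reduce to a small double loop plus a diagonal read-out. A sketch, with `train_adapter` and `eval_ppl` as hypothetical stand-ins for the harness's train and eval entry points:

```python
def confusion_matrix(users, train_adapter, eval_ppl):
    """ppl[i][j]: perplexity of user j's held-out text under user i's
    adapter. Personalisation shows up as the diagonal beating the rest
    of its column (the Bet 61 read-out)."""
    adapters = {u: train_adapter(u) for u in users}  # one adapter per user
    ppl = {i: {j: eval_ppl(adapters[i], j) for j in users} for i in users}

    # Diagonal margin: how much each user's own adapter beats the best
    # foreign adapter on that user's held-out text (>0 means own wins).
    margins = {}
    for j in users:
        best_other = min(ppl[i][j] for i in users if i != j)
        margins[j] = (best_other - ppl[j][j]) / best_other
    return ppl, margins
```

With 10 users this is the 10×10 = 100-cell grid from step 4; the margins dict is the per-user analogue of Bet 61's 5–29% range.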

The outputs of this protocol are scale-extension equivalents of Bets 49, 60, 61, 52 — a coherent picture of what per-user adapters do at 1B+.

Constraints and feasibility

The experiment is a substantial compute investment:

  • Per-user adapter training: ~5 minutes × 10 users × 3 adapter formats = ~2.5 hours per round. Manageable on a single A100 or M-series Pro.
  • Confusion matrix evaluation: 10×10 evaluation cells × 3 adapter formats = 300 evaluation runs. ~1 minute each on 1B = ~5 hours per round.
  • Multi-seed, multi-eval-text discipline: 3 seeds × 3 eval-texts per result = 9 runs per cell. Total compute scales by 9×. ~70 hours of compute for the full protocol.
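The budget lines above compose into one back-of-envelope total; every number here comes from the bullets, nothing new:

```python
# Per-round compute for the 1B+ shootout, in minutes.
train_min = 5 * 10 * 3      # 5-min budget × 10 users × 3 adapter formats = 150
eval_min = 10 * 10 * 3 * 1  # 10×10 cells × 3 formats × ~1 min each = 300

round_hours = (train_min + eval_min) / 60  # 7.5 h per round
protocol_hours = round_hours * 3 * 3       # 3 seeds × 3 eval texts ≈ 67.5 h
```

The ~70-hour figure in the text is this 67.5 h total, rounded up.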

The compute is feasible — a few days on a single GPU. The harder constraint is the user data: 10 anonymised user corpora with stable splits is not a trivial data-acquisition task. The Kerala IT@School pilot would need to land before that data is realistically available; alternatives include:

  • Volunteer-contributed data. Researchers and federation contributors who consent to share their own personal text. Small N, high consent quality.
  • Public synthetic personas. Use LLM-generated user personas, each with distinct style and topic preferences. Avoids the consent problem but doesn't validate "real-user heterogeneity" — it tests "synthetic-persona heterogeneity" which is a different thing.
  • Open code/document corpora segmented by author. Public Git repos, open-licensed writing collections. Real authors but limited data per author.

What we'd learn

If the diagonal advantage holds at 1B+ (own adapter wins by ≥ 5% margin on most users), the production-readiness story closes. Norm-only is the federation's per-user adapter format at any scale; the deployment economics (Kerala fleet at 1.9 GB total adapter storage) are real.

If it doesn't, the federation needs a richer adapter format than norm-only at scale. The catalogue's contingency plan would be:

  • LoRA-r4 as the production default at 1B+. 3 MB per user instead of 9 KB. Kerala fleet adapter storage becomes 215k × 3 MB = 645 GB: significantly larger but still tractable for centralised storage, and per device a single 3 MB adapter is negligible.
  • Hierarchical adapters. Cluster adapters (low-MB scale) per user-cluster + small per-user adapters for the within-cluster signal. The Bet 54 cross-user averaging result is mostly negative, but tight clusters of similar users may share an adapter usefully.
  • Adapter quantisation. Bet 53 was borderline at 30M; at 1B the per-user signal might be larger relative to quantisation noise, making int8 adapter compression viable. Untested.
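The first fallback's storage shift is worth quantifying against the norm-only baseline. A quick check using the figures from the list above (decimal units):

```python
# Contingency storage: LoRA-r4 fallback vs the norm-only baseline.
users = 215_000                           # Kerala fleet size from the text
norm_only_bytes, lora_bytes = 9e3, 3e6    # 9 KB vs 3 MB per user

norm_fleet_gb = users * norm_only_bytes / 1e9  # ≈ 1.9 GB
lora_fleet_gb = users * lora_bytes / 1e9       # 645 GB
blowup = lora_bytes / norm_only_bytes          # ≈ 333× per user
```

A ~333× per-user blow-up is why the contingency tree matters: it changes the storage story's order of magnitude, not just its constant factor.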

The catalogue's role is to keep this contingency tree explicit. The federation hasn't committed to "norm-only forever"; it's committed to "norm-only is the production default while it works, and we have known fallbacks if it doesn't."

How this connects to other open questions

The 1B+ experiment depends on real-user data that the bets harness doesn't have. Getting that data is partly a partnerships problem (the Kerala IT@School pilot open question is the obvious path), partly a technical problem (privacy-preserving anonymisation, federated data collection), and partly a methodology problem (how to construct held-out splits from naturally heterogeneous user corpora).

The 1B+ experiment is also gated by real-WAN throughput to some degree — if the federation's network behaviour at 1B is dominated by per-token network costs, the personalisation analysis has to factor that in. At 30M, network costs are negligible; at 1B, they're potentially significant.

Until this experiment runs, every claim about the federation's per-user adapter is scoped to the 30M scale on clean prose. That scope is much narrower than the deployment story implies. Acknowledging the gap is what the Open Questions chapter is for.

Why this stays open

The bet hasn't run because the necessary inputs (user data, compute budget, evaluation infrastructure at scale) haven't been simultaneously available. Each one is achievable individually; the combination requires either a partnership (Kerala) or a substantial single-investigator commitment (compute + data + time). The catalogue's discipline is to be explicit about what's scoped and what isn't.

The catalogue's strongest current claim about per-user adapters is "feasible at 30M scale on clean prose, with the personalisation signal isolated by Bet 61 at 5–29% per user." That's a calibrated claim, not a marketing claim. The federation deployment story is built on it, with the explicit qualifier that the 1B+ extension is open work the catalogue hasn't completed.

When this question resolves — in either direction — it will be the most consequential update to the catalogue. Until then, the federation's per-user adapter story has a known scoping limit, and the catalogue is honest about it.