Bet 10 — Specialist per person

The bet that anchors the per-person federation thesis. Can each user have their own personal specialist, trained in seconds on a laptop, without destroying the model's general capabilities? At the 30M scale on clean prose corpora, yes — the bet held, with caveats that later bets sharpened materially.

The federation's deployment story depends on this question's answer. If specialist-per-person required a datacentre and four hours of GPU time, the federation could only personalise for a handful of premium users. If it requires five minutes on a laptop, the federation can personalise for every Kerala IT@School student in the fleet. The numerics here matter for the deployment economics in a way that few other bets do.

Background — why per-person personalisation matters

The federation thesis is that models should be owned by the community, not by the platform. The minimum bar for this thesis is that personalisation — the customisation that makes an LLM useful to you specifically — happens on hardware you control, with data that doesn't leave your device. If personalisation requires uploading your data to a centralised training service, the model is no longer "yours" in any meaningful sense.

The classic ML answer to "personalise a model" is fine-tune it. Full fine-tune on a per-user corpus produces a per-user model. But full fine-tune at the typical scale (1B+ parameters) requires:

  • Several minutes of GPU time per user. Acceptable in absolute terms but expensive at fleet scale.
  • A few hundred MB of disk per user. Unacceptable at fleet scale (215k users × 500 MB ≈ 107 TB).
  • A non-trivial training pipeline on the user's device. Acceptable for hobbyists, not for typical Kerala IT@School laptop users.

The bet asks: under more aggressive resource constraints — a 30M-param base, 5 minutes of training, 100 SGD steps, lr 5e-5 — does fine-tuning still produce a useful per-user specialist?

Hypothesis

Five minutes of fine-tuning on user-specific text produces a personal specialist with materially lower perplexity on held-out same-user text, without destroying general capability. The thresholds:

  • Personal-text perplexity drops by ≥ 10×.
  • General-text perplexity rises by < 50%.

These thresholds were chosen with the deployment scenario in mind. A 10× personal-text drop is large enough to be meaningful (the model is noticeably better at predicting the user's writing); a 50% general-text rise is small enough that the model remains useful for non-personal tasks.

Pre-registered criteria

  • STRICT: personal-text ppl drops > 10× while general-text ppl rises < 50%.
  • LENIENT: personal ppl drops > 2× while general rises < 100%.
  • CATASTROPHIC: personal ppl unchanged, or general ppl rises > 5× (would mean the personalisation destroys the base model).
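The three pre-registered outcomes translate directly into a small checker. This is an illustrative sketch, not the harness's actual implementation; the function name and the interpretation of "rises > 5×" as a ratio of after/before perplexity are assumptions.

```python
def verdict(personal_before, personal_after, general_before, general_after):
    """Classify a run against the pre-registered Bet 10 criteria (illustrative)."""
    personal_drop = personal_before / personal_after   # e.g. 114 / 1.04 ~= 110x
    general_ratio = general_after / general_before     # e.g. 111 / 89  ~= 1.25

    # Personalisation did nothing, or the base model was destroyed.
    if personal_drop <= 1.0 or general_ratio > 5.0:
        return "CATASTROPHIC"
    # Personal ppl drops > 10x while general ppl rises < 50%.
    if personal_drop > 10.0 and general_ratio < 1.5:
        return "STRICT"
    # Personal ppl drops > 2x while general ppl rises < 100%.
    if personal_drop > 2.0 and general_ratio < 2.0:
        return "LENIENT"
    return "FAIL"
```

Applied to this bet's numbers, `verdict(114, 1.04, 89, 111)` lands in the STRICT band.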

Setup

The classic Bet-37-style setup, but recorded as Bet 10 for chronological reasons (the catalogue numbers reflect when bets were run, not their dependency order). The bet was actually one of the earliest in the catalogue, predating the norm-only / LoRA / full FT shootout.

  • Model: FractalMoE 30M, programmer-fixture base.
  • Adapter: full fine-tune (~38M params). The norm-only / LoRA primitives weren't yet known to the catalogue at the time of this bet.
  • Optimiser: AdamW, lr 5e-5, 100 steps.
  • Wall-clock: ~5.6 seconds on M1 Max with MPS backend (the bet pre-dated the standardisation on 5-minute training; this run was deliberately shorter to test the floor of feasibility).
  • Personal text: ~3,000 tokens of user-specific content (drawn from the bets harness fixture).
  • Held-out personal text: a separate ~500-token slice from the same user, never seen during training.
  • General text: a held-out general corpus (Common Crawl slice) measured for capability degradation.
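The shape of the harness (measure held-out personal perplexity, take ~100 gradient steps on the personal corpus, measure again) can be sketched with a deliberately tiny stand-in model: a unigram softmax trained by plain SGD instead of FractalMoE with AdamW. Every number here (vocab size, token distributions, the learning rate) is a toy assumption chosen so the loop converges quickly, not the bet's configuration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def perplexity(logits, tokens):
    # ppl = exp(mean negative log-likelihood) over the held-out tokens
    probs = softmax(logits)
    nll = -sum(math.log(probs[t]) for t in tokens) / len(tokens)
    return math.exp(nll)

def sgd_step(logits, tokens, lr):
    # Gradient of mean NLL for a unigram softmax is (softmax - empirical freq).
    probs = softmax(logits)
    freq = [0.0] * len(logits)
    for t in tokens:
        freq[t] += 1.0 / len(tokens)
    return [w - lr * (p - f) for w, p, f in zip(logits, probs, freq)]

random.seed(0)
VOCAB = 32
personal_train = [random.randrange(4) for _ in range(3000)]    # user favours 4 of 32 tokens
personal_heldout = [random.randrange(4) for _ in range(500)]   # same user, never trained on

logits = [0.0] * VOCAB                         # uniform "general" base model
before = perplexity(logits, personal_heldout)  # 32.0: base knows nothing about the user
for _ in range(100):                           # 100 steps, mirroring the bet's budget
    logits = sgd_step(logits, personal_train, lr=1.0)  # toy lr, not the bet's 5e-5
after = perplexity(logits, personal_heldout)
```

Even this toy shows the qualitative effect the bet measures: held-out personal perplexity collapses after a two-digit number of steps because the model adapts to the user's token distribution.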

Result — STRICT PASS

The numbers:

| Metric | Before fine-tune | After fine-tune | Change |
|---|---|---|---|
| Held-out personal ppl | 114 | 1.04 | ~110× drop |
| Held-out general ppl | 89 | 111 | +25% rise |
| Wall-clock to train | — | 5.6 seconds | — |

Personal-text ppl drops by ~110× (well above the 10× STRICT bar). General-text ppl rises by 25% (well below the 50% STRICT bar). Wall-clock is 5.6 seconds — comfortably within "human-tolerable" for an interactive personalisation flow.

This was the bet that made the "specialist per Kerala student" thesis numerically plausible. 215,000 students × 5.6 seconds = ~14 days of total compute, parallelisable across the fleet itself (every student's laptop trains its own adapter). The math fits.

Caveats added by later bets

The headline number — personal ppl 114 → 1.04 in 5.6 seconds — survives, with material sharpening from later bets. Three follow-ups particularly affect the framing:

Bet 43 ran 5 seeds × 3 eval texts to confirm the result wasn't a single-seed accident. STRICT 15/15 PASS. The personalisation effect is replicable across seeds and eval texts.

Bet 60 ran the negative control: train the same adapter on random uniform tokens instead of real text. Random text beat real text for the programmer user; real text won for 2 of the 3 users, by margins of 1.10–1.36×. The framing tightened from "personal text fits the user" to "real text fits the user better than random text does, by a measurable but not-overwhelming margin." Most of what looked like personalisation in this bet was actually regularisation — the adapter is small enough that even training on random tokens produces a held-out predictor that is better than the base model.

Bet 61 ran the personalisation-vs-regularization confusion matrix: train one adapter per user, evaluate every adapter on every user. Own adapter wins by 5–29% margin per user. This is the cleanest evidence that personalisation is real signal, isolated from the regularisation effect. The 110× personal-ppl drop in this bet (Bet 10) is dominantly the regularisation effect; the actual personalisation component is more like the 5–29% margin Bet 61 measures.
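Bet 61's protocol can be sketched as a small helper: given a matrix of held-out perplexities (every adapter evaluated on every user's text), the personalisation margin for a user is how much worse the best foreign adapter does than the user's own. The helper name and the matrix values below are made up for illustration; only the 5–29% margin range comes from the bet.

```python
def personalisation_margins(ppl):
    """ppl[i][j]: held-out perplexity of user j's text under user i's adapter.

    Returns, per user, the fractional margin by which their own adapter
    beats the best foreign adapter (positive = personalisation is real
    signal over and above the shared regularisation effect).
    """
    n = len(ppl)
    margins = []
    for j in range(n):
        own = ppl[j][j]
        best_other = min(ppl[i][j] for i in range(n) if i != j)
        margins.append(best_other / own - 1.0)
    return margins

# Hypothetical 3-user matrix: the diagonal (own adapter) is lowest per column.
example = [
    [10.0, 14.0, 13.0],
    [12.0, 11.0, 14.5],
    [11.5, 13.0, 12.0],
]
```

On the hypothetical matrix above, each user's own adapter wins by a margin inside the 5–29% band Bet 61 reports.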

The corrected framing: this bet's 110× drop is real but mostly explained by full fine-tune being a weak baseline — at longer step budgets it would have overfit the small personal corpus. The personalisation signal is real but smaller. The federation deployment economics still work because the regularisation effect is itself useful: even if it isn't "personal AI", it is an adapter that is a better held-out predictor than the base model on this user's text.

Why this bet uses full fine-tune (in retrospect)

This bet ran full fine-tune because the catalogue hadn't yet developed the norm-only / LoRA / steering primitives. The result is dramatic partly because full FT on a ~3,000-token corpus happens not to overfit catastrophically at this very small step count (100 steps). With 1,000 steps (which Bet 37 used as the baseline), full FT would have collapsed.

The bet's real result, viewed through the lens of later catalogue work, is "personalisation is feasible at the 5-second / 100-step budget, even with a primitive (full FT) that overfits at longer budgets." This is a weaker claim than "full FT works"; it's a claim about the boundary between underfit and overfit, where personalisation is captured before catastrophic overfit takes over.

For production deployment, the right primitive is norm-only (Bet 49 production default), not full FT. The federation's per-user-specialist story is now framed in terms of norm-only adapters at 9 KB, not in terms of full fine-tune. Bet 10's headline survives as "5 seconds of training is enough to capture the personalisation signal", regardless of which primitive is used.

What this enables in deployment

Three concrete production implications:

  1. Per-user adapter training is interactive-fast. A user can train their own adapter in 5 seconds, see the result, and iterate. This makes personalisation a UX-first feature rather than a server-side batch job.

  2. Fleet-scale personalisation is tractable. 215,000 Kerala IT@School students × 5.6 seconds ≈ 14 days of total training time, distributed across the fleet itself. Each student's laptop trains its own adapter; no central training infrastructure needed.

  3. Adapter staleness has a low refresh cost. If a user's distribution drifts over time (new vocabulary, new interests, new writing style), retraining the adapter is a 5-second operation, not a multi-hour pipeline. This lets the federation refresh adapters frequently — e.g., daily — without significant resource cost.

What this leaves open

  • 1B+ scale. Bet 10's 5-second training is at 30M params. At 1B+ params, the per-step compute is ~30× larger. 5 seconds becomes 2.5 minutes. Still tractable, but the claim shifts from "interactive-fast" to "few-minute background job." Open work for the 1B+ scale question.
  • Heterogeneous corpora. The bet used a single user's clean prose corpus. Real users send code, emojis, URLs, broken grammar, multilingual mix. Whether 5 seconds is enough on noisy real-user data is the Kerala IT@School pilot's main empirical question.
  • Catastrophic forgetting under long-running personalisation. If an adapter is retrained daily over a year, does the base model's general capability degrade in ways the bet's "general ppl rises < 50%" check doesn't catch? Bet 29 (sequential personalisation) is a partial answer; full multi-month behaviour is open.
  • Adversarial corpora. A user whose corpus contains adversarial prompts (e.g., trying to weaponise the adapter as a backdoor) is untested. The federation's threat model needs to address this; this bet doesn't.

Run command

PYTHONPATH=src python -m experiments.bets.10_specialist_per_person

Output: experiments/bets/results/10_specialist_per_person.json records the personal-ppl trajectory across training steps, the general-ppl measurements before and after, and the wall-clock timings.
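A minimal sketch of summarising one such record into the bet's headline numbers. The field names here (`personal_ppl_trajectory`, `general_ppl`, `wall_clock_s`) are assumptions about the JSON schema, not the harness's documented format, and the sample values simply echo the table above.

```python
def summarise(record):
    """Reduce a bet-10-style result record to its headline metrics.

    Field names are illustrative assumptions, not the documented schema.
    """
    traj = record["personal_ppl_trajectory"]
    drop = traj[0] / traj[-1]                                    # first vs last step
    rise = record["general_ppl"]["after"] / record["general_ppl"]["before"] - 1.0
    return {"personal_drop_x": drop,
            "general_rise_pct": 100.0 * rise,
            "seconds": record["wall_clock_s"]}

# In practice the record would come from json.load() on the results file;
# this sample mirrors the numbers in the results table.
sample = {"personal_ppl_trajectory": [114.0, 30.0, 4.0, 1.04],
          "general_ppl": {"before": 89.0, "after": 111.0},
          "wall_clock_s": 5.6}
```

Run on the sample, this recovers the ~110× personal drop and the +25% general rise reported above.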

Related bets

  • Bet 37: norm-only fine-tune. The production-default primitive for per-user adapters.
  • Bet 43: 15/15 replication of this bet across seeds and eval texts.
  • Bet 60: noise-floor negative control. Most of the 110× drop is regularisation, not personalisation.
  • Bet 61: confusion matrix. Isolated personalisation signal at 5–29% per user.
  • Bet 49: adapter shootout. Norm-only Pareto-dominates full FT at 5-minute budget.
  • Bet 28: sample efficiency curve. How much training data is enough?
  • Bet 29: sequential personalisation. What happens when a user's adapter is retrained over time?

Why it matters

This bet gates the per-person federation thesis. If specialist-per-person had not worked at the 30M scale in seconds-to-minutes wall-clock, RFC-0006 would not have moved past concept. Because it did work — even with a primitive (full FT) the catalogue later replaced for production reasons — the rest of the wire-format and adapter chapters are about how to deploy it economically, not whether to.

The open question — does this hold at 1B+ on real-user data? — is in the Open Questions chapter. Until that's answered, the catalogue's strongest personalisation claim is "feasible at small scale on clean prose, with the personalisation signal isolated by Bet 61 at 5–29% per user." That's a calibrated claim, not a marketing claim. The federation deployment story is built on it.