Bet 54 — Cross-user adapter averaging

Two users have similar interests. Can their per-user adapters be merged into a single shared adapter that helps both? Average two users' norm-only adapters elementwise and check whether the averaged adapter improves held-out perplexity for both contributors versus their no-adapter baselines. The answer for the federation's regime is mostly no — averaging works for one user pair (novelist + scientist) and fails for two (any pair involving the programmer). The federation cannot deduplicate per-user adapters via averaging.

This bet matters because adapter storage at fleet scale isn't free. A 215k-student Kerala IT@School deployment carries 215k × 9 KB ≈ 1.9 GB of per-user adapters. If users could be clustered and an averaged adapter shipped per cluster, the storage might compress 10×–100×. The bet rules out this naive shortcut: averaging across users is destructive in 2 of 3 pair tests, and the federation must route per-user rather than averaging.

Background — why adapter averaging is tempting

The federation's per-user adapter scheme produces N adapters for N users. Each adapter is small (9 KB), but at fleet scale the total adapter storage is non-trivial:

  • Kerala IT@School fleet: 215k students × 9 KB = 1.9 GB total. Tractable but non-zero.
  • Hypothetical 10M-user federation: 10M × 9 KB = 90 GB. Becomes a real cost.

If users with similar interests (e.g. all medical professionals, all teachers, all coders) could share a single adapter representing their "user cluster," storage costs would compress dramatically. A federation with 100 user clusters and 10M users would need only 100 cluster-adapters × 9 KB = 900 KB total adapter storage. That's a 100,000× reduction.
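The storage arithmetic above can be checked in a few lines (decimal units, matching the figures in the text):

```python
# Back-of-envelope check of the storage figures above (decimal units).
ADAPTER_KB = 9  # size of one norm-only per-user adapter

def fleet_storage_gb(n_users: int, adapter_kb: int = ADAPTER_KB) -> float:
    """Total per-user adapter storage in GB for a fleet of n_users."""
    return n_users * adapter_kb / 1e6  # KB -> GB

print(fleet_storage_gb(215_000))     # 1.935 -> ~1.9 GB, Kerala IT@School
print(fleet_storage_gb(10_000_000))  # 90.0 GB, hypothetical 10M-user fleet
print(100 * ADAPTER_KB, "KB")        # 900 KB for 100 cluster adapters
```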

The simplest cluster-adapter mechanism is averaging: take all users in a cluster, average their per-user adapters elementwise, ship the average to anyone in the cluster. The bet asks whether this works.
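The averaging step itself is trivial. A minimal sketch, assuming each per-user adapter is stored as a dict of parameter arrays (the dict layout and parameter names here are illustrative, not the federation's actual format):

```python
# Sketch of pairwise elementwise adapter averaging.
# Assumes adapters are dicts mapping parameter names to numpy arrays.
import numpy as np

def average_adapters(adapter_a: dict, adapter_b: dict) -> dict:
    """Elementwise mean of two adapters with identical parameter sets/shapes."""
    assert adapter_a.keys() == adapter_b.keys()
    return {name: (adapter_a[name] + adapter_b[name]) / 2.0
            for name in adapter_a}

# Toy norm-only adapters: one scale vector per layer (hypothetical names).
novelist = {"layers.0.norm.scale": np.array([1.1, 0.9, 1.0])}
scientist = {"layers.0.norm.scale": np.array([1.3, 1.1, 0.8])}

merged = average_adapters(novelist, scientist)
print(merged["layers.0.norm.scale"])  # elementwise mean of the two scale vectors
```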

Hypothesis

For at least 2 of 3 user pairs, the pairwise-averaged adapter improves held-out perplexity on both contributors versus their no-adapter baselines.

The "improves on both contributors" framing is the key. An averaged adapter that helps user A but harms user B is destructive — user B would be better off with no adapter than with the averaged one. The bet requires both contributors to benefit.

Pre-registered criteria

  • STRICT: averaged adapter helps both contributors on ≥ 2 of 3 pairs.
  • LENIENT: helps both contributors on ≥ 1 pair.
  • CATASTROPHIC: averaged adapter is worse than no-adapter for both contributors on every pair (would mean adapter averaging is universally destructive).

Setup

  • Three users (programmer / novelist / scientist), each with their own norm-only adapter trained per Bet 49.
  • Three pairwise averages: (programmer + novelist), (programmer + scientist), (novelist + scientist). Each pair: average the two users' adapter parameters elementwise.
  • For each averaged adapter, evaluate on each contributor's held-out same-user text (1000 tokens each).
  • Compare to no-adapter baseline: same eval text, base model only.
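The per-pair comparison reduces to a perplexity check per contributor. A minimal sketch, assuming per-token log-probabilities are available for each configuration (the log-prob arrays below are illustrative stand-ins, not measured values):

```python
# Sketch of the "did the averaged adapter help this contributor?" check.
import math

def perplexity(token_logprobs: list) -> float:
    """exp of the mean negative log-likelihood over the held-out tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Stand-in log-probs for one contributor's 1000-token held-out text.
baseline_lp = [-2.4] * 1000   # base model, no adapter
averaged_lp = [-2.5] * 1000   # base model + averaged adapter

improved = perplexity(averaged_lp) < perplexity(baseline_lp)
print(improved)  # False: in this stand-in case the average is a regression
```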

Pairwise rather than three-way averaging is chosen because pairwise is the simplest case. If pairwise fails, three-way is unlikely to succeed (more averaging = more destructive smoothing).

Result — LENIENT PASS

| Pair | User A's ppl with avg vs no-adapter | User B's ppl with avg vs no-adapter | Both helped? |
|---|---|---|---|
| programmer + novelist | regression | regression | no |
| programmer + scientist | regression | regression | no |
| novelist + scientist | improvement | improvement | yes |

Pairwise averaging helps both contributors in 1 of 3 pairs. The pair that worked (novelist + scientist) shares more linguistic structure than the pairs that failed. Programmer text is structurally different from prose-style users (heavy on identifiers, keywords, structural tokens); averaging programmer's adapter with a prose user's adapter is destructive — the average pulls each user's adapter away from its tuned configuration without producing a useful intermediate.

LENIENT passes (1 pair works); STRICT fails (the bet required 2 of 3 pairs to work). The result is a calibrated "sometimes" rather than a clean "yes" or "no."

Why averaging works for novelist + scientist

The novelist and scientist users both have prose-style training text. The novelist's text is creative writing; the scientist's text is academic prose. Different vocabularies and structures, but the same broad linguistic family. Their adapters are tuned along somewhat similar directions in the parameter space — both adjusting the model toward longer-form prose with consistent grammar.

The average of two same-direction adapters is a third adapter in approximately the same direction, with smaller magnitude. The averaged adapter doesn't capture either user's specific vocabulary as well as their own adapter would, but it captures the broad-prose direction that helps both users somewhat.

Mathematically, the personalisation signals for novelist and scientist are correlated. Correlated signals average constructively — the average preserves the correlation while the noise (the parts that differ) cancels. The result is a useful averaged signal for both contributors.
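A toy numerical illustration of the constructive case (the vectors are synthetic stand-ins, not real adapter parameters): two signals built from a shared direction plus independent noise remain highly aligned with their average.

```python
# Constructive averaging: correlated signals keep their shared component.
import numpy as np

rng = np.random.default_rng(0)
shared = rng.normal(size=100)                    # common "broad prose" direction
novelist = shared + 0.3 * rng.normal(size=100)   # small user-specific deviation
scientist = shared + 0.3 * rng.normal(size=100)

avg = (novelist + scientist) / 2

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Both similarities stay close to 1.0: the shared direction survives averaging.
print(round(cos(avg, novelist), 2), round(cos(avg, scientist), 2))
```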

Why averaging fails when programmer is involved

The programmer user's text is structurally different from prose. Heavy reliance on Python keywords (`def`, `class`, `return`), identifier conventions (snake_case, camelCase), structured tokens (parentheses, colons, semicolons). The programmer's adapter is tuned to bias the model toward these patterns.

Averaging the programmer's adapter with a prose user's adapter produces a third adapter that's pulled in two incompatible directions:

  • Toward programmer-style tokens (the programmer's contribution).
  • Toward prose-style tokens (the prose user's contribution).

The two directions are not just different — they're in many cases opposite. The programmer's adapter increases probability mass on `(`, `:`, and identifiers; the prose user's adapter increases probability mass on `the`, `,`, and prose words. Averaging produces an adapter that does neither well — its bias is somewhere in between, which means the model's output is less coherent for either user than no-adapter at all.

Mathematically, the personalisation signals for programmer and prose-style users are anti-correlated on many channels. Anti-correlated signals average destructively — the average is closer to zero than either signal, and the model's output reflects neither personalisation. Both users get worse output than they would with no adapter.
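A toy illustration of the destructive case (synthetic numbers, not real adapter parameters): averaging two opposing per-channel adjustments cancels both.

```python
# Destructive averaging: anti-correlated signals cancel toward zero.
import numpy as np

programmer = np.array([+0.4, -0.3, +0.5])   # bias toward code-style tokens
prose_user = np.array([-0.4, +0.3, -0.5])   # bias toward prose-style tokens

avg = (programmer + prose_user) / 2
print(avg, float(np.linalg.norm(avg)))  # zero vector: neither bias survives
```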

What this rules out

  • Adapter deduplication via averaging. A federation cannot save adapter storage by clustering similar users and shipping one averaged adapter to the cluster — the average is destructive in 2 of 3 cases for the bet's small sample. At fleet scale, the destruction rate is likely higher (more diverse users → more cross-user direction conflicts).
  • Adapter merge as a cheap compression. Even where it helps (novelist + scientist), the gain over no-adapter is smaller than each user's own adapter would deliver. Averaging is a worse personalisation, not a better one.
  • Naive cross-user knowledge transfer. Averaging is the simplest form of knowledge transfer; the bet falsifies it for the federation's regime.

Connection to the broader cross-user-combination story

This bet is part of the chain of falsifications about combining knowledge across users:

  • Bet 31: linear weight-soup of two specialists. Falsified — averaging in weight space doesn't combine specialists.
  • Bet 54 (this bet): averaging two per-user adapters. LENIENT pass — works only when adapters are correlated.
  • Bet 55: ensembling per-user adapter outputs for an unknown user. Falsified — output-space ensembling also doesn't substitute for routing.
  • Bet 61: confusion-matrix on per-user adapters. Own-adapter wins by a 5–29% margin — the positive evidence for routing.

The chain's collective lesson: the federation cannot cleverly combine per-user information at any layer (weight, adapter parameter, output logit). Each user's contribution is genuinely tuned to that user's distribution; the per-user information doesn't aggregate into a useful general signal via any naive combination. Routing to the right per-user adapter is the only mechanism that captures personalisation reliably.

What this leaves open

  • Hierarchical clustering. A federation that ships per-cluster adapters (where the cluster is small and linguistically homogeneous) might still work. Bet 54 only tested pairwise; cluster averaging across users in the same linguistic family (e.g. all programmer-style users) is open. Plausibly works for tight clusters; unclear at what cluster size it breaks down.
  • Cross-user knowledge transfer via better methods. Averaging is a crude form of transfer. Better approaches — factorisation (decompose adapters into "shared" and "user-specific" components), multi-task fine-tuning across user clusters, mixture-of-adapters per user — remain open research directions. The federation hasn't ruled these out; it's only ruled out the simplest version.
  • Adapter-of-cluster-mean. Instead of averaging post-hoc, train a single adapter on a cluster of users' combined training data. This is structurally different from averaging — the optimiser sees all users at once and can learn a representation that benefits all. Plausibly works for tight clusters. Open.
  • Bayesian adapter combination. Weight each user's adapter by their similarity to a target user. Less crude than uniform averaging. Untested.
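The similarity-weighted variant in the last bullet might look like the sketch below. This is untested and hypothetical: `weighted_merge`, the user embeddings, and all parameter names are illustrative assumptions, not part of the federation's code.

```python
# Hypothetical similarity-weighted adapter combination (untested direction).
# Each contributor's adapter is weighted by the cosine similarity between
# that user's text embedding and the target user's embedding.
import numpy as np

def weighted_merge(adapters: list, user_embs: list, target_emb) -> dict:
    """Merge adapters weighted by each contributor's similarity to the target."""
    sims = np.array([float(e @ target_emb /
                           (np.linalg.norm(e) * np.linalg.norm(target_emb)))
                     for e in user_embs])
    w = np.clip(sims, 0.0, None)   # drop dissimilar (negative-sim) contributors
    w = w / w.sum()                # normalise the remaining weights
    return {name: sum(wi * a[name] for wi, a in zip(w, adapters))
            for name in adapters[0]}

# Toy data: two contributors and a target user close to the first.
adapters = [{"scale": np.array([1.2, 0.8])}, {"scale": np.array([0.9, 1.1])}]
embs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
target = np.array([1.0, 0.2])

merged = weighted_merge(adapters, embs, target)
print(merged["scale"])  # dominated by the first (more similar) contributor
```

Unlike uniform averaging, a dissimilar contributor's pull on the merge shrinks with its weight — though whether that avoids the destructive regime above is exactly the open question.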

These are all candidate research directions if adapter storage becomes deployment-critical. For the federation's current scale (Kerala IT@School at 1.9 GB of total adapter storage), per-user routing is operationally feasible without compression.

Run command

PYTHONPATH=src python -m experiments.bets.54_adapter_averaging

Output: experiments/bets/results/54_adapter_averaging.json records the per-pair averaged adapters' perplexities for both contributors, the no-adapter baselines, and the per-user own-adapter perplexities (for comparison).

See also

  • Bet 31: linear weight-soup falsified. Same lesson at the specialist level.
  • Bet 55: multi-adapter logit ensemble. Output-space combination — also falsified.
  • Bet 61: personalisation-vs-regularisation confusion matrix. Own-adapter wins by a 5–29% margin.
  • Bet 49: norm-only adapter shootout. The per-user adapter that this bet tries (and mostly fails) to combine across users.

Why it matters

The federation routes per-user. This bet is the evidence that routing is necessary — averaging doesn't substitute for it. Combined with Bet 61's confusion matrix, the conclusion is unambiguous: each user gets their own adapter, the federation routes inference to that specific adapter, and no shared-adapter shortcut delivers comparable quality.

The deployment cost story is therefore "one adapter per user, route at inference time." That's an explicit per-user storage line item in the federation's economics. For Kerala IT@School scale, the cost is 1.9 GB total — small enough to fit on any modern infrastructure. For larger federations (10M+ users), adapter storage starts to matter, and the open work on hierarchical clustering or better cross-user combination becomes more important.

The methodological lesson: negative results in the cross-user-combination space are load-bearing. The federation's design relies on routing being necessary; the bet provides the empirical case that routing is not optional. Without this bet (and Bet 31, Bet 55), the catalogue would have to argue that averaging might work but the federation just chose not to use it. With these bets, the catalogue can say averaging doesn't work for the federation's regime, and the federation's design follows from the empirical reality.

The 1-of-3 pass also leaves a small but interesting opening: tight clusters of structurally-similar users can share an adapter without destruction. This is the seed of a future federation feature — cluster-routing where users are first matched to a cluster (a lightweight classification on their first few tokens) and then to a specific adapter within the cluster. Whether this is worth building depends on whether per-user adapter storage becomes a real bottleneck. For now, per-user routing is the production answer.