Bet 55 — Multi-adapter logit ensemble (FALSIFIED)

A federation has many per-user adapters. Suppose a new, unknown user shows up — someone who doesn't have their own adapter yet. Can we ensemble across all the existing per-user adapters' logits and produce a federation-level output that's better than the no-adapter base model? If yes, the federation has a "general fallback" mechanism that exploits all the personalisation work done so far. If no, the federation needs a different strategy for unknown users.

The bet is falsified: the ensemble of per-user adapters is slightly worse than the no-adapter base on an unknown user. The personalisation done for known users does not aggregate into a useful general adapter via output-space ensembling. The federation needs routing to the right per-user adapter; a federation-wide ensemble cannot substitute for routing.

Background — why this question matters

The federation's per-user-adapter model has a deployment edge case: a user who hasn't trained their own adapter yet. When such a user makes a query, the federation has three options:

  1. Use the base model only. Acceptable but doesn't exploit any of the personalisation infrastructure.
  2. Use a "general" adapter trained on a balanced cross-user corpus. Plausible but requires extra training; the question is whether such a general adapter is useful.
  3. Ensemble across all known users' adapters. The federation already has hundreds or thousands of adapters. Can their joint output handle an unknown user better than the base?

Option 3 is the bet's hypothesis. If it works, the federation gets a "free" general fallback — no extra training, just an ensembling computation at inference time. The mechanism would be the same Bet 04 mixture combiner used for combining specialists:

ensembled_log_prob[token] = logsumexp(per_adapter_log_prob[token]) − log(N)

Each adapter contributes its probability for the next token; the joint output is the log of the mean probability. The mean of N probabilities is never larger than the maximum, so the ensemble is bounded above by the best individual adapter on any given token; it is also never smaller than the maximum divided by N, so the ensemble sits within log(N) of that best adapter. (Jensen's inequality separately guarantees the ensemble is no worse than the average of the individual log-probabilities.) If at least one per-user adapter happens to be a useful predictor for the unknown user's text, the ensemble will be nearly as good as that one.
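Concretely, the combiner is a uniform-weight mixture over the adapters' next-token probabilities, computed in log space. A minimal stdlib sketch (the function name is illustrative, not the experiment's API):

```python
import math

def ensemble_log_prob(per_adapter_log_probs):
    # Uniform-weight mixture in probability space:
    # log((1/N) * sum_i exp(lp_i)) = logsumexp(lp) - log(N)
    n = len(per_adapter_log_probs)
    m = max(per_adapter_log_probs)  # subtract the max to stabilise the exponentials
    return m + math.log(sum(math.exp(lp - m) for lp in per_adapter_log_probs)) - math.log(n)

# One strong adapter (log-prob -1.0) and two weak ones: the ensemble
# lands between best - log(3) and best.
lp = ensemble_log_prob([-1.0, -8.0, -9.0])
assert -1.0 - math.log(3) <= lp <= -1.0
```

In the real evaluation this runs over full vocabulary distributions per token; the scalar version above just shows the shape of the computation.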

Hypothesis

For an unknown user's held-out text, the ensemble logsumexp(per_adapter_logits) − log(N) produces lower perplexity than the no-adapter base model.

Pre-registered criteria

  • STRICT: ensembled-adapter ppl < base-only ppl by ≥ 5% on unknown user.
  • LENIENT: ensembled ≤ base-only (i.e. ensemble at least matches base).
  • CATASTROPHIC: ensembled > base-only (ensembling makes things worse than no adapter at all).

The CATASTROPHIC bar would mean the federation's adapter ensemble is actively harmful for unknown users — worse than just using the base. That would be a hard signal that ensembling per-user adapters is the wrong primitive.
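The three bars reduce to a small classifier over the two perplexities. A hedged sketch (the function name and the way the 5% threshold is encoded are illustrative, not the experiment code):

```python
def classify_outcome(base_ppl, ensembled_ppl):
    # Pre-registered bars from the bet, expressed as thresholds.
    if ensembled_ppl < 0.95 * base_ppl:
        return "STRICT"        # ensemble beats base by >= 5%
    if ensembled_ppl <= base_ppl:
        return "LENIENT"       # ensemble at least matches base
    return "CATASTROPHIC"      # ensemble worse than no adapter at all

# The observed unknown-user numbers land on the worst bar:
classify_outcome(1824, 1917)  # -> "CATASTROPHIC"
```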

Setup

  • Three per-user adapters (programmer / novelist / scientist), each fine-tuned via norm-only on the corresponding user's training text.
  • One held-out "unknown user" — held-out text from a fourth user (a journalist-style fixture not used to train any of the three adapters).
  • For each token of the unknown-user eval text:
    • Run the base model and each of the three adapter-augmented models, producing four log-probability distributions.
    • Compute ensembled log-prob = logsumexp(adapter1, adapter2, adapter3) − log(3).
    • Compare to base-only log-prob.
  • Eval text: 1000 tokens of the journalist-style held-out text.

The bet is also evaluated on each known user's held-out text as a sanity check — the known user's own adapter should win against the ensemble (if it doesn't, the per-user adapter design is broken).
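The per-token procedure above reduces to: mix the adapters' log-probs for each realised token, average over tokens, and exponentiate to get perplexity. A self-contained sketch with toy traces (the numbers are illustrative, not the experiment's actual log-probs):

```python
import math

def mixture_lp(adapter_lps):
    # logsumexp over the adapters' log-probs for the realised token, minus log(N)
    n, m = len(adapter_lps), max(adapter_lps)
    return m + math.log(sum(math.exp(x - m) for x in adapter_lps)) - math.log(n)

def perplexity(token_lps):
    # ppl = exp(-mean per-token log-prob)
    return math.exp(-sum(token_lps) / len(token_lps))

# Toy per-token traces for three adapters over four eval tokens.
trace = [(-2.0, -5.0, -6.0), (-3.0, -4.0, -7.0),
         (-2.5, -6.0, -5.5), (-4.0, -3.5, -6.0)]
base_trace = [-2.8, -3.1, -2.9, -3.0]  # base model's log-probs for the same tokens

ens_ppl = perplexity([mixture_lp(t) for t in trace])
base_ppl = perplexity(base_trace)
assert ens_ppl > base_ppl  # the weak adapters drag the mixture below base here
```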

Result — CATASTROPHIC

| Eval text | Base-only ppl | Ensembled-adapter ppl | Best individual adapter ppl |
|---|---|---|---|
| Unknown user (journalist) | 1,824 | 1,917 | 1,856 |
| Programmer (known) | 6,922 | (own adapter wins) | 4,716 |
| Novelist (known) | 113 | (own adapter wins) | 124 |
| Scientist (known) | 130 | (own adapter wins) | 138 |

The ensemble of adapters is slightly worse than the base on the unknown user. Specifically, 1,917 vs 1,824 — about 5% worse. The CATASTROPHIC bar fires: ensembling makes things worse than no adapter at all. The "best individual adapter" column shows that even the single adapter that happens to fit the unknown user's text best (1,856) doesn't beat base (1,824) — the unknown user's distribution is genuinely far from any of the trained adapters.

For the known-user sanity check, the own adapter wins by a 5–29% margin (consistent with Bet 61's confusion-matrix result). The federation's per-user adapter design is correct for known users; it just doesn't generalise to unknown ones via ensembling.

Why it failed — the average-of-prescriptions analogy

Each per-user adapter is fine-tuned to its specific user's distribution. The adapter's job is to shift the base model's predictions toward what its user prefers. Different users prefer different shifts.

The ensemble averages the per-user shifts. The average of shifts tuned for individual users is not a good shift for an arbitrary new user.

A clean analogy: imagine you have N people, each with prescription glasses tuned to their specific astigmatism. If you average all N prescriptions and give the result to an (N+1)th person, they'll see worse than with no glasses, because the average prescription is tuned to nobody's eyes. The average corrects for an "average astigmatism" that no individual has.

The federation's ensembled-adapter has the same shape. Each adapter shifts the base toward one user's distribution. The average shift moves the base toward the centroid of the trained-user-distributions, which is not the unknown user's distribution. The base-only model, which makes no shift, does slightly better — at least it's neutral.

Mathematically, the logsumexp combiner's guarantee (never above the best individual contributor, and never more than log(N) below it) only helps if at least one contributor is good. For the unknown user, no individual adapter is genuinely good (the best is still slightly worse than base). The bound is therefore not protective in this case: the ensemble is capped by the best contributor, and the best contributor is itself worse than the alternative of no contribution at all.
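The two-sided bound is easy to verify numerically: for any token, the combiner lands at or below the best contributor's log-prob and within log(N) of it. A quick randomised check (restating the combiner here so the snippet is self-contained):

```python
import math, random

def ensemble_lp(lps):
    # logsumexp(lps) - log(N), i.e. log of the mean probability
    n, m = len(lps), max(lps)
    return m + math.log(sum(math.exp(x - m) for x in lps)) - math.log(n)

random.seed(0)
for _ in range(1000):
    lps = [random.uniform(-10.0, 0.0) for _ in range(3)]
    e = ensemble_lp(lps)
    assert e <= max(lps) + 1e-9                 # never above the best contributor
    assert e >= max(lps) - math.log(3) - 1e-9   # never more than log(3) below it
```

So if the best contributor is worse than base on most tokens, the ensemble necessarily is too.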

Connection to other negative results — Bets 31, 54

This bet is part of a chain of falsifications about combining per-user knowledge:

  • Bet 31: averaging two specialists' weights (linear soup) fails. Specialists don't combine in weight space.
  • Bet 54: averaging multiple users' adapters into a "shared" adapter mostly fails. Adapter averaging across users doesn't produce a useful federation-wide adapter.
  • Bet 55 (this bet): ensembling per-user adapters' outputs for an unknown user fails. Output-space ensembling doesn't substitute for routing either.
  • Bet 61: the personalisation-vs-regularisation confusion matrix shows that the own adapter wins by a 5–29% margin. Routing to the right per-user adapter is required.

The chain's collective lesson: the federation cannot deduplicate across users via averaging, ensembling, or any other "fold N adapters into 1" operation. Each user's adapter is genuinely tuned to that user's distribution; the per-user information doesn't aggregate into a useful general signal. Routing to the right adapter is the only mechanism that captures personalisation.

This is also why the federation's deployment cost story includes "one adapter per user" as an explicit line item. A 9 KB adapter per user × 215k Kerala IT@School users = ~2 GB of total adapter storage, which is small but nonzero. The federation can't compress this into a single shared adapter; it has to ship per-user adapters and route at inference time.
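The storage line item is back-of-envelope arithmetic. A quick check, assuming decimal kilobytes and gigabytes:

```python
adapter_kb = 9            # per-user norm-only adapter size, from the text
users = 215_000           # Kerala IT@School user count, from the text
total_gb = adapter_kb * 1_000 * users / 1e9
assert 1.9 < total_gb < 2.0   # about 1.94 GB, i.e. "~2 GB" in round numbers
```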

What this leaves open

  • A "general" adapter trained on a cross-user corpus. This is distinct from ensembling existing adapters. A general adapter would be trained from scratch on a balanced sample of all users' text — it would be a single adapter that's a useful predictor for any user's text, possibly outperforming the base. Plausible; not yet tested. Different from this bet.
  • Few-shot adapter selection. A short interaction with the unknown user (e.g. their first 50 tokens) could be used to select the closest existing adapter. This is a routing-at-the-margin question — not a substitute for per-user adapters, but a way to bootstrap the unknown-user case before their own adapter is trained. Latency-sensitive but plausible.
  • Hierarchical adapters. A coarse "cluster" adapter per user-cluster, plus a fine per-user adapter. The cluster adapter provides a fallback for unknown users (cluster them by some demographic or behavioural signal); the per-user adapter takes over once one is trained. Untested. Plausibly useful.
  • Bayesian adapter combination. A more sophisticated adapter combination — e.g. weighted by each adapter's log-likelihood on the unknown user's first few tokens — could outperform uniform-weight ensembling. Plausible but requires the few-shot information that this bet's setup doesn't provide.
  • Larger ensemble. This bet ensembled 3 adapters. Real federation deployment might have hundreds or thousands. Whether the law of large numbers makes the ensemble useful at larger N is open. The mechanism (averaging per-user shifts) doesn't get fundamentally better with more users, but the noise might average out differently. Probably not enough to flip the result; worth checking.
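The few-shot adapter-selection idea in the list above amounts to picking the existing adapter with the highest log-likelihood on a short probe of the unknown user's text. Everything in this sketch is hypothetical (the function, the trace format, and the numbers are not from the experiment):

```python
def select_adapter(probe_traces):
    # probe_traces maps adapter name -> list of per-token log-probs that the
    # adapter-augmented model assigned to the unknown user's first tokens.
    # Pick the adapter with the highest total log-likelihood on the probe.
    return max(probe_traces, key=lambda name: sum(probe_traces[name]))

# Toy probe: the "novelist" adapter fits the probe tokens best.
traces = {
    "programmer": [-6.0, -7.5, -8.0],
    "novelist":   [-3.0, -2.5, -4.0],
    "scientist":  [-5.0, -6.0, -5.5],
}
assert select_adapter(traces) == "novelist"
```

This is routing at the margin, not ensembling: it still commits to a single adapter, which is consistent with the bet's conclusion.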

Run command

PYTHONPATH=src python -m experiments.bets.55_logit_ensemble

Output: experiments/bets/results/55_logit_ensemble.json records the base-only and ensembled perplexities for the unknown user, the per-adapter perplexities (for the comparison column), and the per-token log-probability traces for the first 100 tokens (so you can see where the ensemble is losing).

Related bets

  • Bet 04: mixture combiner. The same primitive that works for specialists doesn't work for cross-user adapter ensembling.
  • Bet 31: linear weight-soup falsified. Specialists don't combine in weight space.
  • Bet 54: cross-user adapter averaging mostly fails. The complementary "average instead of ensemble" version of this bet.
  • Bet 61: personalisation-vs-regularisation confusion matrix. The own adapter wins by a 5–29% margin — the positive evidence for routing.

Why it stays in the catalogue

The federation's routing layer is justified by this bet. A reader proposing "just ensemble all the adapters for unknown users" must encounter this falsification first. The ensemble doesn't substitute for routing. Together with Bet 54 (averaging fails) and Bet 61 (own-adapter wins by margin), the case for per-user routing is closed.

The catalogue's broader story about combining knowledge across users tightens with this bet. The federation has a clear pattern now:

  • Combining specialists (Bet 04 mixture combiner): output-space ensembling works because specialists are diverse capabilities and the ensemble is bounded by the best one.
  • Combining per-user adapters (this bet): output-space ensembling fails because per-user adapters are additive shifts toward individual distributions and the average shift isn't toward any user.

The structural difference between specialists and per-user adapters explains why one combines and the other doesn't. The federation's design encodes this distinction: it ensembles specialists at the output level (Bet 04), and it routes per-user adapters by user identity (the catalogue's per-user-adapter primitive).

The bet's falsification is what gives the routing decision its empirical grounding. Without this bet, the federation could plausibly have shipped output-space ensembling for unknown users; with it, the federation knows that approach is actively worse than no adapter at all, and ships routing instead.