Bet 53 — Adapter-of-adapter compression (BORDERLINE)

The norm-only per-user adapter is already 9 KB. The federation's deployment math (Bet 52) treats this as already-cheap, but the obvious next question is: can we make it smaller? Quantising a 9 KB FP16 adapter to int8 would give a ~3 KB adapter — saving 6 KB per user, which is 1.3 GB across a 215k-student Kerala fleet. Worth it if the resulting adapter still works.

The bet says: maybe. The result is borderline. int8 quantisation of the norm-only adapter stays within 1.5× of raw on 2 of 3 users (a LENIENT pass), but measured against the no-adapter baseline, the int8 adapter is worse than using no adapter at all on two of the three users. The quantisation noise overwhelms the per-user signal in those cases. The federation does not blind-quantise per-user adapters; if a deployment wants the bandwidth savings, it has to validate per-user.

This bet matters because it's the discipline check on the wire-format chapter. Bet 52 showed that two compressions can compose; Bet 53 shows that a third compression on top of the stack is not always safe. The catalogue contains both — the composition that works (Bet 52) and the borderline one (Bet 53) — so the deployment story doesn't accidentally inherit a too-aggressive default.

Background — why adapter quantisation might or might not work

The norm-only adapter has ~2300 parameters in fp16, totalling 9 KB. The parameters are scalar gains on RMSNorm weights — small numbers (most around 1.0 ± 0.1) that are tuned per user.

The case for quantising:

  • 9 KB → 3 KB is a 3× reduction. For per-user adapters at fleet scale (215k users), this is 1.3 GB of total savings. Worth it if quality holds.
  • int8 quantisation of small parameter sets is generally well-tolerated. The 8-bit adapter and QLoRA literature demonstrates low-bit quantisation working for adapter-scale weights in many regimes.
  • The norm-only adapter's parameters don't span many orders of magnitude (most are near 1.0), so the dynamic range is small and int8 quantisation should preserve enough precision.

The case against:

  • The norm gains are tuning a small per-channel multiplier; small precision loss per gain might compound across the model into noticeable behavioural change.
  • The user-specific signal in each gain is potentially small (most of the work is being done by the base model; the adapter is fine-tuning at the margin). Quantisation noise that's small in absolute terms might be large relative to the per-user signal.
  • The composition with ternary base (Bet 52's wire format) means the model already has substantial quantisation noise from the base. Adding adapter quantisation noise on top might push past a threshold where the per-user signal disappears.

The bet measures which case wins.

Hypothesis

int8 quantisation of the norm-only adapter preserves its per-user perplexity benefit within 1.5× of the raw adapter.

The 1.5× framing is the threshold for "useful but degraded." If int8 stays within 1.5× of raw, deployers can choose to ship int8 in bandwidth-constrained deployments at the cost of a modest quality hit. If int8 falls outside 1.5×, the savings aren't worth the quality cost.

Pre-registered criteria

  • STRICT: int8 adapter ppl ≤ 1.5× raw adapter ppl on ≥ 3 of 3 users.
  • LENIENT: ≤ 2× on ≥ 2 of 3 users.
  • CATASTROPHIC: int8 adapter performs worse than no adapter on ≥ 2 of 3 users.

The CATASTROPHIC bar is interesting here — it's the case where the federation's "compress the adapter" strategy is actively destructive (the user would be better off with no adapter at all than the int8-compressed adapter). One user falling into this regime is bad but tolerable; two would mean the strategy is unviable.
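The three bars can be written as a small checker. The perplexities in the example below are hypothetical, chosen only to show that a LENIENT pass and a CATASTROPHIC trigger can fire on the same data:

```python
def classify(results):
    """Apply the pre-registered STRICT / LENIENT / CATASTROPHIC bars.

    results: {user: {"none": ppl, "raw": ppl, "int8": ppl}}
    """
    ratios = [r["int8"] / r["raw"] for r in results.values()]
    return {
        # int8 within 1.5x of raw on all 3 users
        "STRICT": sum(x <= 1.5 for x in ratios) >= 3,
        # int8 within 2x of raw on at least 2 users
        "LENIENT": sum(x <= 2.0 for x in ratios) >= 2,
        # int8 worse than no adapter at all on at least 2 users
        "CATASTROPHIC": sum(r["int8"] > r["none"] for r in results.values()) >= 2,
    }

verdict = classify({
    "user_a": {"none": 150, "raw": 100, "int8": 140},  # ratio 1.4, still beats no-adapter
    "user_b": {"none": 150, "raw": 100, "int8": 180},  # ratio 1.8, worse than no-adapter
    "user_c": {"none": 150, "raw": 100, "int8": 220},  # ratio 2.2, worse than no-adapter
})
# → {"STRICT": False, "LENIENT": True, "CATASTROPHIC": True}
```

The point of the hypothetical: LENIENT looks only at ratios against raw, so it can pass while two users are simultaneously below the no-adapter floor.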

Setup

  • Three users (programmer / novelist / scientist), each with a norm-only adapter trained per Bet 49.
  • For each user, compute three conditions:
    • No adapter: base model only, perplexity on held-out same-user text.
    • Raw adapter (FP16): norm-only adapter ships in fp16 (9 KB), perplexity on held-out.
    • int8 adapter: norm-only adapter quantised to int8 (per-tensor scale factor + int8 values, ~3 KB total), perplexity on held-out.
  • Held-out evaluation: 1000 tokens of same-user text not seen during adapter training.

The int8 quantisation uses a standard per-tensor symmetric quantisation: compute the maximum absolute value across the tensor, divide by 127 to get the scale factor, quantise each value to int8 by round(value / scale), dequantise on use by int8_value * scale. This is the simplest int8 scheme; more sophisticated schemes (per-channel quantisation, asymmetric quantisation) might do better but aren't tested here.
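That scheme, sketched in NumPy (per-tensor symmetric: one scale for the whole tensor, values in [-127, 127]; the round-trip error per gain is bounded by half the quantisation step):

```python
import numpy as np

def int8_quantize(t: np.ndarray):
    """Per-tensor symmetric int8: scale = max|t| / 127."""
    scale = np.abs(t).max() / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Norm gains cluster near 1.0 (see Background); the spread here is assumed.
gains = 1.0 + 0.1 * np.random.default_rng(0).standard_normal(2300)
q, scale = int8_quantize(gains)
err = np.abs(int8_dequantize(q, scale) - gains).max()
assert err <= scale / 2 + 1e-5  # worst case is half an LSB per gain
```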

Result — BORDERLINE PASS (LENIENT, not STRICT)

| User | No-adapter ppl | Raw adapter ppl | int8 adapter ppl | int8 / raw ratio | int8 vs no-adapter |
|---|---|---|---|---|---|
| programmer | (high, thousands) | 6,922 | 8,100 | 1.17× | better than no-adapter |
| novelist | 113 | 113 | 156 | 1.38× | worse than no-adapter |
| scientist | 130 | 130 | 218 | 1.68× | worse than no-adapter |

A note on reading the table: the No-adapter ppl column is the base-only condition (no per-user adapter), the same baseline used in Bet 49 and Bet 52; the Raw adapter ppl column is the per-user adapter. The novelist's no-adapter and raw-adapter perplexities being equal suggests that user's distribution sits close to the base model, so the raw adapter contributes little — consistent with Bet 60's noise-floor finding that some users see less personalisation benefit than others.

Per-user detail:

  • programmer: no-adapter ppl is high (multiple thousands); raw adapter brings it to 6,922; int8 adapter to 8,100. int8 ratio 1.17×, both still better than no-adapter.
  • novelist: raw and no-adapter are similar (~113); int8 brings ppl to 156. int8 ratio 1.38× (vs raw); int8 is worse than no-adapter by ~38%.
  • scientist: raw is 130 (slight improvement over no-adapter); int8 is 218. int8 ratio 1.68× (vs raw); int8 is worse than no-adapter by ~68%.

The cleaner reading: int8 adapter is worse than no-adapter on 2 of 3 users (novelist, scientist). This crosses the CATASTROPHIC threshold.

Reading the original bet writeup, the catalogue's framing was "borderline pass on LENIENT, fails STRICT": int8 stays within 1.5× of raw on 2 of 3 users (programmer, novelist), so the scientist's 1.68× fails the STRICT bar — and against the no-adapter baseline, both the novelist and the scientist regress past it, which is the CATASTROPHIC trigger.

The honest summary: int8 adapter quantisation is risky and shouldn't be a default. It works for some users, fails for others, and the failure mode (worse than no-adapter) is a deployment hazard. The per-user variance is large enough that a global "always int8" policy would actively harm a fraction of users.

Why it failed for the scientist user

The scientist's per-user signal (the gap between raw-adapter and no-adapter perplexity) is small to begin with: the raw adapter moves perplexity from roughly 130 to roughly 130 (at best slightly below no-adapter; the original bet's absolute numbers are not perfectly consistent on this point). The personalisation signal is real but small, consistent with Bet 61's finding that own-adapter wins by 5–29% on average.

When the personalisation signal is small, even a small amount of quantisation noise can overwhelm it. int8 quantisation introduces noise on the order of 1 LSB per scalar gain, which compounds across the ~2300 gains. The total noise in the adapter's effect on the residual stream is enough to push the model's output away from the user's preferred distribution by more than the raw adapter was pulling it toward.
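A back-of-envelope version of that magnitude argument. The spread of the gains and the size of the per-user delta below are illustrative assumptions, not measured values from the bet:

```python
import numpy as np

rng = np.random.default_rng(0)
gains = 1.0 + 0.1 * rng.standard_normal(2300)  # assumed: gains ~ 1.0 +/- 0.1

# Per-tensor int8: the quantisation step is set by the largest |gain|,
# which sits near 1.3 for this spread -- so the step is about 0.01.
lsb = np.abs(gains).max() / 127.0
worst_rounding_error = lsb / 2  # about 0.005 per gain

# If a user's personalisation shifts each gain by only ~0.005-0.02,
# the rounding error is the same order of magnitude as the signal itself.
```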

For users with stronger per-user signals (like the programmer, whose raw adapter cuts perplexity by roughly 65% relative to no-adapter), int8 noise is small relative to the signal, and the quantisation is tolerable.

This pattern — quantisation noise overwhelms small signals — is a generic finding in the quantisation literature. The federation's specific failure mode here is that per-user signal varies by user, and the variance is large enough that some users fall on the wrong side of the noise floor.

What this rules out

  • Blind compression of per-user adapters. Quantising every user's adapter to int8 saves 6 KB per user (1.3 GB across a 215k fleet). Worth it only if the resulting adapter remains useful — which Bet 53 says it isn't for 2 of 3 users in this experiment, and a real 215k-user fleet will show more per-user variance than a three-user fixture, not less.
  • int8 as a per-user-adapter default. The per-user adapter ships raw FP16 (9 KB) by default. int8 is opt-in per user, after validation.

Production rule

The federation does not quantise per-user adapters by default. Adapters ship raw (FP16) at 9 KB. If a deployment is bandwidth-constrained enough that 9 KB per user matters, it should:

  1. Run Bet 53's protocol on its own user fixtures to see which users tolerate compression.
  2. Quantise only the adapters of users who tolerate it (i.e. int8 ppl within 1.5× of raw ppl and below the no-adapter baseline).
  3. Keep raw adapters for users where quantisation regresses below the no-adapter baseline.
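The gate in steps 2–3 collapses to a one-liner per user. A sketch — `shipping_format` is a name invented here, not part of the federation's code:

```python
def shipping_format(ppl_none: float, ppl_raw: float, ppl_int8: float) -> str:
    """Per-user gate from the rule above: ship int8 only when it stays
    within 1.5x of the raw adapter AND still beats the no-adapter baseline."""
    if ppl_int8 <= 1.5 * ppl_raw and ppl_int8 < ppl_none:
        return "int8"   # validated: take the bandwidth saving
    return "fp16"       # default: ship the raw adapter

shipping_format(130, 130, 218)  # scientist's numbers → "fp16"
```

On the bet's own fixtures, only the programmer passes this gate; the novelist and scientist stay on the raw adapter.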

This is a per-user policy decision, not a global wire-format decision. The catalogue exposes the bet's protocol so deployers can validate their own users; it does not bake int8 into the production defaults.

What this leaves open

  • Better adapter quantisation schemes. Per-channel quantisation, asymmetric quantisation, or k-means quantisation might recover more of the raw adapter's signal at int8. Untested.
  • int4 with calibration. Aggressive 4-bit quantisation with proper calibration on each user's training data might work better than the bet's uncalibrated int8. Untested. Plausibly worse.
  • Adapter compression by exploiting structure. The norm-only adapter's parameters are not arbitrary; they're scalar gains near 1.0. A delta-encoding (store only the offset from 1.0, with finer precision near zero) might compress better than uniform int8. Untested.
  • Mixed-precision adapters. Some channels might quantise well; others might not. A mixed-precision scheme (some int8, some FP16) could capture most of the savings without the failure mode. Untested.
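Of these, the delta-encoding idea is the easiest to sketch: quantise the offset from 1.0 rather than the gain itself, so the int8 scale is set by the spread of the offsets rather than by the ~1.0 magnitude of the gains. Untested in the bet; this is a sketch of the idea only:

```python
import numpy as np

def delta_int8_quantize(gains: np.ndarray):
    """Quantise the offset from 1.0; the scale shrinks with the spread."""
    delta = gains - 1.0
    scale = np.abs(delta).max() / 127.0
    return np.round(delta / scale).astype(np.int8), scale

def delta_int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return 1.0 + q.astype(np.float32) * scale

# For gains near 1.0 +/- 0.1, max|delta| is a few tenths while max|gain|
# is ~1.3, so the delta scale is several times finer than the direct
# per-tensor scale, and the round-trip error shrinks by the same factor.
```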

These are all candidate research directions for future bets if adapter compression becomes deployment-critical. The federation's current default doesn't need them.

Run command

PYTHONPATH=src python -m experiments.bets.53_adapter_compress

Output: experiments/bets/results/53_adapter_compress.json records the per-user perplexities under all three conditions (no-adapter, raw adapter, int8 adapter), the int8 quantisation parameters (per-tensor scales), and the held-out text identifiers.

Related bets


  • Bet 49: norm-only adapter shootout. The raw adapter that this bet tries to compress.
  • Bet 52: ternary base + norm-only composition. The successful composition that contrasts with this borderline result.
  • Bet 13: 1.58-bit ternary PTQ for base model. Quantisation of base weights, where the high-magnitude structure makes quantisation more tolerable.
  • Bet 60: noise-floor control. Different angle on the "personalisation signal vs noise" question.

Why it matters

This bet is the discipline check on the wire-format chapter. Bet 52 showed that two compressions can compose; Bet 53 shows that they don't always. The production wire format is ternary base + raw norm-only adapter, with adapter-quantisation as an optional per-user policy and not a global default. The catalogue contains both the composition that works (Bet 52) and the one that's borderline (Bet 53), so the deployment story doesn't accidentally inherit a too-aggressive default.

The methodological lesson is the value of CATASTROPHIC-criteria framing. The bet's LENIENT bar (≤ 2× on ≥ 2 of 3 users) would have read as a pass; the CATASTROPHIC bar (int8 worse than no-adapter on ≥ 2 users) was the trigger that caught the deployment hazard. Without it, int8 might have shipped as a default and degraded a large fraction of users without anyone noticing for a while. The triple-tier criteria (STRICT/LENIENT/CATASTROPHIC) are not redundant; the CATASTROPHIC tier exists specifically to catch failure modes that LENIENT misses.

For the broader catalogue: the contrast between Bet 52 (clean composition) and Bet 53 (borderline composition) is the empirical case for not assuming compositions work. Each compression layer has to be validated against the rest of the stack. The federation's wire format is what it is because each piece passed its respective bet and the compositions passed their bets; an aggressive deployment that wanted further compression would need to run additional bets to validate the next layer.