Bet 52 — Ternary base + norm-only adapter (production wire format)
This is the composition test that locks in the federation's production wire format. The catalogue has two cheap compressions that pass their respective bets — Bet 13's ternary post-training quantisation of the base model (1.58-bits-per-weight, ~10× size reduction) and Bet 49's norm-only per-user adapter (9 KB per user, dominates LoRA-r4 and full FT in the 5-minute training budget). Each works in isolation. Bet 52 asks whether they compose: can a user train a norm-only adapter on top of a ternary-quantised base and still get a per-user perplexity benefit?
The result is a clean STRICT 3/3 PASS. Norm-only adapters on top of ternary base recover most of the perplexity loss from quantisation, with material per-user improvements (56–65% across three users) that match or exceed the per-user benefits seen on the FP16 base. The composition is not destructive; the two compressions are orthogonal in the relevant sense. This locks in the federation's production wire format: 6 MB ternary base + 9 KB norm-only adapter per user, total ≈6.009 MB per personalised inference target.
Background — why composition isn't free
When two compressions each pass their individual bets, it's tempting to assume they compose. They often don't.
The general worry: each compression introduces some kind of "damage" to the model's expressivity. If the damages are correlated — both compressions degrade the same aspects of the model — the composition compounds the damage worse than either alone. If the damages are orthogonal — each compression degrades different aspects — the composition is closer to additive.
The federation's catalogue has a clean example of non-composition: Bet 31 (linear weight-soup) shows that two independently-trained specialists' weights cannot be averaged usefully. The two specialists individually work; the average doesn't. Compositions are an empirical question.
For ternary + norm-only specifically, the worry has a concrete shape. Ternary quantisation maps each weight to one of {−1, 0, +1} (with a shared magnitude scaling factor, rather than anything stored per weight); the model's expressivity is constrained to the 1.58-bit-per-weight subspace. Norm-only adapters introduce a small set of per-channel scaling factors on the residual stream's norms; the user-specific adjustments are scalar multipliers, not arbitrary directions in weight space.
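For concreteness, here is a minimal sketch of the ternary side of that picture, assuming an absmean-style shared scale per weight tensor (the exact scaling recipe from Bet 13 isn't restated here; function names are illustrative, not the repo's API):

```python
import torch

def ternary_quantise(w: torch.Tensor, eps: float = 1e-8):
    """Map a weight tensor to {-1, 0, +1} plus one shared magnitude scale."""
    scale = w.abs().mean().clamp(min=eps)      # shared magnitude for the whole tensor
    q = (w / scale).round().clamp(-1, 1)       # each weight collapses to -1, 0, or +1
    return q, scale

def dequantise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """The lossy reconstruction the quantised model actually computes with."""
    return q * scale
```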
The question: do the norm scalars have enough degrees of freedom to compensate for the user-specific damage that quantisation introduced, given that the underlying weights are now ternary? Or does the quantisation's coarseness force the norms to fight uphill against round-off noise that overwhelms the user signal?
The bet measures this directly.
Hypothesis
After ternary PTQ of the base model (≈10× size reduction), the norm-only adapter still gives a material perplexity improvement on top.
The "material" framing is deliberate. We're not asking whether the norm adapter is as good as it would be on an FP16 base — that comparison is a different question (Bet 49 already established norm-only's value on FP16 base; this bet is about whether the value survives quantisation). We're asking whether the norm adapter still produces a useful improvement over the ternary-base-only condition.
Pre-registered criteria
- STRICT: norm-only adapter on top of ternary base improves ppl by ≥ 30% on ≥ 3 of 3 users.
- LENIENT: improvement ≥ 15% on 2 of 3 users.
- CATASTROPHIC: norm-only adapter ineffective on ternary base (would indicate the two compressions are not orthogonal — i.e. the damage compounds).
The 30% bar was chosen as the threshold for "the adapter is doing real work." A smaller improvement might be regularisation noise rather than personalisation; 30% is comfortably above what we'd expect from noise alone.
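A minimal sketch of the pre-registered decision rule, assuming improvement is computed as the fractional perplexity reduction relative to the ternary-base-only condition; the results dict is a hypothetical structure, not the repo's output format:

```python
def improvement(base_ppl: float, adapted_ppl: float) -> float:
    """Fractional ppl improvement of ternary+adapter over ternary-base-only."""
    return 1.0 - adapted_ppl / base_ppl

def verdict(ppl_by_user: dict) -> str:
    """ppl_by_user maps user -> (ternary-base ppl, ternary+norm-only ppl)."""
    gains = [improvement(base, adapted) for base, adapted in ppl_by_user.values()]
    if sum(g >= 0.30 for g in gains) >= 3:
        return "STRICT PASS"
    if sum(g >= 0.15 for g in gains) >= 2:
        return "LENIENT PASS"
    return "FAIL"
```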
Setup
- Base model: FractalMoE 30M, ternary-quantised via Bet 13's 1.58-bit PTQ. The ternary base is ~10× smaller than FP16 (6 MB vs ~60 MB unquantised) and has ~1.5× higher held-out perplexity than the FP16 base (the price of quantisation).
- Three user fixtures: programmer, novelist, scientist. Same training/eval texts as Bet 49.
- Per-user adapter training: 5 minutes (300 seconds) on M1 Max with MPS, AdamW, lr 5e-5, norm-only parameters (~2300 params, BitFit-style tuning of the RMSNorm gain weights); a minimal sketch of this loop follows the list.
- Eval: held-out same-user text, 1000 tokens, perplexity comparison.
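The per-user training loop, sketched under the settings above. The model interface, the batch iterator, and the name-matching heuristic for picking out the RMSNorm gains are assumptions for illustration, not the repo's actual API:

```python
import time
import torch

def train_norm_only(model, user_batches, budget_s: float = 300.0, lr: float = 5e-5):
    # Freeze everything, then re-enable only the norm gain weights (~2.3k params).
    for p in model.parameters():
        p.requires_grad_(False)
    norm_params = [p for n, p in model.named_parameters() if "norm" in n]
    for p in norm_params:
        p.requires_grad_(True)

    opt = torch.optim.AdamW(norm_params, lr=lr)
    start = time.time()
    for input_ids, targets in user_batches:   # loop until the wall-clock budget runs out
        if time.time() - start > budget_s:
            break
        logits = model(input_ids)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return norm_params                        # the ~9 KB of user-specific state
```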
Result — STRICT 3/3 PASS
| User | Ternary base ppl (no adapter) | Ternary + norm-only adapter ppl | Improvement |
|---|---|---|---|
| programmer | 18,400 | 6,440 | 65% |
| novelist | 287 | 119 | 58% |
| scientist | 312 | 138 | 56% |
Norm-only adapter on top of ternary base recovers most of the per-user perplexity benefit, while the wire format remains:
- Base model: 1.58 bits per weight × 30M params ≈ 6 MB.
- Per-user adapter: 9 KB.
Total federation wire format per user: 6 MB + 9 KB. For a Kerala IT@School fleet of 215k students, that's 6 MB shared base (downloaded once per device) plus 1.9 GB total personalisation across the fleet (9 KB × 215k users). This is tractable for the deployment scenario — the per-user adapter cost is dominated by the shared base cost.
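The size accounting is simple enough to check in a few lines; the constants below are the ones quoted above:

```python
params = 30_000_000
base_mb = params * 1.58 / 8 / 1e6        # ternary base: ~5.9 MB, quoted as ~6 MB
adapter_kb = 9                           # per-user norm-only adapter
fleet = 215_000                          # Kerala IT@School students
fleet_gb = adapter_kb * fleet / 1e6      # ~1.9 GB of adapters across the fleet
per_user_mb = base_mb + adapter_kb / 1e3 # the ~6 MB + 9 KB wire format per target
```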
Why it composes — the orthogonality picture
Ternary PTQ damages the base model's expressivity in ways that the norm gains can largely compensate for. The norm gains aren't trying to recover the full FP16 base — they're learning to scale the (lossy) ternary residual stream toward each user's distribution.
Concretely, here's what's happening at each layer:
- The ternary-quantised attention and FFN compute an output that's an approximation of what the FP16 model would compute. The approximation is per-token noisy: some tokens get a representation close to FP16, others get a representation that's distorted by quantisation round-off.
- The norm-only adapter applies a per-channel scaling factor to the residual stream. This scaling factor is tuned on the user's training text.
- For tokens that the ternary model handles well (close to FP16), the norm scaling shifts the representation toward the user's preferred output distribution — same as it would on FP16 base.
- For tokens that the ternary model handles poorly (high quantisation noise), the norm scaling can partially compensate by emphasising the channels where the model's output is most consistent with the user's training data.
The norm scalars have enough degrees of freedom to act as a per-user "denoiser" against the quantisation noise — for the user-specific subspace of the model's output. They can't undo the quantisation damage globally (there are too few parameters for that), but they can undo it for the slice the user cares about.
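As a sketch of where those user-specific degrees of freedom sit in the forward pass (shapes, module layout, and pre- vs post-norm placement are illustrative assumptions, not the repo's code):

```python
import torch

def rmsnorm(x: torch.Tensor, gain: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: (batch, seq, d_model); gain: (d_model,) -- the per-user trainable scalars
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return x * rms * gain

def block_forward(x, q_weight, scale, user_gain):
    # Ternary linear: weights in {-1, 0, +1} times a shared magnitude scale.
    # q_weight is square (d_model x d_model) here so the residual add works.
    h = torch.nn.functional.linear(x, q_weight * scale)
    # The only user-specific degrees of freedom are the per-channel norm gains.
    return rmsnorm(x + h, user_gain)
```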
This is a property of norm-only adapters specifically; LoRA-r4 doesn't compose this cleanly with the ternary base. The LoRA update is richer (a rank-4 direction in weight space rather than a per-channel gain), but that richness ends up fighting the quantisation noise rather than complementing it: the quantisation spreads noise across the channels, and the rank-4 structure isn't well-aligned with that noise. We didn't run the full LoRA-on-ternary comparison as a separate bet, but informal testing suggested LoRA-r4 on ternary recovers less of the per-user benefit than norm-only does.
The takeaway: norm-only's mathematical structure (per-channel scalar gains) happens to be the right complement to ternary PTQ's damage profile (roughly per-channel quantisation noise). The two compressions are orthogonal in the relevant sense because they act on the same per-channel structure in complementary ways.
What this validates and what it doesn't
The bet validates the production wire format: ternary base + norm-only adapter composes cleanly with material per-user improvements. The federation can ship this combination as the default for any deployment.
The bet does not validate:
- Composition with magnitude pruning (Bet 48). Adding magnitude pruning to the stack (ternary + magnitude + norm-only) is a separate composition that hasn't been tested. Plausibly works since magnitude pruning operates on different weights than ternary's per-weight quantisation, but unverified.
- Composition with adapter quantisation (Bet 53). Quantising the norm-only adapter to int8 on top of the ternary base is the bet that produced borderline results — the int8 adapter regresses below no-adapter on the scientist user. The federation does not chain ternary base + int8 adapter; the per-user adapter ships raw FP16.
- Composition at scale. The bet ran at 30M base. Whether the same composition works at 1B+ scale is open. There's no obvious reason to expect it to break, but the deployment-scale validation is in the open-questions chapter.
- Composition for users far from the training distribution. The three users tested are all reasonable representatives of the kinds of text the base model was pretrained on. Users whose distribution is far from pretraining (e.g. extreme code-switched text, highly specialised jargon) might find that ternary's quantisation damage isn't recoverable by norm-only's degrees of freedom. This is a deployment-validation question.
Run command
PYTHONPATH=src python -m experiments.bets.52_quant_plus_adapter
Output: experiments/bets/results/52_quant_plus_adapter.json records the per-user perplexities for ternary-base-only and ternary+norm-only conditions, the per-user training trajectories, and the wire-format size accounting (6 MB + 9 KB).
Related entries
- Bet 13: 1.58-bit ternary PTQ. The base-model compression that this bet composes with.
- Bet 49: norm-only adapter shootout. The per-user adapter that this bet composes with.
- Bet 53: adapter-of-adapter compression. The borderline composition that this bet contrasts with.
- Bet 31: model-soup falsified. A non-composition (specialists in weight space).
- Bet 38: expert collapse. Another non-composition (experts averaged in weight space).
Why it matters
This is the production wire format for SharedLLM. Bet 13 established ternary as feasible; Bet 49 established norm-only as the per-user adapter; Bet 52 establishes that they compose. Every cost calculation in the deployment story (Kerala fleet sizing, gossip directory entry size, per-token settlement) is downstream of this composition working.
The two cheapest compressions in the catalogue are also the two that compose. That's the kind of result that suggests the design is on the right side of the manifold — when the cheap things work and compose, you're in a regime where compression is a tractable problem. The contrast is with regimes where each compression requires a custom retraining process or a different per-deployment validation; the federation's regime is the cheap one.
The methodological lesson is that composition is testable, not assumable. The catalogue contains compositions that work (this bet) and compositions that don't (Bet 31, Bet 38, Bet 53's borderline). The discipline is to run the composition bet rather than assume the components compose. Bet 52's clean PASS is what gives the wire format its empirical foundation; without this bet, the federation would be assuming the composition works, and Bet 53's borderline result shows that assumption can fail in adjacent regions.
For deployment economics: the 6 MB + 9 KB number is the federation's quote for what "personalised inference per user" costs in storage and bandwidth. That number is the federation's economic differentiator vs centralised LLM services. The fact that it's small enough for community-owned deployment (a Kerala IT@School laptop can carry the base + its student's adapter without difficulty) is what makes the per-person federation thesis viable.