Bet 04 — Speculative federation (mixture combiner)
The most important math result in the catalogue, and the bet that determines what "federation of specialists" actually means at the per-token level. Two specialists disagree about what the next token should be. How do you combine their per-specialist log-probabilities into a single federation log-probability that is both mathematically defensible and behaviourally useful?
The question is not academic. Every inference step in a federation of multiple specialists has to make this choice. If the choice is wrong, the federation can produce output that is worse than any of its constituent specialists individually. We measured exactly that outcome, for the wrong choice, in this bet.
Two candidate combiners
The two natural-feeling options:
- Sum of log-probabilities. Treat the specialists as independent observers of a ground-truth distribution. The joint log-probability of token t is Σ log p_i(t), where p_i is the i-th specialist's per-token distribution. Equivalently in probability space, p_joint(t) = Π p_i(t) / Z, where Z is the normalising constant. This treats each specialist as a noisy independent measurement of the truth and multiplies the marginal probabilities to get the joint.
- Mixture (a.k.a. soft union). Treat the federation as a single random model drawn uniformly from the N specialists, and ask for the probability under that mixture distribution. The joint log-probability of token t is log((1/N) Σ p_i(t)) = logsumexp(log p_i(t)) − log N. This treats the specialists as alternative models that the federation averages over, not as independent observations of a single truth.
Both options require the per-specialist log-prob vectors to be available, which the bets harness already provides. Both are computationally cheap. The decision between them is purely about correctness.
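Concretely, both combiners reduce to a few lines over the per-specialist log-prob matrix. A minimal sketch, assuming the log-probs arrive as a NumPy array of shape (N specialists × vocab size); the function names here are illustrative, not the harness's actual API:

```python
import numpy as np
from scipy.special import logsumexp

def sum_logprob_combine(logps: np.ndarray) -> np.ndarray:
    """Sum-of-log-probs: treat specialists as independent observers.

    logps has shape (n_specialists, vocab_size) and holds log p_i(t).
    Returns the renormalised joint log-distribution over the vocabulary.
    """
    joint = logps.sum(axis=0)        # Σ_i log p_i(t)
    return joint - logsumexp(joint)  # subtract log Z to renormalise

def mixture_combine(logps: np.ndarray) -> np.ndarray:
    """Mixture (soft union): average the specialists' distributions.

    Returns log((1/N) Σ_i p_i(t)) = logsumexp_i(log p_i(t)) - log N.
    Already normalised, because each p_i is itself a distribution.
    """
    n = logps.shape[0]
    return logsumexp(logps, axis=0) - np.log(n)
```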
Why sum-of-log-probs looks reasonable at first
The temptation to use sum-of-log-probs is real. It comes from the standard derivation: if you assume specialists are conditionally independent given the true next token, then p_joint(token | context) ∝ Π p_i(token | context). Sum-of-log-probs is the log of that product. It's the canonical Bayesian-update formula for combining independent likelihoods.
It also has a satisfying behaviour at the agreement extreme: if all specialists assign probability 0.95 to the same token, the joint assigns probability ~1.0. This concentrates probability on the agreed-upon token in a way that feels right.
Why sum-of-log-probs is catastrophic in practice
Specialists are not conditionally independent. Centrally-trained specialists share an architecture, often share a base model, and have correlated mistakes. Worse, they are reliably wrong about each other's domains. A code specialist will assign very low probability to the next token of a poem; a poetry specialist will assign very low probability to the next token of a Python function definition. Multiplying low probabilities produces vanishingly low joint probabilities — and there's no token at which the multiplication converges to a reasonable value.
The empirical result on a held-out mixed-domain text:
| Combiner | Joint perplexity |
|---|---|
| Best single specialist | 197 |
| Worst single specialist | 339 |
| Sum of log-probs | 1,277 |
| Mixture (logsumexp − log N) | 156 |
Sum-of-log-probs made the joint output worse than the worst single specialist by a factor of ~3.8×. The federation is actively harmful under sum-of-log-probs combining.
The intuition for the failure: if specialist A says "next token is 'cat' with probability 0.9" and specialist B says "next token is 'dog' with probability 0.9," sum-of-log-probs assigns:
- p_joint(cat) ∝ 0.9 × 0.1 = 0.09
- p_joint(dog) ∝ 0.1 × 0.9 = 0.09
- All other tokens get something even smaller.
After normalisation, neither "cat" nor "dog" dominates: the joint spreads its mass thinly across cat, dog, and everything else. Worse, the unnormalised joint mass is tiny, and the normalising constant Z ends up dominated by tokens neither specialist intended, including very long-tail tokens that both specialists assigned nonzero but small probability to. The joint distribution contradicts both specialists' marginals, because each specialist individually rules out the other's chosen token.
This is the disagreement pathology. It's not a small effect at the boundary; it's the dominant behaviour any time the specialists' top-k tokens don't overlap. And in mixed-domain federation, top-k tokens routinely don't overlap.
Why mixture works
The mixture combiner says: pretend the federation picks a specialist uniformly at random and emits a token from that specialist's distribution. The federation's distribution over tokens is then (1/N) Σ p_i(token | context). In log space, this is logsumexp(log p_i) − log N.
The key mathematical property is Jensen's inequality applied in the right direction:
log((1/N) Σ p_i) ≥ (1/N) Σ log p_i
The mixture log-prob is bounded below by the average of the per-specialist log-probs. This means: the joint perplexity under mixture is bounded above by the geometric mean of the per-specialist perplexities. In particular, the mixture cannot perform worse than the worst specialist by more than a factor of N (and in practice, much better than that).
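Spelling out how the inequality becomes the two perplexity bounds (T below denotes the number of evaluated tokens; the symbol is introduced just for this derivation):

```latex
% Per token x_t (Jensen, concavity of log):
%   \log p_{\mathrm{mix}}(x_t) = \log\tfrac{1}{N}\sum_i p_i(x_t)
%                              \ge \tfrac{1}{N}\sum_i \log p_i(x_t).
% Average over the T evaluated tokens, negate, exponentiate:
\mathrm{PPL}_{\mathrm{mix}}
  = \exp\Big(-\tfrac{1}{T}\sum_{t=1}^{T}\log p_{\mathrm{mix}}(x_t)\Big)
  \le \exp\Big(-\tfrac{1}{T}\sum_{t=1}^{T}\tfrac{1}{N}\sum_{i=1}^{N}\log p_i(x_t)\Big)
  = \prod_{i=1}^{N}\mathrm{PPL}_i^{1/N}
% Separately, p_{\mathrm{mix}}(x_t) \ge p_i(x_t)/N for every i, hence
%   \mathrm{PPL}_{\mathrm{mix}} \le N\,\mathrm{PPL}_i for every specialist i.
```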
The same per-token inequality, p_mixture(t) ≥ p_i(t)/N for every specialist i, also gives a stronger behavioural guarantee: when one specialist confidently assigns high probability to the correct token and the others assign almost none, the mixture log-prob is approximately log(p_best) − log N, so the cost of not knowing which specialist is right is capped at log N regardless of how wrong the others are. The mixture is "free-riding" on whichever specialist happens to be right, without you having to know in advance which one it is.
In the cat/dog disagreement scenario:
- p_mixture(cat) = (0.9 + 0.1) / 2 = 0.5
- p_mixture(dog) = (0.1 + 0.9) / 2 = 0.5
- All other tokens get something around 0 / 2 = 0.
The mixture splits evenly between the two specialists' choices, with no contradiction of either marginal. The joint distribution is a clean blend of the two component distributions.
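Reusing the two combiner functions from the sketch above, a quick numeric check of the disagreement scenario, plus the in-domain/out-of-domain case described earlier. Only the 0.9 / 0.1 head values come from the text; the remaining tail probabilities are invented for illustration:

```python
import numpy as np
# Assumes sum_logprob_combine and mixture_combine from the earlier sketch are in scope.

# Scenario 1: two confident specialists disagreeing (the cat/dog case).
# Columns: cat, dog. Rows: specialist A, specialist B.
disagree = np.log(np.array([
    [0.9, 0.1],
    [0.1, 0.9],
]))
print(np.exp(mixture_combine(disagree)))      # [0.5, 0.5]: a clean blend
print(np.exp(sum_logprob_combine(disagree)))  # also [0.5, 0.5] on a 2-token toy;
                                              # the perplexity gap only shows up on a full vocabulary

# Scenario 2: one in-domain specialist, one out-of-domain specialist.
# Columns: correct token, out-of-domain favourite, everything else (lumped).
cross_domain = np.log(np.array([
    [0.90, 0.02, 0.08],    # in-domain specialist: confident and right
    [0.001, 0.60, 0.399],  # out-of-domain specialist: confidently elsewhere
]))
print(np.exp(mixture_combine(cross_domain)))      # ~[0.45, 0.31, 0.24]: free-rides on the right one
print(np.exp(sum_logprob_combine(cross_domain)))  # ~[0.02, 0.27, 0.71]: the correct token is crushed
```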
Empirical validation
Joint ppl under mixture: 156. Compare to:
- Best single specialist: 197.
- Average single specialist: 268.
- Worst single specialist: 339.
- Sum-of-log-probs: 1,277.
The mixture outperforms the best single specialist on the held-out test by 26% (joint perplexity 156 versus 197). This is the free-riding property in action — the mixture lets each specialist contribute on the tokens it's confident about, and the federation as a whole performs better than any constituent.
This is the strongest possible result for this bet: the federation's combined output is not just safe but actively better than any individual specialist. It justifies the entire federation premise. If the joint output were merely "as good as the best specialist," federation would still be useful for routing and load-balancing, but it wouldn't have the per-token quality story. With mixture combining, it does.
Production rule
Use the mixture combiner. Do not use sum-of-log-probs. Sum-of-log-probs is only safe when specialists have been explicitly trained for conditional independence — which centrally-trained specialists are not, by construction. The federation's mixture_combine() function in src/sharedllm/inference/combiner.py implements logsumexp(log_probs) − log(N) and is the only combining path that production code is allowed to take.
The bet does not preclude future combiners. There's room for weighted mixtures (where the weights are learned per-context), expert-routing combiners (where a router picks a single specialist per token rather than blending), and temperature-adjusted mixtures. All of these are interesting research directions; none of them are sum-of-log-probs.
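For concreteness, a weighted mixture changes only the uniform 1/N prior. A minimal sketch, assuming per-context weights are supplied from elsewhere (a learned router, keyword priors); producing good weights is exactly the unvalidated part:

```python
import numpy as np
from scipy.special import logsumexp

def weighted_mixture_combine(logps: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted soft union: log(Σ_i w_i · p_i(t)), with Σ_i w_i = 1.

    logps has shape (n_specialists, vocab_size); weights has shape (n_specialists,).
    Setting weights = np.full(n, 1/n) recovers the uniform production mixture.
    Illustrative only: weighted mixtures are not validated in this catalogue.
    """
    assert np.isclose(weights.sum(), 1.0)
    # b scales each exp(logps[i]) by w_i inside the logsumexp.
    return logsumexp(logps, axis=0, b=weights[:, None])
```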
Connection to glass-box LLM (Bet 18)
The mixture combiner has a clean reconciliation property that the audit trail depends on. The joint log-prob equals logsumexp(log p_i) − log N, so it can be recomputed exactly from the stored per-specialist log-probs, modulo numerical precision; the softmax of those log-probs gives each specialist's share of the mixture mass on that token. This means that, for any output token, the audit trail can show:
- The joint log-prob the federation assigned.
- The per-specialist log-prob each specialist assigned.
- The reconciliation residual (numerical mismatch between the two).
If the residual is close to zero, the audit trail is mathematically consistent. Bet 18 measured the median reconciliation residual at 3 × 10⁻⁷ across 1,000 tokens × 3 specialists. This is tight enough that any non-trivial residual is a bug signal — the audit trail tells you when something is wrong.
Sum-of-log-probs, by contrast, doesn't reconcile cleanly, because the normalising constant Z makes the joint log-prob non-additive in the per-specialist log-probs. You can compute attribution under sum-of-log-probs but it's a more fragile derivation. The mixture combiner makes the audit trail mathematically exact.
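In sketch form, the reconciliation check reduces to recomputing the joint from the stored per-specialist log-probs. The field names and trace layout below are illustrative assumptions, not the actual audit-trail schema; only the logsumexp(log p_i) − log N identity comes from this entry:

```python
import numpy as np
from scipy.special import logsumexp

def reconcile(joint_logprob: float, specialist_logprobs: np.ndarray) -> dict:
    """Recompute the joint from the stored per-specialist log-probs and report the residual.

    specialist_logprobs holds log p_i(token) for the emitted token, one entry per specialist.
    """
    n = specialist_logprobs.shape[0]
    recomputed = float(logsumexp(specialist_logprobs) - np.log(n))
    # Softmax over the per-specialist log-probs = each specialist's share of the
    # mixture mass on this token (the shares sum to 1 by construction).
    contributions = np.exp(specialist_logprobs - logsumexp(specialist_logprobs))
    return {
        "joint_logprob": joint_logprob,
        "recomputed_joint": recomputed,
        "residual": abs(joint_logprob - recomputed),  # a non-trivial residual is a bug signal
        "per_specialist_weight": contributions.tolist(),
    }
```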
What this does not claim
- That the mixture combiner is the optimal combiner. It's the safe default. A learned combiner with per-context weights might outperform mixture on specific tasks.
- That all specialists should be weighted equally. Mixture uses uniform weights (1/N). A weighted mixture with weights inferred from context is a natural extension. We have not validated weighted mixtures in this catalogue; uniform mixture is the production default.
- That mixture is bandwidth-free. The federation has to compute log p_i for every token from every specialist, which requires every specialist to participate in every inference step. There's a real cost. Bets 03 (keyword router) and 16 (LRU directory) explore the orthogonal question of how to avoid computing every specialist for every token; the mixture math applies once you've decided which specialists to query.
Run command
```bash
PYTHONPATH=src python -m experiments.bets.04_mixture_combiner
```
Output: experiments/bets/results/04_mixture_combiner.json. The module also writes a per-specialist contribution trace to out/04_attribution.json, used by Bet 18 (glass-box) to verify reconciliation.
Related entries
- Bet 18: Glass-box LLM. Per-token attribution depends on the mixture combiner's reconciliation property.
- Bet 31: Linear weight-soup (FALSIFIED). The wrong way to combine specialists — average their weights. The mixture combiner is the right way: average their outputs, not their parameters.
- Bet 33: Task-vector extrapolation soup (also FALSIFIED). Another weight-space combination that doesn't survive.
- Bet 55: Multi-adapter logit ensemble for unknown users (FALSIFIED). Mixture combining per-user adapters does not generalise to new users. The mixture math works for specialists but not for adapters across users — a subtle distinction this bet's literature framing didn't predict.
Why it matters
If sum-of-log-probs had worked, federation would have a different shape — independent specialists each contributing a confident vote, with the joint converging on the consensus. Because sum-of-log-probs doesn't work, federation has to embrace soft union: each specialist contributes a partial perspective, and the joint is the mixture, not the consensus. That choice has architectural consequences. It means the federation is happy to have N specialists with overlapping competence, because the mixture lets each contribute where it's confident. It also means specialist disagreement is not a bug to suppress — it's the substrate the mixture combiner operates on. Disagreement is informative; the mixture lets it be informative without destroying the joint output.
This is also the bet that justifies the federation's entire per-token transparency story (Bet 18). You can only build a glass-box LLM on top of a combiner that reconciles. Sum-of-log-probs doesn't reconcile cleanly. Mixture does. The audit trail is downstream of the combiner choice, and the audit trail is what makes federation deployable in regulated environments. So: the entire glass-box property hinges on this bet's result.