Bet 31 — Linear weight-soup (FALSIFIED)
This is the cleanest falsification in the catalogue. The hypothesis is intuitive enough that everyone who proposes a federation suggests it within the first hour: if you have two specialists trained on different data, average their weights. The averaged model should inherit the strengths of both. It doesn't. Every linear interpolation tested between two independently-trained specialists was strictly worse than both parents. The bet's job is to make this falsification visible enough that the federation doesn't have to keep re-falsifying it as new contributors propose it.
The "model soup" idea has a real research lineage — Wortsman et al., 2022 demonstrated that fine-tuned models with shared initialisations can be averaged productively. The bet's contribution is the negative version: in the regime the federation actually operates in (independent specialists, no shared fine-tuning trajectory), soup doesn't work. The federation primitive that does work is inference-time output-space mixture (Bet 04). This bet is the falsification that pushes the federation from "maybe combine in weight space" to "definitely combine in output space."
Background — what model-soup is, and why it sometimes works
The original model-soup result is a real positive finding in a constrained regime:
- Start from a single pretrained checkpoint.
- Fine-tune that checkpoint multiple times with different hyperparameters (different learning rates, different data orderings, different augmentation seeds).
- Average the resulting fine-tuned models' weights.
- Find that the average is sometimes better than any individual fine-tune.
The intuition is that all the fine-tuned models occupy the same loss basin (they share an initialisation; the fine-tunes are short and don't escape the basin), and the average lies in the same basin but at a flatter point — which generalises better.
That regime has two preconditions: shared initialisation and same-basin fine-tunes. The federation's specialists violate both. Two specialists are trained from scratch (or from a shared base, but with hours of independent training), end up in different basins, and there's no a priori reason to expect a midpoint between basins to be a good model.
The bet asks: is the federation's regime nonetheless soup-friendly, or is it not?
Hypothesis
For at least one α ∈ [0.1, 0.9], the soup α × specialist_1 + (1 − α) × specialist_2 produces a perplexity (ppl) on a held-out eval text that is lower than both parents' ppl.
The "for at least one α" framing is the loosest possible win condition. We don't require the soup to be optimal; we don't require it to beat the parents at every α; we just require some α to be a sweet spot. If even that fails, soup is falsified.
Pre-registered criteria
- STRICT: soup beats both parents on ≥ 1 α and ≥ 1 eval text.
- LENIENT: soup ≤ better-parent ppl on ≥ 1 α (i.e. soup at least matches the better parent somewhere).
- CATASTROPHIC: every α ∈ [0.1, 0.9] makes soup ppl worse than the worse parent (would mean the soup is strictly worse than both parents everywhere).
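A minimal sketch of this decision rule for a single eval text (the catalogue's STRICT criterion additionally quantifies over eval texts); the function and variable names here are illustrative, not the harness's actual API:

```python
def classify(soup_ppls: dict[float, float], parent_ppls: tuple[float, float]) -> str:
    """Map a sweep of soup perplexities to the pre-registered outcome."""
    better, worse = min(parent_ppls), max(parent_ppls)
    if any(p < better for p in soup_ppls.values()):
        return "STRICT"        # beats both parents at some alpha
    if any(p <= better for p in soup_ppls.values()):
        return "LENIENT"       # at least matches the better parent somewhere
    if all(p > worse for p in soup_ppls.values()):
        return "CATASTROPHIC"  # worse than even the worse parent at every alpha
    return "FALSIFIED"         # worse than the better parent everywhere, short of catastrophic
```

On the numbers in the Result table below (soup ppls 224–502 against parents at 187 and ≈220), this returns "CATASTROPHIC".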
Setup
Two specialists, both fine-tuned from the FractalMoE 30M base model on different corpora:
- Specialist 1 ("late"): fine-tuned on a programmer-style text fixture, 1000 steps.
- Specialist 2 ("early"): fine-tuned on a novelist-style text fixture, 1000 steps.
Both fine-tunes are full-parameter (38M params each), so the soup is over the full parameter set, not just an adapter. The soup is computed elementwise: soup_param = α × spec1_param + (1−α) × spec2_param for every parameter in the model.
Sweep α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. Evaluate each soup on a held-out text matching specialist 1's distribution (programmer-style). The expectation: soup at low α (closer to specialist 2, which doesn't match the eval) should be worse; soup at high α (closer to specialist 1, which matches) should be better; and the question is whether some intermediate α — combining both specialists' strengths — beats specialist 1 alone.
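A minimal sketch of the soup and sweep as described, assuming PyTorch-style state dicts with identical keys (true here, since both specialists are full-parameter fine-tunes of the same base); `sweep`'s `model`, `evaluate_ppl`, and `eval_text` arguments are the caller's stand-ins, not the harness's actual helpers:

```python
def linear_soup(spec1: dict, spec2: dict, alpha: float) -> dict:
    """Elementwise soup: alpha * spec1_param + (1 - alpha) * spec2_param."""
    return {name: alpha * spec1[name] + (1.0 - alpha) * spec2[name]
            for name in spec1}

def sweep(model, spec1: dict, spec2: dict, evaluate_ppl, eval_text,
          alphas=(0.1, 0.3, 0.5, 0.7, 0.9)) -> dict:
    """Load each soup into the model and record its perplexity on eval_text."""
    results = {}
    for alpha in alphas:
        model.load_state_dict(linear_soup(spec1, spec2, alpha))
        results[alpha] = evaluate_ppl(model, eval_text)
    return results
```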
Result — CATASTROPHIC
| α | Soup ppl | Better-parent ppl |
|---|---|---|
| 0.1 | 224 | 187 |
| 0.3 | 280 | 187 |
| 0.5 | 351 | 187 |
| 0.7 | 433 | 187 |
| 0.9 | 502 | 187 |
The soup is strictly worse than the better parent at every α. Worse, the relationship between α and ppl is monotonic — as α moves from 0.1 to 0.9, the ppl gets worse, not better. There's no sweet spot, no intermediate minimum. The soup at α=0.1 is already 20% worse than the better parent; by α=0.9 it's 168% worse.
This is the CATASTROPHIC outcome: every α in the swept range makes soup worse than even the worse parent (the worse parent's ppl is around 220 on this eval text; only at α=0.1 does the soup come close, at 224, and even there it loses).
Why it failed — the loss-basin geometry
The two specialists' weights live in different loss basins. A loss basin is a region of weight space where the loss is low and gradient descent has settled; the specialist's fine-tune trajectory is the path from the base into that basin. Two different fine-tunes on different data go to different basins.
Linear interpolation between two basins traces a line in weight space. That line crosses through high-loss regions between the basins. The midpoint of the line — 0.5 × spec1 + 0.5 × spec2 — is generally not in either basin; it's in the saddle region between them. The model evaluated at the midpoint has the loss of a saddle, which is high.
The monotonicity of the result is consistent with this geometry: as α moves from 0.1 (close to specialist 2) toward 0.9 (nominally close to specialist 1), the soup leaves spec 2's basin almost immediately and spends the rest of the path in the high-loss saddle region between basins. Only the low-α end gets close to a real basin (spec 2's, which is why α=0.1 is the least-bad soup); even at α=0.9 the soup has not entered spec 1's basin, which is why the ppl keeps climbing all the way to that end of the sweep.
Why the original model-soup result doesn't apply
Wortsman et al.'s soup works because all the soup ingredients share an initialisation and a basin. The hyperparameter-sweep fine-tunes don't leave the basin; they wiggle around inside it. The average of wiggles inside a basin is also inside the basin, possibly at a flatter point.
The federation's specialists don't share a basin. Even if they share an initialisation (the base model), the long fine-tune trajectories take them into separate basins. The federation operates in the regime where soup fails, not the regime where it works.
This is the kind of result that's only obvious in retrospect. It's perfectly natural to read about model-soup, find the result encouraging, and assume it generalises. The bet's CATASTROPHIC failure is the data point that says "the published result doesn't generalise to this regime."
What replaces this — Bet 04 (mixture combiner)
The federation primitive that actually works for combining specialists is the inference-time output-space mixture of Bet 04:
combined_log_prob[token] = logsumexp(per_specialist_log_prob[token]) − log(N)
This combines specialists at the output level, after each specialist has run its forward pass independently. There's no weight-space interpolation; each specialist computes in its own basin, then their outputs are combined.
Because logsumexp is lower-bounded by the max, the combined log-probability for any token is at least the best specialist's log-probability minus log(N); and because an average never exceeds a maximum, it is at most the best specialist's. The mixture is therefore provably within log(N) of the best specialist on every token. It's also the formalism that makes glass-box attribution (Bet 18) possible: each specialist's contribution is preserved as a per-token log-probability that can be audited.
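A minimal sketch of the combiner, assuming each specialist exposes per-token log-probabilities over a shared vocabulary (the tensor layout here, specialists as rows, is an assumption):

```python
import math

import torch

def mixture_log_probs(specialist_log_probs: torch.Tensor) -> torch.Tensor:
    """(N, vocab) specialist log-probs -> (vocab,) mixture log-probs.

    Computes log((1/N) * sum_i p_i). Per-token guarantee:
    best_log_prob - log(N) <= mixture_log_prob <= best_log_prob.
    """
    n = specialist_log_probs.shape[0]
    return torch.logsumexp(specialist_log_probs, dim=0) - math.log(n)
```

With N = 2 the worst-case per-token penalty against the best specialist is log 2 ≈ 0.69 nats, a bounded price that weight-space interpolation offers no analogue of.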
The takeaway: combine in output space, not weight space. Bet 31's falsification is what makes this the canonical federation primitive.
Bet 33 — task-vector extrapolation, also failed
Bet 33 ran a more sophisticated soup variant: extract the "task vector" of each specialist (task_vec = specialist_weights − base_weights), then combine task vectors instead of weights:
soup = base + α × task_vec_1 + β × task_vec_2
This is the task arithmetic idea — adding task vectors should give a model with both tasks' capabilities. It also failed for the federation's regime. Task vectors of independently-trained specialists don't compose additively in weight space, just like the specialists themselves don't average in weight space.
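A sketch of the Bet 33 construction, reusing the state-dict convention from the soup sketch above:

```python
def task_vector(spec: dict, base: dict) -> dict:
    """task_vec = specialist_weights - base_weights, per parameter."""
    return {name: spec[name] - base[name] for name in base}

def task_arithmetic(base: dict, tv1: dict, tv2: dict,
                    alpha: float, beta: float) -> dict:
    """soup = base + alpha * task_vec_1 + beta * task_vec_2, per parameter."""
    return {name: base[name] + alpha * tv1[name] + beta * tv2[name]
            for name in base}
```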
Two failed soup variants (linear soup, task-vector arithmetic) are enough evidence to close the door. The federation does not combine specialists in weight space.
What this leaves open
- Soup with shared fine-tuning trajectory. If two specialists were fine-tuned with shared random seeds and a shared early phase before diverging, would soup work for them? Plausibly yes, by the original Wortsman result. The federation's specialists are not built this way, but a future federation variant could be. Open.
- Soup at very small α. α=0.05 might still sit inside spec 2's basin and produce a useful "mostly-spec-2-with-a-little-spec-1" model. Not tested explicitly; the sweep started at α=0.1. Probably not worth pursuing because the inference-time mixture handles this case better.
- Bayesian model averaging. A more sophisticated weight combination (e.g. weighted by specialist log-marginal-likelihood on a calibration set) might produce a better soup. Untested. Probably also not worth pursuing because Bet 04's mixture combiner is mathematically clean and operationally cheap.
- Soup at scale. At 70B+ params, the loss landscape is geometrically different — it's empirically flatter and more convex-like in some directions. Soup may work better at scale. The federation operates at 30M (base) + per-user adapters; the at-scale soup question is open but not load-bearing for current deployment.
Run command
PYTHONPATH=src python -m experiments.bets.31_model_soup
Output: experiments/bets/results/31_model_soup.json records the soup ppl at each α, the per-parent ppls, and the eval-text identifier, plus the soup ppls at α=0 and α=1 (which trivially recover specialist 2 and specialist 1 respectively) as sanity checks.
Related entries
- Bet 04: mixture combiner. The output-space combination that replaces weight-space soup.
- Bet 33: task-vector extrapolation. Second falsified soup variant.
- Bet 54: cross-user adapter averaging. Same lesson at the adapter level — averaging across users mostly fails.
- Bet 55: multi-adapter logit ensemble. Ensembling across users' adapters for an unknown user — also fails. Routing required.
Why it stays in the catalogue
Falsifications stay linked. A reader scanning the catalogue for "specialist combination primitives" must encounter this entry. The retraction defines what the federation is by ruling out what it isn't.
The model-soup proposal is intuitive enough that it gets re-suggested every few months by new contributors who are familiar with the positive result and not the regime constraints. The catalogue's job is to provide the falsification as a cheap reference: "Bet 31 falsified soup at the federation's regime; here's the data; combine in output space instead."
The methodological lesson the bet encodes — published results have regime constraints, and a result that works in the original regime may not generalise — is also load-bearing for the broader catalogue. Many federation primitives sound like they should work because they worked elsewhere; the catalogue's empirical discipline is to test them in the federation's regime, even when the published evidence is encouraging. Bet 31's clean failure is one of the cleanest examples of this discipline producing a falsification.