Bet 62 — DiLoCo with controls (the honest follow-up to Bet 50)

This is the retraction. An external reviewer pushed back on Bet 50's headline — "DiLoCo with K=100 inner steps outperforms K=1 by 24% on real text" — pointing out that single-seed, fixed-step-count training without early stopping is exactly the setup where K=1 (synchronous SGD) would overfit and K=100 (with implicit smoothing across the K-step inner loop) would not. The 24% gap might not be "K=100 is a better optimizer"; it might be "K=1 overfitted at the final step we read."

The reviewer was right. Bet 62 is the disambiguating follow-up that confirmed it. We retracted Bet 50's headline. This entry documents the retraction in detail — what the new evidence shows, what the corrected framing is, and what the methodology gained from running it.

Background — what Bet 50 had claimed

DiLoCo (Douillard et al., 2023) is a federated training scheme: each worker runs K local SGD steps independently, then the workers average their updates and synchronise. K=1 is equivalent to synchronous SGD (every step is averaged). Larger K reduces the all-reduce frequency by a factor of K, dramatically lowering bandwidth cost — the practical motivation for federation across consumer hardware.
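A minimal sketch of the round structure described above, assuming a PyTorch model, plain SGD for the inner loop, and straight parameter averaging for the sync (the published DiLoCo recipe uses a separate outer optimizer; only the K-step shape is the point here):

    import copy
    import torch
    import torch.nn.functional as F

    def diloco_round(global_model, worker_batches, K, lr=1e-3):
        """One communication round: each worker runs K local steps, then parameters are averaged."""
        local_models = []
        for batches in worker_batches:              # one list of (inputs, targets) per worker
            local = copy.deepcopy(global_model)     # every worker starts from the shared parameters
            opt = torch.optim.SGD(local.parameters(), lr=lr)
            for inputs, targets in batches[:K]:     # K inner steps with no communication
                opt.zero_grad()
                F.cross_entropy(local(inputs), targets).backward()
                opt.step()
            local_models.append(local)
        with torch.no_grad():                       # the single all-reduce of the round
            for name, param in global_model.named_parameters():
                param.copy_(torch.stack(
                    [dict(m.named_parameters())[name] for m in local_models]).mean(dim=0))
        return global_model

With K=1 this degenerates to averaging after every step (synchronous SGD); with K=100 the averaging, and with it the bandwidth cost, happens once per 100 inner steps.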

Bet 50's setup: train a 30M-param FractalMoE for a fixed step count under K=1 vs K=100, on real text, and compare final losses. The result read:

K=100 final loss is 24% below K=1 final loss after 1,000 training steps. K=100 outperforms K=1 by 24%.

The framing was that K=100 was a better optimizer at this scale, with the side benefit of 1/100th the bandwidth cost. The federation thesis got a boost: not only is K=100 cheaper to communicate, it's also numerically better.

This was wrong. Or rather, it was a partial reading of the data. Bet 50's setup didn't separate optimizer quality from training-trajectory artefacts. K=1 trains faster per gradient step (it's noisier and more responsive), so it converges to a low training loss quickly, then begins to overfit. K=100 trains slower per gradient step (the inner-loop smoothing damps noise), so it converges more slowly and doesn't overfit at the same step count. After 1,000 steps:

  • K=1 has long since converged on the training set and is mid-overfit. Final loss high.
  • K=100 hasn't fully converged yet but is nowhere near overfitting. Final loss lower.

The 24% gap is real but it's not "K=100 is a better optimizer." It's "K=1 overfits at this step count, K=100 doesn't yet." If we ran K=1 with early stopping (track validation loss, stop when it bottoms out), K=1's best loss would be much lower than its final loss — and the comparison reverses.

That's the core methodological issue. Bet 62 ran the controlled experiment to measure it.

Setup

Three random seeds. Two K-values (K=1, K=100). 2,500 training steps each — longer than Bet 50, deliberately, to give K=1 enough rope to overfit visibly. Hold out a validation set; track best validation loss across the run. Compare best validation loss across K-values, not final loss.

Other parameters identical to Bet 50: same model (FractalMoE 30M), same dataset (real text mix), same optimizer, same learning-rate schedule. The only differences from Bet 50 are: more seeds, more steps, early-stopping bookkeeping.
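A minimal sketch of the bookkeeping this setup calls for, with hypothetical build_model, train_step, and eval_validation helpers standing in for the actual harness; the point is that best validation loss is recorded per (seed, K) cell rather than read off the final step:

    def run_cell(seed, K, total_steps=2500, eval_every=50):
        """Train one (seed, K) cell and record both best and final validation loss."""
        model = build_model(seed)                  # hypothetical: seeded FractalMoE 30M init
        best = {"loss": float("inf"), "step": None}
        for step in range(1, total_steps + 1):
            train_step(model, K)                   # hypothetical: one outer step (K inner steps)
            if step % eval_every == 0:
                val_loss = eval_validation(model)  # hypothetical held-out evaluation
                if val_loss < best["loss"]:
                    best = {"loss": val_loss, "step": step}
        return {"seed": seed, "K": K, "best": best, "final": eval_validation(model)}

    results = [run_cell(seed, K) for seed in (0, 1, 2) for K in (1, 100)]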

The pre-registered criteria were written from the perspective of "Bet 50's headline survives" being the null hypothesis (a mechanical check is sketched after the list):

  • STRICT (favouring Bet 50): K=100 best is at least 10% below K=1 best on ≥ 2 of 3 seeds.
  • LENIENT (favouring Bet 50): K=100 best is at least 5% below K=1 best on ≥ 2 of 3 seeds.
  • FALSIFIES Bet 50: K=1 best < K=100 best on ≥ 2 of 3 seeds.
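A sketch of how these criteria might be checked mechanically over the paired per-seed best losses (names, and the choice of K=1 best as the denominator, are assumptions, matching how the advantage column in the result table is computed):

    def judge(best_k1, best_k100):
        """Apply the pre-registered criteria to per-seed best validation losses."""
        # Relative margin of K=100 over K=1 per seed (positive means K=100 is better).
        margin = [(b1 - b100) / b1 for b1, b100 in zip(best_k1, best_k100)]
        if sum(m >= 0.10 for m in margin) >= 2:
            return "STRICT"
        if sum(m >= 0.05 for m in margin) >= 2:
            return "LENIENT"
        if sum(b1 < b100 for b1, b100 in zip(best_k1, best_k100)) >= 2:
            return "FALSIFIES Bet 50"
        return "INCONCLUSIVE"

    # With the Bet 62 numbers below, K=1 wins on every seed and the call is FALSIFIES.
    print(judge(best_k1=[3.142, 3.198, 3.067], best_k100=[3.298, 3.371, 3.541]))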

Result — FALSIFIES Bet 50

The full table. Loss is in nats (log perplexity); lower is better.

| Seed | K=1 best | K=100 best | K=1 final | K=100 final | K=1 advantage at best |
|---|---|---|---|---|---|
| 0 | 3.142 | 3.298 | 4.412 | 3.508 | +5.0% |
| 1 | 3.198 | 3.371 | 4.529 | 3.601 | +5.4% |
| 2 | 3.067 | 3.541 | 4.398 | 3.638 | +15.4% |

K=1 wins on best-validation in all 3 seeds, by 5%–15%. The headline number from Bet 50 (24% K=100 advantage) is reversed.

Looking at the K=1 final-vs-best gap: final − best is +1.27 to +1.33 nats per seed. K=1 overfits significantly between its best step (typically around step 800–1,000) and the final step (step 2,500). K=100's final − best is much smaller (+0.10 to +0.23 nats) — K=100 doesn't overfit much at this step count.
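Those gaps are just final minus best from the table above; a quick check of the arithmetic:

    k1_best,   k1_final   = [3.142, 3.198, 3.067], [4.412, 4.529, 4.398]
    k100_best, k100_final = [3.298, 3.371, 3.541], [3.508, 3.601, 3.638]
    print([round(f - b, 3) for b, f in zip(k1_best, k1_final)])      # [1.27, 1.331, 1.331]
    print([round(f - b, 3) for b, f in zip(k100_best, k100_final)])  # [0.21, 0.23, 0.097]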

Bet 50 read K=1's overfitted final loss against K=100's slightly-overfitted final loss and called K=100 the winner. Reading best loss instead — which is the loss that any sensible training run would actually deploy, with early stopping — reverses the result. K=1 is the better optimizer when used correctly (with early stopping); K=100's apparent advantage in Bet 50 was an artefact of comparing two methods at a step count where one had overfit and the other hadn't yet.

Reframed conclusion

The corrected story replaces Bet 50's headline:

  • K=1 is the better optimizer at this scale. End of that story. K=100 does not produce a better-quality model than K=1 at the optimal training duration for each method.
  • K=100 has materially lower bandwidth cost. The all-reduce is 1/100th as frequent (rough arithmetic after this list). For any deployment where bandwidth dominates wall-clock (which describes consumer-grade federation), K=100 may still be the right choice.
  • The K-value tradeoff is bandwidth vs loss, not "K=100 is just better." Picking K is a deployment-level cost-quality decision, not a numerical-performance decision.
  • Whether K=100 starts to outperform at frontier scale is an open question. At 1B+ parameters, K=1's per-step bandwidth becomes a wall-clock bottleneck (gradient sync dominates step time), and K=100 may dominate by virtue of completing more useful work per unit wall-clock. The bets harness can't reach this regime; it's listed in the Open Questions chapter.
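Rough arithmetic behind the bandwidth claim, assuming 30M fp32 parameters and one full parameter-sized payload per worker per all-reduce (ring-reduction constants and any gradient compression ignored):

    params = 30e6                               # FractalMoE 30M
    bytes_per_sync = params * 4                 # fp32 payload per worker per all-reduce
    steps = 2500
    for K in (1, 100):
        syncs = steps // K
        print(f"K={K:>3}: {syncs} all-reduces, ~{bytes_per_sync * syncs / 1e9:.0f} GB per worker")
    # K=  1: 2500 all-reduces, ~300 GB per worker
    # K=100: 25 all-reduces, ~3 GB per worker

The two-orders-of-magnitude gap is what keeps K=100 on the table despite the quality loss.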

Why this matters more than Bet 50 mattered

Bet 50 was load-bearing for the federated-training story. The narrative was: "DiLoCo with K=100 is both cheaper and better, so federation across consumer hardware is dominated by this approach." That narrative had a strong, simple, impressive framing. It was also wrong.

The corrected narrative is more nuanced: "K=1 produces better models when run to optimum; K=100 produces worse-but-acceptable models at 1/100th the bandwidth. Pick K based on your bandwidth budget, not on a belief that larger K is intrinsically better." This narrative is less marketing-friendly but more accurate.

The reviewer caught the error. The methodology required us to run the disambiguating experiment. The disambiguating experiment confirmed the error. The catalogue updated to reflect the correction. This is the methodology working as designed. The bets harness produced a wrong claim in Bet 50 and a right correction in Bet 62. Both bets stay in the catalogue. The pair is what credibility looks like.

What the methodology gained

The retraction is the strongest evidence the catalogue has that the methodology is doing real research, not confirmation theatre. Three concrete methodological gains:

  1. The "compare best, not final" rule. Any future training-trajectory bet must report best-validation-loss with early stopping, not final-loss-at-fixed-step-count. This rule is now in the methodology page; it's load-bearing for future federated-training work.

  2. The "multiple seeds for trajectory comparisons" rule. Single-seed trajectory comparisons are unreliable — the best step's location depends on the seed, and any fixed step count picks a different "best" for each seed. The methodology now requires ≥ 3 seeds for any optimizer-comparison bet.

  3. The "name the alternative explanation" rule. When a result is dramatic (24% improvement is dramatic), the writeup must explicitly enumerate alternative explanations and rule them out. Bet 50 didn't enumerate "K=1 overfits at the final step" as a candidate; if it had, the disambiguating follow-up would have run earlier. The methodology now requires the writeup to include an "alternative explanations considered and ruled out" section for any STRICT pass.

These three rules are now part of how the catalogue evaluates new bets. They cost a bit more thinking up-front but produce more durable claims.

What stays valid from Bet 50

The Bet 50 result, narrowly read as "K=100 final loss < K=1 final loss at 1,000 steps," is still a true measurement. The catalogue keeps Bet 50 in the historical record. The reframing is in the interpretation — what that measurement means for the federation. The corrected interpretation is in this bet.

This is the right way to handle retractions in an experiment-driven catalogue. The data isn't wrong (the numbers were measured correctly). The interpretation was wrong. We don't delete the data; we link it to the bet that corrected the interpretation, and we let readers see both. The catalogue treats data as accumulating over time and interpretations as updateable, not the other way around.

What this leaves open

  • What happens at K=10 or K=30? The bet only ran K=1 vs K=100. The full K-value sweep would be informative — there's likely a sweet-spot K where the bandwidth saving is large but the optimizer quality hasn't degraded much. Not yet run.
  • What about K=1 with explicit regularisation? If K=1's overfit problem is what hurts it at fixed step count, adding weight decay or dropout might close the gap further. Or might not. Not yet measured.
  • Frontier-scale (1B+) behaviour. This is the key open question. At small scale, K=1 dominates with early stopping. At larger scale, K=1's bandwidth cost may dominate wall-clock, flipping the comparison. The bets harness can't reach this scale.
  • Asynchronous behaviour with worker dropout. DiLoCo at K=100 is robust to worker dropout (the inner loop continues without the dropped worker; the all-reduce skips it). DiLoCo at K=1 is more sensitive (every step is a sync point). The fault-tolerance comparison wasn't measured here.

What the federation does with this result

The federation's training story is updated:

  • Default training mode: K=1 with early stopping, when bandwidth allows. This is the higher-quality regime.
  • Bandwidth-constrained training mode: K=100 (or higher), accepting some quality loss for dramatic bandwidth savings. Right for consumer-grade federation across slow ISPs.
  • K-value as a deployment-level knob, exposed to coordinators with sensible defaults. Not a hidden hyperparameter that gets one universal value. One possible config shape is sketched below.
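A hypothetical coordinator-level config, a sketch of the knob rather than the federation's actual API:

    from dataclasses import dataclass

    @dataclass
    class TrainingConfig:
        """Deployment-level training knobs; defaults favour quality when bandwidth allows."""
        inner_steps: int = 1          # K: 1 = synchronous SGD, 100 = bandwidth-constrained DiLoCo
        early_stopping: bool = True   # always track best validation loss; never read the final step
        patience_evals: int = 5       # stop after this many evaluations without improvement

    DEFAULT = TrainingConfig()                               # K=1, quality-first
    BANDWIDTH_CONSTRAINED = TrainingConfig(inner_steps=100)  # consumer-grade federation over slow ISPs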

The earlier framing — "K=100 dominates" — is removed from federation marketing. The training story is now framed honestly: bandwidth-vs-quality tradeoff, with the default chosen per deployment.

Run command

    PYTHONPATH=src python -m experiments.bets.62_diloco_k_with_controls

Output: experiments/bets/results/62_diloco_k_with_controls.json includes the full 3-seed × 2-K trajectory data with both best-validation and final losses, plus the early-stopping locations per seed.
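A sketch of pulling the headline numbers back out of that artifact; the field names are assumptions about the JSON layout, not a documented schema:

    import json

    with open("experiments/bets/results/62_diloco_k_with_controls.json") as f:
        results = json.load(f)

    # Assumed layout: one record per (seed, K) cell with best and final validation losses.
    for cell in results["cells"]:
        print(f"seed={cell['seed']} K={cell['K']}: "
              f"best={cell['best']['loss']:.3f} @ step {cell['best']['step']}, "
              f"final={cell['final']:.3f}")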

Related bets

  • Bet 50: the original (now-retracted) headline. Kept in the catalogue with a redirect to this bet.
  • Bet 06: async DiLoCo convergence smoke test. Validated the protocol; didn't make optimizer-quality claims.
  • Bet 30: stale-grad async (K=infinity). Different family of async training; the bet here doesn't apply directly.
  • Bet 44: byzantine-robust aggregation. The federation's robustness story is independent of K; this bet is unaffected.

Why it matters

Bet 62 is the strongest single piece of evidence that the bets harness is doing real research, not confirmation theatre. A claim was made (Bet 50). A reviewer pushed back. The disambiguating experiment was run. The original was retracted. That loop is what calibrated research looks like. Every other bet in this catalogue inherits credibility from the fact that this loop ran on the most consequential one — the one that, if left uncorrected, would have shaped the federation's training architecture around a misleading framing.

The retraction is more valuable than the original confirmation would have been. It demonstrates that the catalogue's wins are the wins that survived their own disambiguating follow-ups. The reader can trust the wins because the catalogue contains visible retractions. Without retractions, the catalogue is just selection bias dressed up as methodology. With them, it's calibration.