Bet 50 — DiLoCo K=100 vs K=1 on real text (RETRACTED)

This is the most consequential retraction in the catalogue. The original bet appeared to show that async DiLoCo with K=100 inner steps outperformed K=1 (synchronous SGD) on real-text training by 24%, a margin large enough to justify the bandwidth savings of K-step async training. The bet held that result for several days. Then a reviewer pointed out the experimental design didn't disambiguate "K=100 is genuinely better" from "K=1 is overfitting at the final step." The disambiguating follow-up (Bet 62) was run with proper controls — multiple seeds, early stopping, longer horizon — and the result inverted: K=1 best wins by 5–15% across all seeds. Bet 50 was reading overfitted final K=1 against less-overfit final K=100, not actually comparing the methods.

The retraction stays in the catalogue because removing it would hide the methodology failure. The corrected framing lives in Bet 62; this entry exists to make the retraction visible and to give readers a direct path to what the corrected analysis shows.

Background — what DiLoCo is and why K matters

DiLoCo (Distributed Low-Communication training) is a distributed-training method designed for the bandwidth-constrained regime. The core idea is to let each node run K local SGD steps between aggregations, rather than aggregating after every step. K=1 is standard synchronous SGD; K=100 means each node runs 100 inner steps before contributing its accumulated update to the federation.

The benefit of large K: bandwidth cost is divided by K. If the federation aggregates once per 100 local steps, the all-reduce traffic is 1/100th of what it would be at K=1. For the federation's deployment scenario (consumer-grade internet, asymmetric bandwidth, occasional flaky connections), this is a substantial operational win.
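
As a rough illustration of the 1/K claim, here is a back-of-the-envelope traffic model. The 2× ring all-reduce factor and dense fp32 updates are assumptions for illustration, not measurements of the federation's stack; the model size and step count are taken from elsewhere in this entry.

```python
# Back-of-the-envelope aggregation traffic (assumptions: dense fp32 updates,
# ring all-reduce moving roughly 2x the parameter payload per aggregation).
def aggregation_traffic_gb(n_params: int, total_steps: int, k: int,
                           bytes_per_param: int = 4) -> float:
    aggregations = total_steps // k
    return aggregations * 2 * n_params * bytes_per_param / 1e9

# 30M-parameter base model over 2,500 steps:
#   K=1   -> aggregation_traffic_gb(30_000_000, 2_500, 1)   ~= 600 GB
#   K=100 -> aggregation_traffic_gb(30_000_000, 2_500, 100) ~=   6 GB
```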

The cost of large K: each node's local trajectory diverges from the federation's average. By the time aggregation happens, the K-step accumulated update is a stale gradient — it was computed against parameters that have since drifted on other nodes. Larger K → more drift → potentially worse convergence.
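
A minimal sketch of the K-step pattern, assuming a PyTorch-style model and inner optimiser. The function names are illustrative, not the harness's API, and the outer step here is plain delta-averaging; DiLoCo proper applies an outer optimiser to the averaged delta.

```python
# Illustrative K-step local round (not the harness's API). Each node runs K
# inner steps, then contributes its accumulated parameter delta; at K=1 this
# reduces to aggregating after every step, i.e. synchronous SGD.
import torch

def local_round(model, inner_opt, data_iter, loss_fn, k):
    """Run k inner steps; return the accumulated parameter delta."""
    start = [p.detach().clone() for p in model.parameters()]
    for _ in range(k):
        inputs, targets = next(data_iter)
        loss = loss_fn(model(inputs), targets)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    return [p.detach() - p0 for p, p0 in zip(model.parameters(), start)]

def apply_aggregated_update(global_params, node_deltas):
    """Average the per-node deltas and fold them into the global parameters."""
    with torch.no_grad():
        for i, p in enumerate(global_params):
            p += torch.stack([d[i] for d in node_deltas]).mean(dim=0)
```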

The tradeoff is bandwidth-vs-loss. The original Bet 50 hypothesised the tradeoff was actually free — that K=100 is both lower bandwidth and lower loss. The retraction inverted this: K=100 is lower bandwidth at the cost of higher loss.

Original hypothesis

Async DiLoCo with K=100 inner steps outperforms K=1 (synchronous SGD) on real-text training by a margin large enough to justify the bandwidth savings.

Original criteria

  • STRICT: K=100 final loss < K=1 final loss by ≥ 10%.
  • LENIENT: K=100 final loss within 5% of K=1 final loss.
  • CATASTROPHIC: K=100 final loss > K=1 by > 50% (would falsify async DiLoCo).

Original result — STRICT PASS (single seed, no early stopping)

K=100 final loss was 24% below K=1 final loss after a fixed number of training steps. The headline read: "K=100 DiLoCo outperforms K=1 by 24%."

The implication was that the federation's training default should be K=100: it gets the bandwidth savings and the better loss. Win-win. The federation's training-time bandwidth budget would be cut by 100× without quality cost.

This result was load-bearing for the federated-training story. If K=100 just worked, the federation could pretty straightforwardly scale to large numbers of low-bandwidth contributors. The whole "open-participation training" thesis hinged on the K-step async pattern being viable.

What was wrong — the reviewer pushback

A reviewer asked the right question: was the 24% margin actually a comparison of the two methods, or was it a comparison of K=1's overfitting with K=100's lesser overfitting?

The original experiment ran a fixed number of training steps and compared final losses. At a fixed step count, K=1 has done many more parameter updates than K=100. K=1 updates parameters every step; K=100 updates parameters once per 100 steps (because it's accumulating). For a fixed total step count, K=1 has had 100× more "shots" at fitting the training distribution, and on a finite training corpus, more shots means more overfitting.

The 24% margin could mean any of these:

  1. K=100 is genuinely a better optimiser for this loss surface.
  2. K=1 is overfitting at the final step; its loss has bottomed out and started rising; K=100 hasn't yet hit its overfit point.
  3. Some combination of (1) and (2), with the relative contribution unclear.

Single-seed, fixed-step-count, no-early-stopping experiments cannot distinguish between these. The original Bet 50 was a single-seed run with a fixed step count, no early stopping, no held-out validation set. The 24% margin was a real number, but it didn't establish what was claimed.

The disambiguating follow-up — Bet 62

Bet 62 ran the corrected experiment:

  • 3 seeds × 2 K-values (K=1, K=100), 2,500 training steps per run, with a held-out validation set.
  • Early stopping: at each evaluation interval (every 50 steps), measure validation loss; track the minimum.
  • Compare K=1 best loss (the minimum reached during training) vs K=100 best loss.

This setup eliminates the overfitting confound. Each method gets to be evaluated at its best point, not at an arbitrary step count where one method may be overfit and another may not.
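
A sketch of that protocol: evaluate on held-out data every 50 steps, track the minimum, and score each method at its own best point. Here train_step and validation_loss are hypothetical stand-ins for the harness's training and evaluation callables.

```python
# Sketch of the Bet 62 protocol (stand-in callables, not the harness's API):
# run a fixed training budget, evaluate on the held-out set every 50 steps,
# and keep the minimum validation loss seen at any evaluation point.
EVAL_INTERVAL = 50
TOTAL_STEPS = 2_500

def best_validation_loss(train_step, validation_loss) -> float:
    best = float("inf")
    for step in range(1, TOTAL_STEPS + 1):
        train_step()
        if step % EVAL_INTERVAL == 0:
            best = min(best, validation_loss())
    return best

# Per seed: compute best_validation_loss for the K=1 run and the K=100 run,
# then compare best against best, never final-step loss against final-step loss.
```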

Result: K=1 best wins by 5–15% across all seeds. K=1 always overfits (final loss − best loss = +1.27 to +1.32 nats). Bet 50 was reading overfitted final K=1 against less-overfit final K=100, not actually comparing the methods.

The seed-by-seed numbers from Bet 62:

| Seed | K=1 best loss | K=100 best loss | K=1 advantage |
|---|---|---|---|
| 1 | 5.42 | 5.71 | 5.4% |
| 2 | 5.38 | 5.85 | 8.7% |
| 3 | 5.41 | 6.20 | 14.6% |

K=1 wins on every seed by margins of 5–15%. The result is consistent across seeds; it's not a single-seed accident. K=1 is the better optimiser at this scale.
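
For reference, the advantage column is consistent with the relative margin of K=100's best loss over K=1's best loss. A quick check on the table's numbers; the formula is inferred from the values, not stated in the bet itself.

```python
# Reproduce the "K=1 advantage" column, assuming
# advantage = (K=100 best loss - K=1 best loss) / K=1 best loss.
rows = [(1, 5.42, 5.71), (2, 5.38, 5.85), (3, 5.41, 6.20)]
for seed, k1_best, k100_best in rows:
    print(f"seed {seed}: {(k100_best - k1_best) / k1_best:.1%}")
# seed 1: 5.4%, seed 2: 8.7%, seed 3: 14.6%
```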

Corrected framing

The corrected framing for the K-step DiLoCo decision:

  • K=1 is the better optimiser at this scale, end of story. There's no free lunch where K=100 happens to be both lower bandwidth and lower loss.
  • K=100 has lower bandwidth cost (1/100th of the all-reduce traffic) at the cost of slightly worse loss (5–15% margin against K=1 best).
  • The K-value tradeoff is bandwidth-vs-loss, not "K=100 is just better." The federation has to decide which it values more on a per-deployment basis.
  • At what scale and what bandwidth budget K=100 starts to dominate is an Open Question. Plausibly, at very large model scale (1B+) and very low bandwidth budget (mobile-network constrained), K=100 becomes the right default. At the federation's current scale (30M base + per-user adapters) on consumer broadband, K=1 is the right default.

The federation's training default is now K=1 with periodic aggregation, not K=100 async. The bandwidth cost is higher but the convergence quality is better. For deployment scenarios where bandwidth is severely constrained (mobile-only contributors, intermittent connectivity), the catalogue still considers K=10 or K=100 a reasonable choice — but as a constrained-bandwidth choice, not a just-better choice.

Why this entry stays in the catalogue

The methodology requires retracted bets to stay linked. Removing Bet 50 from the catalogue would hide the retraction; the corrected result would still appear in Bet 62, but the original error would be invisible. A reader who later proposes "let's run K=100 because it's better" would be unaware that the catalogue once said this and was wrong.

Keeping Bet 50 with a redirect to Bet 62 makes the retraction visible. It also documents the methodology failure (single-seed, no-early-stopping comparisons mislead) so future bets can avoid it.

The catalogue's discipline is that retractions are first-class citizens. A bet that was wrong is a feature of the catalogue's empirical history, not a bug to be hidden. The catalogue's value comes from its claims being trustworthy at face value, and a key part of trustworthiness is making it clear which claims have been corrected and how.

Lessons for future bets

Three concrete methodological lessons from this retraction, encoded in the bets harness:

  1. Always run multiple seeds. Single-seed results are not interpretable. Bet 43's 5-seed × 3-eval-text replication standard exists because of this lesson and the Bet 50/62 episode.

  2. Always use early stopping with a held-out validation set. Comparing methods at a fixed step count is comparing methods at different points on their convergence curves, which can flip the comparison.

  3. Be explicit about what's being compared. "K=100 outperforms K=1" is ambiguous — at what step count? With what stopping rule? The Bet 50 framing was sloppy; the Bet 62 framing is precise ("K=1 best vs K=100 best on held-out validation, across 3 seeds").

These lessons are now baked into the bets harness. The harness's _common.py provides a run_with_early_stopping helper; bets that compare optimisers must use it. The post-hoc audit catches any bet that reports a comparison without these controls. Bet 50 / 62 is the cautionary tale that justifies the harness rules.

Run commands

The original Bet 50 still runs (preserved for historical reproducibility):

PYTHONPATH=src python -m experiments.bets.50_real_text_diloco

The disambiguating follow-up:

PYTHONPATH=src python -m experiments.bets.62_diloco_k_with_controls

Both produce JSON results with their respective hypotheses and outcomes. Reading them in sequence is the recommended way to understand what the federation knows about K-step async training: original (apparent win), retraction (controlled inversion), corrected framing (bandwidth-vs-loss tradeoff).

Related bets

  • Bet 62: the disambiguating follow-up. The corrected framing lives there.
  • Bet 43: 15/15 replication standard. Came from this retraction's methodology lesson.
  • Bet 06: original DiLoCo convergence smoke. Not retracted; smaller-scale validation that DiLoCo runs without diverging.
  • Bet 30: stale-grad async at K=infinity. The extreme version of the K-step tradeoff.

Why it matters

Bet 50 is the most consequential retraction in the catalogue because the original framing was load-bearing for the federated-training story. If K=100 had genuinely been a free win, the federation could have committed to K-step async as the universal training pattern; the bandwidth story would have been comfortable. With the retraction, the federation has to be honest about the bandwidth-vs-loss tradeoff and pick a K-value per deployment.

The corrected framing is more nuanced and more honest. The methodology held: the retraction that Bet 62 triggered is more valuable than a Bet 50 confirmation would have been, because it forced the federation to confront a tradeoff rather than skating past it.

The retraction is also a model for how the catalogue handles error. The original bet's authors didn't try to defend the original framing; they ran the disambiguating experiment, accepted the inversion, and updated the catalogue. The retraction is the catalogue's word, not a private correction.

For readers scanning the catalogue: when you see a bet's hypothesis confirmed, look for the controls. The presence of multi-seed early-stopping in the methodology is the signal that the result has been disciplined; the absence is a yellow flag. Bet 50/62 is what disciplined the methodology; subsequent bets are run under that discipline by default.