The Bets Methodology

A standard ML paper claims a finding and accumulates evidence in its favour. A bet declares the conditions under which the claim would be wrong and reports the outcome regardless. The bets harness in this repository — 63 falsifiable experiments at the time of writing — is built on this single inversion of the conventional research framing, plus three commitments that operationalise the inversion.

The methodology exists because the project's central thesis (community-owned distributed LLM inference) rests on many empirical claims that, if wrong, would invalidate the deployment story. A research artifact optimised for appearing right would be ill-suited; the project needs one optimised for being right, even when "being right" means publishing falsifications and retractions. The bets methodology is the answer to that constraint.

Pre-registered criteria

Every bet records its STRICT / LENIENT / CATASTROPHIC thresholds in the module docstring before the experiment runs. The criteria live in the same git commit as the run command, so moving the goalposts after seeing results is visible in the git history.

A representative module header:

"""
Bet 49 — Adapter shootout (norm vs LoRA-r4 vs full FT).

Hypothesis: norm-only adapter Pareto-dominates LoRA-r4 and full
fine-tune across diverse users.

STRICT: norm-only median ppl < LoRA median AND < full-FT median
        on held-out same-author text for ≥ 3 of 3 users.
LENIENT: STRICT for 2 of 3 users.
CATASTROPHIC: norm-only worse on ≥ 2 of 3 users.

Run: PYTHONPATH=src python -m experiments.bets.49_adapter_shootout
"""

The result file in experiments/bets/results/49_*.json records the actual numbers and the pass/fail flag — computed mechanically from the criteria, not interpreted by the experimenter. The roll-up script regenerates SUMMARY.md from the result files; nothing in the summary is hand-written.

This structure has a single load-bearing property: the path from "bet runs" to "summary updates" goes through no editorial step. The experimenter cannot massage a borderline result into a pass; the script reads the JSON and applies the criteria. If a bet fails, the summary says it fails. If a bet was framed too generously and the result barely scrapes a pass, the summary still records the result, but a later reviewer can see the criteria in git log and assess whether the framing was reasonable.
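
A rough sketch of that path, under assumed names: the results directory is real, but the field names (bet, title, tier), the SUMMARY.md location, and the table layout below are illustrative, not the repository's actual schema. The point is that the summary is a pure function of the result files.

# Illustrative roll-up: SUMMARY.md is regenerated from the result JSONs alone,
# so no hand-written prose enters the summary. Field names and paths beyond
# experiments/bets/results/ are assumptions.
import json
from pathlib import Path

RESULTS = Path("experiments/bets/results")
SUMMARY = Path("experiments/bets/SUMMARY.md")

def rollup():
    rows = []
    for path in sorted(RESULTS.glob("*.json")):
        result = json.loads(path.read_text())
        # The tier flag (STRICT / LENIENT / CATASTROPHIC / FAIL) was written by
        # the bet itself when it ran, computed from its pre-registered criteria.
        rows.append(f"| {result['bet']} | {result['title']} | {result['tier']} |")
    header = "| Bet | Title | Outcome |\n|---|---|---|\n"
    SUMMARY.write_text(header + "\n".join(rows) + "\n")

if __name__ == "__main__":
    rollup()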

The triple-tier framing (STRICT / LENIENT / CATASTROPHIC) exists for a reason. STRICT is the bet's headline claim — the result the experimenter would be excited to report. LENIENT is the threshold for "this is interesting but the headline framing is too strong." CATASTROPHIC is the threshold for "this would falsify the broader thesis we're testing." The three tiers force the experimenter to think about what the bet's failure modes look like, which is half the discipline of running good experiments.

Bet 53 (adapter quantisation) is a clean example of the triple-tier framing earning its keep. The bet passed LENIENT but failed STRICT (int8 stayed within 1.5× of raw on 2 of 3 users, not 3 of 3). It also triggered CATASTROPHIC (int8 was worse than no-adapter on the scientist user). Without the CATASTROPHIC tier, the bet might have been written up as "borderline pass on LENIENT" and shipped int8 as a default; with the CATASTROPHIC trigger, the catalogue knows the deployment hazard is real and does not blind-quantise per-user adapters.

Falsifications stay in the catalogue

When a bet fails, it gets a FAIL tag and stays linked from the index. The project retires claims publicly when later evidence undermines them. The Honest Falsifications chapter lists every entry that didn't survive — the most consequential being Bet 50's K=100 headline, retracted by Bet 62.

The harness treats falsification as the cheapest form of progress. A bet that fails strikes a hypothesis off the list at one experiment's cost. A bet that passes adds evidence; a bet that retracts an earlier claim adds calibration. Both are valuable; both stay visible.

The discipline matters because the obvious alternative — quietly removing failed bets from the catalogue — would corrupt the entire methodology. Removing Bet 31 (linear weight-soup falsified) would let a future contributor re-propose it without the prior falsification visible. Removing Bet 38 (expert collapse falsified) would invite a future "compress the MoE by averaging experts" optimisation suggestion that the catalogue has already ruled out. Each falsification is load-bearing because it answers a recurring question; removing it leaves the question open.

The current catalogue contains:

  • Wins that survived their criteria: most of the federation's design defaults come from these.
  • Falsifications that ruled out specific approaches: Bet 31 (model-soup), Bet 38 (expert collapse), Bet 40 (layer-skip), Bet 55 (logit ensembling for unknown users). Each closes a tempting wrong direction.
  • Retractions of earlier wins by later evidence: Bet 50's K=100 DiLoCo headline retracted by Bet 62's controlled comparison. The original bet stays linked with a redirect to the retraction.
  • Reframings where the headline survived but the interpretation tightened: Bet 37's norm-only fine-tune kept the result but reframed it (Bet 60 showed most of the headline ppl drop is regularisation, not personalisation; Bet 61 isolated the actual personalisation signal at 5–29% margin).

Each of these categories is a different relationship between bets and their later evidence. The catalogue treats all four as first-class — the bet entry, the follow-up entry, and the relationship between them are visible in the chapter structure.

Negative controls before deployment

Any positive personalisation claim must clear a noise floor. Bet 60 trained the same norm-only adapter on random uniform tokens (no language signal at all) and compared the result to real-text training. The real-text advantage over that noise floor is 1.10× to 1.36× across users — real, but smaller than the original Bet 37 framing would suggest. For the programmer user, the random-trained adapter beat the real-trained one on held-out text. Bet 37's writeup was tightened accordingly.

Bet 61 ran the disambiguating follow-up: train one adapter per user, then evaluate every adapter on every user's held-out text. The own adapter wins by a 5–29% margin, depending on the user. That confusion matrix — not the raw norm-only-vs-full-FT margin — is what makes the personalisation claim credible.
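
A minimal sketch of that confusion-matrix check, with placeholder user names and perplexities; the shape of the analysis is the point, not the numbers, and this is not the harness's actual code.

# Bet-61-style check, sketched: every per-user adapter is scored on every
# user's held-out text, and the personalisation claim rests on the diagonal
# of the matrix winning each column.
def own_adapter_margins(ppl):
    """ppl[adapter_user][eval_user] -> perplexity on eval_user's held-out text.
    Returns each user's margin of their own adapter over the best rival
    adapter (positive means the own adapter wins)."""
    margins = {}
    for user in ppl:
        own = ppl[user][user]
        best_rival = min(ppl[rival][user] for rival in ppl if rival != user)
        margins[user] = (best_rival - own) / best_rival
    return margins

# Placeholder matrix, purely to show the shape of the analysis.
ppl = {
    "user_a": {"user_a": 3.1, "user_b": 3.6, "user_c": 3.5},
    "user_b": {"user_a": 3.4, "user_b": 3.0, "user_c": 3.7},
    "user_c": {"user_a": 3.5, "user_b": 3.8, "user_c": 3.2},
}
print(own_adapter_margins(ppl))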

The negative-control discipline is a methodological commitment the catalogue discovered the hard way. The pattern is:

  1. Initial bet shows large positive effect (Bet 37: 110× ppl drop).
  2. Methodology not initially scrupulous about controls.
  3. Reviewer or follow-up bet runs the negative control.
  4. Result: most of the effect is regularisation; the actual personalisation signal is much smaller (5–29%).
  5. Earlier bet's framing tightens; the catalogue's per-user adapter claim becomes calibrated.

This pattern recurs. The catalogue's discipline is now to require negative controls for any positive claim before the claim is treated as load-bearing. The harness's _common.py provides scaffolding for noise-floor controls and confusion-matrix analyses; new bets that make personalisation claims are expected to use them.
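
A hedged sketch of what a noise-floor control amounts to; the callables train_fn and eval_fn stand in for the harness's real training and perplexity-evaluation code, and this is not _common.py's actual interface.

# Noise-floor control in the spirit of Bet 60: train the same adapter
# configuration on random uniform tokens and on real user text, then report
# the real-text advantage over the noise floor rather than the raw drop from
# the unadapted baseline.
import random

def noise_floor_margin(train_fn, eval_fn, real_tokens, heldout_tokens, vocab_size, seed=0):
    rng = random.Random(seed)
    random_tokens = [rng.randrange(vocab_size) for _ in range(len(real_tokens))]
    real_ppl = eval_fn(train_fn(real_tokens), heldout_tokens)
    noise_ppl = eval_fn(train_fn(random_tokens), heldout_tokens)
    # A ratio near 1.0 means the apparent personalisation gain is mostly
    # regularisation; Bet 60 measured 1.10x to 1.36x across users.
    return noise_ppl / real_ppl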

The cost of negative controls is real — they take additional experimenter time and compute. The benefit is that claims that survive are calibrated. A federation built on uncalibrated claims would deploy and fail; a federation built on calibrated claims may deploy at a slightly smaller scale than the uncalibrated marketing version, but it actually works.

What we do not claim

The bets harness operates at the 30M-parameter scale on short held-out texts. The catalogue does not claim:

  • That any primitive validated here will hold at 1B+ parameters on production data without further work. The Open Questions chapter makes this explicit. Until the 1B+ adapter shootout runs, the deployment story has a known scoping limit.

  • That K=100 DiLoCo or any other federated training recipe replaces synchronous SGD at frontier scale. Bet 62 retracted the original Bet 50 headline. The K-value tradeoff is bandwidth-vs-loss, not "K=100 is just better." The right K-value at scale is open work.

  • That throttle-invariance validates phone-class deployment. Bet 45 measures scheduler correctness; phone deployment requires on-device thermal, memory, and process-kill testing that the harness can't perform.

  • That an in-process FastAPI test harness substitutes for measured cross-ISP federation throughput. This is the most consequential open question: the harness validates protocol correctness, but the operational question of WAN behaviour remains unmeasured.

These are open questions, not solved problems. The Open Questions chapter lists them explicitly, along with the gates that would need to be cleared for each one to close.

How retraction works in practice

The Bet 50 / Bet 62 episode is the catalogue's clearest example of the methodology under pressure. The original Bet 50 claimed K=100 DiLoCo outperforms K=1 by 24%. The bet was a single-seed run with a fixed step count and no early stopping. A reviewer pushed back: was this K=100 winning, or K=1 overfitting at the final step?

The disambiguating follow-up (Bet 62) ran with proper controls — 3 seeds × 2 K-values, 2,500 steps each, with early stopping on a held-out validation set. The result inverted: K=1's best early-stopped loss wins by 5–15% across all seeds. K=1 always overfits (final loss 1.27 to 1.32 nats above its best), so Bet 50 had compared an overfitted final-step K=1 against a less-overfit final-step K=100 rather than comparing the methods at their best.
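
A loose sketch of that scoring, under stated assumptions: run stands in for the actual training loop and returns the validation losses logged over one (seed, K) configuration's 2,500-step budget; the bet's real code is not reproduced here.

# Bet-62-style controlled comparison, sketched. Each configuration is scored
# by its best validation loss over the run (early stopping), not by its
# final-step loss; that distinction is exactly what Bet 50 missed.
from statistics import median

def compare_k_values(run, seeds=(0, 1, 2), k_values=(1, 100)):
    """run(seed, k) -> list of validation losses logged during training."""
    best_per_seed = {k: [min(run(seed, k)) for seed in seeds] for k in k_values}
    return {k: median(losses) for k, losses in best_per_seed.items()}

Scored this way, the K=1 configuration wins on every seed, which is the inversion Bet 62 reports.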

The catalogue's response:

  1. Bet 50 stayed. The original bet entry still exists, with its run command, its result, and its hypothesis. Removing it would hide the retraction.
  2. Bet 62 was added. The disambiguating follow-up has its own entry with the controlled experiment.
  3. Bet 50's writeup was updated with a clear redirect to Bet 62 and the corrected framing.
  4. The federation's training default changed. K=1 with periodic aggregation became the new default, replacing the K=100 async pattern that Bet 50 had implied.
  5. The methodology tightened. Multi-seed and early-stopping became required for any optimiser comparison; the harness's _common.py was updated.

The whole episode took about a week. It's a load-bearing example because it shows the methodology working under pressure: a wrong claim was caught and corrected mechanically, the correction was recorded publicly, and the corrected framing replaced the wrong one in the federation's design.

Why this works

The methodology has one load-bearing property: honest retraction is cheap if you're willing to do it, and the system is designed to make you willing. Pre-registered criteria mean the retraction is mechanical — the JSON file already knows whether the result cleared the threshold. The summary is regenerated from those JSON files. There is no place in the workflow where a wishful framing can survive contact with the data.

That doesn't make the methodology bulletproof. There are failure modes the catalogue is aware of:

  • Pre-registered criteria can be too lenient. A bet whose pre-registered criterion the experimenter is already confident will pass is not really a test. The catalogue's discipline requires criteria that are non-trivial — the experimenter should be genuinely uncertain whether the bet will pass.

  • Replication is bounded. The catalogue has 63 bets, but most run on a single laptop with a single experimenter. Cross-laboratory replication is not part of the methodology. A bet that replicates 15/15 within one experimenter's environment is more reliable than a single-seed bet, but less reliable than a multi-laboratory replication.

  • Scope creep in claims. Even with pre-registered criteria, the writeup language can drift toward broader claims than the bet supports. The catalogue's discipline includes reviewing writeups for "does this language match what the bet measured." Bet 45 (throttle-invariance) is the clearest example of this discipline — the original framing claimed phone-deployment validation, the corrected framing claimed scheduler-correctness only.

  • The harness can't validate operational behaviour. All 63 bets use a Python test harness; none use real WAN, real phones, or real fleet deployments. The methodology is a tool for protocol-level validation; the operational-level open questions need different tooling.

The methodology is calibrated to the scope it works for: protocol-level claims at developer-machine scale. Operational-level claims need other validation (real-WAN measurement, on-device testing, fleet-scale pilots). The catalogue's discipline is to be honest about which claims fall in which category.

What this enables

The catalogue's discipline is the foundation for everything else the federation aims to do:

  • Trustworthy claims. Other researchers can rely on the catalogue's results because the methodology is visible and the retractions are public.
  • Auditable design decisions. Why does the federation use coordinate-wise median for byzantine aggregation? Bet 44. Why is the per-user adapter norm-only at 9 KB? Bet 49. Why is K=1 the training default? Bet 62. Each design choice has a corresponding bet in the catalogue.
  • Onboarding for contributors. A new contributor proposing "let's average specialists' weights to combine them" can be pointed at Bet 31 immediately. The catalogue answers recurring questions cheaply.
  • Transparency for institutional partners. Regulated deployment partners (healthcare, education, finance) can audit the federation's evidence base by reading the catalogue. The federation's transparency story (Bet 18) operates at the per-token level; the catalogue's transparency story operates at the project level.

That's a different kind of value than a typical ML paper produces. A paper makes a claim; a catalogue makes a discipline visible. The federation aims to be deployable by people who don't trust the federation's authors on faith; the catalogue is the artifact that makes that trust possible.

What this means for readers

Treat the catalogue as a calibrated source. When a bet says STRICT PASS, it means exactly what the criteria say. When a bet says FALSIFIED or RETRACTED, the original direction has been ruled out; the corrected version (if any) is linked. When the open-questions chapter lists a question, it means the federation hasn't measured the answer yet.

The catalogue's strongest claims are scoped to the 30M-parameter, LAN-only, clean-prose regime. Broader claims (1B+ scale, real-WAN throughput, phone deployment, production at fleet scale) are explicit open questions. A reader who wants to deploy the federation should read both the catalogue's wins and its open questions; the wins describe what's been validated, the open questions describe what hasn't.

That's the methodology in operation: claims you can trust, gaps you can verify, retractions you can read, falsifications you can replicate. The catalogue is a research artifact that's been built to be calibrated rather than convincing. The two are not the same thing.