Methodology

SharedLLM research is organised as a sequence of falsifiable bets. Each bet declares its hypothesis, its strict / lenient / catastrophic criteria, and the experimental procedure before the experiment runs. The result either supports or falsifies the claim, and the framing in our public writeups must match what the data shows — not what we hoped it would show.

Why bets, not papers

A standard ML paper claims a finding and accumulates evidence in its favour. A bet declares the conditions under which the claim would be wrong, and reports the outcome regardless. Three things follow from that:

  • Pre-registered criteria. We write the strict / lenient / catastrophic thresholds before running the experiment. Moving the goalposts after seeing results is visible in git history.
  • Falsifications stay in the catalogue. When a bet fails, it gets a FAIL tag and stays linked from the index. We retire claims publicly when later evidence undermines them. See Bet 31, Bet 38, Bet 55, and Bet 62 for examples — the last of which retracted the headline of an earlier bet.
  • Negative controls before deployment. Any positive personalization claim must clear a noise floor — training on random tokens, not real text. See Bet 60.
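
Bet 60 defines the actual control and thresholds; the sketch below only illustrates the shape of the check. The names random_token_control, passes_noise_floor, and the margin parameter are hypothetical, not the harness's real API.

import random

def random_token_control(vocab_size: int, length: int, seed: int = 0) -> list[int]:
    # Hypothetical helper: a control sequence of uniformly random token ids,
    # used in place of the user's real text.
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(length)]

def passes_noise_floor(base_loss: float,
                       real_text_loss: float,
                       random_token_loss: float,
                       margin: float = 0.0) -> bool:
    # A personalization claim only counts if the improvement from real text
    # exceeds the improvement from the random-token control by `margin`.
    real_gain = base_loss - real_text_loss          # gain from training on real text
    control_gain = base_loss - random_token_loss    # gain from the noise-floor control
    return real_gain > control_gain + margin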

The shape of a bet

Every bet file in experiments/bets/ follows the same structure:

  1. Module docstring naming the hypothesis, listing the strict / lenient / catastrophic criteria, and giving the run command.
  2. Constants block for hyperparameters, eval texts, seeds. Visible at the top so the experimental setup is auditable in one screen.
  3. Reusable harness imports — every bet uses experiments/bets/_common.py for registry setup, specialist loading, and result writing. Consistency over cleverness.
  4. A main() that runs the experiment, computes the pass/fail flags, and writes a JSON payload to results/NN_*.json.
  5. After the run, 00_rollup.py regenerates SUMMARY.md and SUMMARY.json from the result files. The summary is the entry point for anyone reviewing the harness.
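
A condensed sketch of that structure, with placeholder values and hypothetical result fields (the real helpers in experiments/bets/_common.py and the actual payload schema may differ):

"""Bet NN: <one-sentence hypothesis>.

Strict:       <threshold written before the run>
Lenient:      <weaker threshold>
Catastrophic: <outcome that would falsify the claim>

Run: PYTHONPATH=src python -m experiments.bets.NN_<name>
"""
import json
from pathlib import Path

# Constants block: the whole setup should be auditable in one screen.
SEED = 0
LEARNING_RATE = 1e-4
EVAL_TEXTS = ["<short held-out text>"]

# The shared harness lives in experiments/bets/_common.py (registry setup,
# specialist loading, result writing). Commented out here because the exact
# names are not shown in this chapter.
# from experiments.bets._common import ...

def main() -> None:
    # 1. Run the experiment (training and evaluation elided in this sketch).
    metrics = {"baseline": 0.0, "treatment": 0.0}  # placeholder numbers
    # 2. Compute pass/fail flags against the pre-registered criteria.
    flags = {"strict": False, "lenient": False, "catastrophic": False}
    # 3. Write the JSON payload that 00_rollup.py consumes.
    out = Path("experiments/bets/results/NN_example.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"metrics": metrics, "flags": flags}, indent=2))

if __name__ == "__main__":
    main()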

What we do not claim

The bets harness operates at the 30M-parameter scale on short held-out texts. We do not claim:

  • that any primitive validated here will hold at 1B+ parameters on production data without further work;
  • that K=100 DiLoCo or any other federated training recipe replaces synchronous SGD at frontier scale;
  • that the throttle-invariance result validates phone-class deployment; that still requires on-device testing for thermal limits, memory pressure, and OS process kills;
  • that an in-process FastAPI test harness substitutes for measured cross-ISP federation throughput.

These are open questions, not solved problems. They are listed in the Open Questions chapter.

Reproducing a bet

Every bet is a runnable Python module. From the project root:

PYTHONPATH=src python -m experiments.bets.NN_<name>

The result lands in experiments/bets/results/NN_*.json. The roll-up then regenerates the summary table:

PYTHONPATH=src python -m experiments.bets.00_rollup
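
Because each result is plain JSON, a reviewer can also inspect outcomes directly. The snippet below is only a sketch: the field names it reads (flags, strict, lenient) follow the skeleton shown earlier in this chapter, not necessarily the actual payload schema, and SUMMARY.md remains the canonical view.

import json
from pathlib import Path

# Quick pass/fail view straight from the raw result files, without the roll-up.
for path in sorted(Path("experiments/bets/results").glob("*.json")):
    flags = json.loads(path.read_text()).get("flags", {})
    verdict = "strict" if flags.get("strict") else "lenient" if flags.get("lenient") else "fail"
    print(f"{path.stem}: {verdict}")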