Bet 45 — Throttle-invariance
The federated scheduler must produce identical inference output regardless of how slow individual workers are. A pipeline that includes a phone-class worker running 10× slower than the desktop workers must produce the same token sequence as a pipeline of all-desktop workers. If different worker speeds produced different outputs, the federation would have a non-deterministic inference pipeline — which is a correctness bug, not an optimisation tradeoff.
The bet validates that the scheduler is throttle-invariant: bit-exact identical token sequences across no-throttle, steady-throttle, and ramp-throttle conditions, over 100 inference runs, zero divergences. This is a clean STRICT PASS for scheduler correctness.
The bet's framing was tightened after reviewer pushback. The original framing read "throttle-invariance validates phone-class deployment." That overstates what the bet measures. The bet validates scheduler correctness under heterogeneous worker speeds — a necessary condition for phone deployment, but not sufficient evidence by itself. The corrected framing makes this distinction explicit and points to the Open Questions chapter for what actually validating phone deployment requires.
Background — what scheduler-determinism requires
The federation's inference pipeline has multiple workers contributing to a single output token. The contributions can be at different points in the model (e.g. each worker handles a subset of layers, or a subset of experts in an MoE), or at different points in time (e.g. speculative-decoding workers running ahead of the verification path).
In all cases, the scheduler has to wait for the right contributions before producing the next token. The risk of non-determinism is that the scheduler might "race" — accept whichever contribution arrives first, even if that contribution is from a different worker than the one whose contribution should have been used.
A non-deterministic scheduler produces different outputs depending on the per-worker latencies. In production, this would mean:
- Re-running the same query gives different answers, depending on the federation's load profile at that moment.
- Reproducing a bug becomes impossible (the bug only manifests under a specific latency profile).
- Pipeline auditing becomes meaningless (the audit log shows the contributions for one run; the next run might have different contributions for the same query).
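The race failure mode can be made concrete with a toy aggregator — a hypothetical sketch, not the federation's actual code. A "first response wins" aggregator produces different outputs under different arrival orders; a fixed-order aggregator does not:

```python
# Toy illustration (hypothetical, not the federation's implementation).

def first_response_aggregate(arrivals):
    """Racy: accept whichever worker's contribution arrives first."""
    worker_id, contribution = arrivals[0]
    return contribution

def fixed_order_aggregate(arrivals):
    """Deterministic: wait for all contributions, combine in worker-id order."""
    by_worker = dict(arrivals)
    return tuple(by_worker[w] for w in sorted(by_worker))

# Same contributions, two different arrival orders (fast run vs slow run).
run_a = [(0, "logits-A"), (1, "logits-B")]
run_b = [(1, "logits-B"), (0, "logits-A")]

# The racy aggregator diverges across runs; the fixed-order one does not.
assert first_response_aggregate(run_a) != first_response_aggregate(run_b)
assert fixed_order_aggregate(run_a) == fixed_order_aggregate(run_b)
```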
Determinism is a precondition for the federation's transparency story (Bet 18 glass-box). The audit log is only useful if the same query produces the same output. If non-determinism crept in via the scheduler, every audit log would be a one-off rather than a reproducible record.
Hypothesis
Tokens generated under three throttle profiles are bit-exact identical:
- No-throttle: all workers run at full speed.
- Steady-throttle: every worker uniformly slowed by 5× (artificial delay injected into the inference path).
- Ramp-throttle: linearly increasing per-worker latency (worker 0 at 1×, worker 1 at 1.5×, worker 2 at 2× ... worker 7 at 4.5×).
Bit-exact identity across all three profiles for 100 inference runs.
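The three profiles reduce to per-worker slowdown multipliers. A minimal sketch (function name is an assumption, not the harness's API; the ramp follows 1 + 0.5 × worker index):

```python
def throttle_profile(name, n_workers=8):
    """Per-worker slowdown multipliers for each throttle profile (sketch;
    the name `throttle_profile` is hypothetical)."""
    if name == "none":
        return [1.0] * n_workers                          # full speed
    if name == "steady":
        return [5.0] * n_workers                          # uniform 5x slowdown
    if name == "ramp":
        return [1.0 + 0.5 * i for i in range(n_workers)]  # 1x .. 4.5x
    raise ValueError(f"unknown profile: {name}")

print(throttle_profile("ramp"))  # [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
```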
Pre-registered criteria
- STRICT: bit-exact token equality across all three throttle profiles for 100 inference runs.
- LENIENT: sampled-greedy equivalence across profiles (same tokens chosen by greedy decoding, but possibly different log-probs).
- CATASTROPHIC: any divergence in token sequence (would indicate a race condition in scheduling).
The STRICT bar is bit-exact, not just sampled-greedy. Bit-exact means the per-token log-probabilities are identical, not just the chosen tokens. This is the strongest correctness guarantee the federation can provide — every floating-point output is the same, regardless of timing.
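The STRICT/LENIENT distinction can be expressed as a pair of comparison predicates — a sketch with hypothetical field names, not the bet's actual harness:

```python
def strict_match(run_a, run_b):
    """Bit-exact: every token id AND every log-prob identical.
    Exact float equality is deliberate -- no tolerance."""
    return (run_a["tokens"] == run_b["tokens"]
            and run_a["logprobs"] == run_b["logprobs"])

def lenient_match(run_a, run_b):
    """Sampled-greedy equivalence: same chosen tokens, log-probs may differ."""
    return run_a["tokens"] == run_b["tokens"]

# A tiny divergence in log-probs passes LENIENT but fails STRICT.
a = {"tokens": [5, 9], "logprobs": [-0.10, -1.32]}
b = {"tokens": [5, 9], "logprobs": [-0.10, -1.32000001]}
assert lenient_match(a, b) and not strict_match(a, b)
assert strict_match(a, a)
```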
Setup
- 8-worker federation, each worker handling a slice of FractalMoE 30M layers.
- 100 distinct inference prompts, each generating 50 tokens.
- For each prompt, generate under all three throttle profiles. Compare token sequences and per-token log-probs.
- The throttling is implemented as artificial sleep injected into the worker's forward path; the actual computation is unchanged, only the timing.
- Greedy decoding (no temperature sampling) — this isolates scheduler determinism from sampling-noise determinism.
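The sleep-injection approach from the setup can be sketched as a thin wrapper — class and parameter names are assumptions; the real injection point in the federation's code may differ:

```python
import time

class ThrottledWorker:
    """Wraps a worker's forward pass with an artificial delay (sketch).
    Only the timing changes; the computation is untouched."""
    def __init__(self, worker, multiplier, base_latency_s=0.01):
        self.worker = worker
        # Extra sleep so total stage time is roughly multiplier x base latency.
        self.extra_delay_s = (multiplier - 1.0) * base_latency_s

    def forward(self, x):
        time.sleep(self.extra_delay_s)   # timing changes...
        return self.worker.forward(x)    # ...the computation does not

# Usage: identical inputs give identical outputs regardless of multiplier.
class Echo:
    def forward(self, x):
        return x

fast = ThrottledWorker(Echo(), multiplier=1.0)
slow = ThrottledWorker(Echo(), multiplier=5.0, base_latency_s=0.001)
assert fast.forward([1, 2]) == slow.forward([1, 2])
```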
Result — STRICT PASS
Bit-exact tokens across no-throttle, steady-throttle, and ramp-throttle. Across 100 inference runs × 3 throttle profiles × 50 tokens per run = 15,000 token comparisons, 0 divergences. Per-token log-probabilities are also bit-exact across all three profiles.
The scheduler is throttle-invariant. The federation's inference pipeline is deterministic under heterogeneous worker speeds.
How the scheduler achieves throttle-invariance
The scheduler implementation uses content-addressable wait — for each pipeline stage, the scheduler waits for the specific contribution from the specific worker, identified by a content hash. It does not accept "first response" or "majority response"; it waits for the cryptographically-identified expected contribution.
This makes the scheduler indifferent to the order in which contributions arrive. Worker A might be faster than worker B in one run and slower in another; the scheduler still produces the same output because it waits for both contributions and combines them in a fixed order, regardless of arrival time.
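The content-addressable wait can be sketched as a stage that knows which worker owes which contribution, rejects substitutes, and combines in a fixed order — a sketch under stated assumptions (class and method names hypothetical; the real scheduler's hash derivation is not shown in this entry):

```python
import hashlib

def content_hash(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

class ContentAddressableStage:
    """Sketch of a content-addressable wait. The stage refuses 'first
    response wins': a contribution is accepted only if it comes from the
    expected worker AND matches that worker's expected content hash."""
    def __init__(self, expected):
        self.expected = expected   # {worker_id: expected content hash}
        self.received = {}

    def deliver(self, worker_id, payload: bytes):
        # Accept only the cryptographically-identified expected contribution.
        if self.expected.get(worker_id) == content_hash(payload):
            self.received[worker_id] = payload

    def ready(self):
        return set(self.received) == set(self.expected)

    def combine(self):
        # Fixed combination order (by worker id), independent of arrival order.
        return b"".join(self.received[w] for w in sorted(self.received))

# Two runs, opposite arrival orders, identical combined output.
exp = {0: content_hash(b"a"), 1: content_hash(b"b")}
s1 = ContentAddressableStage(exp); s1.deliver(1, b"b"); s1.deliver(0, b"a")
s2 = ContentAddressableStage(exp); s2.deliver(0, b"a"); s2.deliver(1, b"b")
assert s1.ready() and s2.ready() and s1.combine() == s2.combine()
```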
The cost of this design is that the slowest worker bottlenecks the pipeline. If worker A is 10× slower than the others, every output token waits for worker A. This is the correctness-vs-throughput tradeoff: throttle-invariance guarantees correctness but doesn't guarantee throughput. Bets on throughput optimisation (e.g. speculative decoding, expert prefetching) are separate concerns; this bet validates that whatever throughput optimisations the federation adds, they preserve the throttle-invariance property.
Reframing — what this does and does not validate
The original framing of this bet was "throttle-invariance validates phone-class deployment." A reviewer pushed back on this framing as overstating what the bet measures. The corrected framing is more careful.
What Bet 45 validates:
- The scheduler tolerates slow workers correctly at the protocol level.
- Pipelines remain deterministic when workers run at different speeds.
- The aggregation logic isn't reading "first to respond" as "correct response."
- The federation's transparency story (Bet 18) is reproducible — the audit log is meaningful.
What Bet 45 does NOT validate:
- Phone-class deployment in production. Phone deployment requires substantially more validation:
- On-device thermal testing. Sustained inference on a phone causes the SoC to throttle for thermal reasons, sometimes severely. The throttle profile is dynamic and not always smooth. The bet's artificial throttling is constant within a run; real phone throttling is bursty.
- OS process kill behaviour. Android and iOS aggressively kill background processes to free memory. The federation must survive being killed and restarted mid-inference, which is a recovery question (Bets 41-43), not a throttle question.
- Memory pressure. A federation worker running on a phone alongside the user's normal apps will be evicted from memory frequently. The federation's memory footprint and resilience to eviction matter.
- Sustained-load hardware degradation. Does running federation training on a phone wear the device out faster? Battery degradation, thermal cycling effects on the SoC, etc. Not addressed.
- User-perceived latency. Even if the federation is correct, the latency on a phone may be unacceptable to the user. UX-level validation.
The Open Questions chapter has on-device phone validation as an explicit open question, gated on Bet 45 (scheduler correctness) and Bet 63 (numerical robustness) but not satisfied by either of them. Phone deployment requires the open work plus likely a Kerala IT@School-style pilot to measure the unmeasurable-in-simulation properties.
Why the reframing matters
The bet's STRICT PASS is a strong result. The temptation is to frame strong results in the most expansive way possible — "throttle-invariance validates phone deployment" sounds confident and broad. The corrected framing — "throttle-invariance validates one of several preconditions for phone deployment" — sounds modest and narrow.
The catalogue's empirical discipline favours the narrow framing. A bet's claim should match exactly what the bet measures, no more and no less. Overstating what a bet validates erodes trust in the catalogue: a reviewer who finds one overstated claim will scrutinise every other claim more harshly. Understating is also bad — it leaves on the table the legitimate strength of the bet.
The fix here was to be precise about the precondition relationship. Bet 45 is a necessary condition for phone deployment — a federation that fails throttle-invariance cannot deploy to phones. It is not a sufficient condition — passing throttle-invariance doesn't mean phone deployment is ready.
This is the kind of housekeeping that keeps the catalogue calibrated. The catalogue's value comes from its claims being trustworthy at face value. When a bet's framing is wrong, fixing the framing is more important than producing more bets.
Run command
PYTHONPATH=src python -m experiments.bets.45_throttle
Output: experiments/bets/results/45_throttle.json records the per-prompt token sequences under each throttle profile, the per-token log-probabilities, and the comparison results (bit-exact match flags). Plus the per-worker timing profiles, so a reader can verify that the throttling actually affected timings (otherwise the bet would be measuring nothing).
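A reader could sanity-check the results file along the lines below — a sketch only, since the JSON schema is not spelled out in this entry; every field name here is an assumption to adjust against the actual 45_throttle.json:

```python
def check_results(results):
    """Sanity-check parsed results (field names are assumptions; adjust
    to the actual schema of 45_throttle.json)."""
    for prompt in results["prompts"]:
        profiles = prompt["profiles"]
        base = profiles["no_throttle"]
        for run in profiles.values():
            # Bit-exact: tokens AND log-probs identical across profiles.
            assert run["tokens"] == base["tokens"]
            assert run["logprobs"] == base["logprobs"]
        # The throttling must actually have affected timings -- otherwise
        # the bet would be measuring nothing.
        assert max(profiles["ramp"]["worker_times"]) > max(base["worker_times"])

# Usage (hypothetical): check_results(json.load(open(
#     "experiments/bets/results/45_throttle.json")))
```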
Related entries
- Bets 41-43: pipeline recovery and fault tolerance. The complementary correctness story — workers can fail, not just slow down.
- Bet 18: glass-box LLM. Throttle-invariance is what makes the audit log reproducible; without it, the audit log is timing-dependent.
- Bet 44: Byzantine-robust aggregation. Different fault model (adversarial vs slow); also part of the federation's correctness story.
- Bet 63: numerical robustness. Different precondition for phone deployment (handling fp16 round-off correctly).
Why it matters
Bet 45 establishes that the federation is scheduler-correct under heterogeneous worker speeds — a necessary condition for any deployment that mixes desktop, laptop, and phone-class participants. A federation that fails this would have non-deterministic outputs under realistic load profiles, breaking transparency and reproducibility.
It is not sufficient evidence for phone deployment, and the framing has been tightened to say so explicitly. The Open Questions chapter is clear about what additional work phone deployment requires.
The methodological lesson: pre-registered criteria and post-hoc reviewer pushback are both load-bearing. The bet's STRICT PASS was clean on the pre-registered criterion (bit-exact tokens across throttle profiles). The reviewer's pushback was on the framing, not the measurement. The right response was not to relitigate the measurement but to clarify what the measurement does and doesn't claim. The catalogue's discipline is to update framings when warranted, not just measurements.
That discipline is what makes the catalogue's claims durable. A reader scanning the catalogue can trust that "STRICT PASS" means exactly what it measures, with the framing calibrated to match. The bet's narrative — strong measurement, tightened framing, explicit gates to open questions — is a model for how the catalogue handles the gap between what bets validate and what deployments require.