Bet 44 — Byzantine-robust aggregation

A federation that accepts gradient contributions from arbitrary nodes — anyone in the open-participation pool — must tolerate adversarial contributors. A node sending nonsense gradients (intentionally or due to a bug) cannot be allowed to corrupt the global model. The default aggregation strategy in distributed training is the mean of per-node gradients; this works in trusted environments and fails immediately under adversarial input. The bet validates that coordinate-wise median aggregation is robust to a 1/8 fraction of byzantine actors at extreme scale (gradients 1000× larger than honest contributions), keeping the trained model within 1.084× of the byzantine-free baseline.

This bet matters for the federation's open-participation thesis. If gradient aggregation requires a permissioned member list, the federation reduces to "trust whoever runs the membership server" — defeating the community-ownership goal. With byzantine-robust aggregation, the federation can accept contributions from anyone whose hardware can run a forward-backward pass, and absorb a small fraction of bad actors without consensus-level overhead. This is what makes the federation operationally trust-minimised at the gradient layer.

Background — why mean aggregation fails

The standard distributed-training pattern is synchronous SGD with gradient averaging:

  1. Each of N nodes computes a gradient on its local mini-batch.
  2. Nodes exchange gradients (via all-reduce, parameter server, or similar).
  3. Each node averages all N gradients and applies the average to its local copy of the parameters.
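
The three-step pattern can be sketched in a few lines. This is a toy illustration, not the federation's implementation: a quadratic loss stands in for a real forward-backward pass, and a simple gather-then-average loop stands in for all-reduce.

```python
import numpy as np

def local_gradient(params, batch):
    # Stand-in for a real forward-backward pass: gradient of the
    # quadratic loss ||params - mean(batch)||^2 on this mini-batch.
    return 2.0 * (params - batch.mean(axis=0))

def sync_sgd_round(params, batches, lr=0.1):
    # Step 1: each node computes a gradient on its local mini-batch.
    grads = [local_gradient(params, b) for b in batches]
    # Steps 2-3: exchange (here, a simple gather) and average all N
    # gradients, then apply the averaged update to the parameters.
    return params - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
params = np.zeros(4)
batches = [rng.normal(size=(16, 4)) for _ in range(8)]  # 8 nodes
params = sync_sgd_round(params, batches)
```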

This is mathematically equivalent to single-machine SGD on the union of all mini-batches; it converges nicely under the assumption that all gradients are honest. The assumption breaks when a node sends bad gradients — by accident (numerical issue, software bug, hardware glitch) or by design (an adversarial actor trying to corrupt the model).

Mean aggregation has unbounded sensitivity to outliers. A single byzantine node sending a gradient that's 1000× the magnitude of honest gradients will dominate the average — the model parameters will jump in the direction of the bad gradient on each step. Within a few dozen steps, the model diverges (loss goes to NaN or to unbounded values).
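
A quick numerical illustration of that sensitivity, using synthetic gradients rather than the bet's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
honest = rng.normal(size=(7, 1000))               # 7 honest nodes, unit scale
byzantine = 1000.0 * rng.normal(size=(1, 1000))   # 1 node at 1000x magnitude
grads = np.vstack([honest, byzantine])

mean_agg = grads.mean(axis=0)
honest_agg = honest.mean(axis=0)
# The outlier contributes at ~1000/8 = 125x honest scale, so the mean
# points almost entirely in the byzantine direction.
ratio = np.linalg.norm(mean_agg) / np.linalg.norm(honest_agg)
```

With these scales the aggregate's norm is a couple of orders of magnitude larger than the honest-only aggregate; applied repeatedly with any fixed learning rate, the parameters explode.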

This is not a theoretical concern. In any open-participation distributed system, some fraction of contributors will send bad data. The question is what aggregation primitive is robust to this.

Why coordinate-wise median works

Coordinate-wise median aggregation replaces the mean with a per-coordinate median:

aggregated_gradient[i] = median(gradient_node_1[i], gradient_node_2[i], ..., gradient_node_N[i])

For each parameter coordinate i, take the median of the N nodes' gradients at that coordinate. The result is the per-coordinate median across the federation.
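
A minimal sketch of the aggregator, assuming gradients arrive as flat NumPy vectors (the real pipeline's tensor layout may differ):

```python
import numpy as np

def coordinate_wise_median(grads):
    # grads: list of N flat gradient vectors. For each coordinate i,
    # take the median of the N values at that coordinate.
    return np.median(np.stack(grads), axis=0)

rng = np.random.default_rng(0)
honest = [rng.normal(size=1000) for _ in range(7)]
byzantine = [1000.0 * rng.normal(size=1000)]
agg = coordinate_wise_median(honest + byzantine)
# The aggregate stays at honest scale: at each coordinate the 1000x value
# lands at the extreme of the sorted order and never reaches the middle.
```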

The key property: the median is bounded by the bulk of the distribution, regardless of the magnitude of outliers. A single byzantine node sending a 1000× gradient at coordinate i doesn't change the median, because the median is set by the middle-ranked values among the N contributions, and those values come from honest nodes as long as fewer than half the nodes are byzantine.

Theoretical bound: coordinate-wise median tolerates any byzantine fraction strictly below 1/2 (i.e. an honest majority is required). In practice, the bet validates at a 1/8 byzantine fraction, which is the operating regime the federation cares about: a small but realistic adversarial fraction.
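
The honest-majority requirement is easy to see with constant gradients (a deliberately artificial illustration):

```python
import numpy as np

honest = [np.full(4, 1.0)] * 5     # 5 honest nodes, gradient 1.0 everywhere
byz = [np.full(4, 1000.0)] * 3     # 3 byzantine nodes at 1000.0

agg = np.median(np.stack(honest + byz), axis=0)  # 3/8 byzantine
# Sorted per coordinate: [1, 1, 1, 1, 1, 1000, 1000, 1000]; the middle
# pair is honest, so the median is 1.0.

agg_flipped = np.median(np.stack(honest[:4] + byz + [np.full(4, 1000.0)]),
                        axis=0)                  # 4/8 byzantine
# With half the nodes byzantine, the middle pair straddles honest and
# byzantine values and the median is corrupted: (1 + 1000) / 2 = 500.5.
```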

This is a known result in the distributed-systems literature (Yin et al., 2018, Blanchard et al., 2017, and the broader Krum / Trimmed Mean literature). The bet's contribution isn't novel theory; it's empirical validation that the theory holds for the federation's specific training pipeline (DiLoCo K-step async, 30M FractalMoE, real text data).

Hypothesis

Coordinate-wise median aggregation tolerates a 1/8 fraction of byzantine nodes (each sending gradients scaled 1000× larger than honest gradients) while keeping the global model within a small constant of the byzantine-free baseline. The combined stress test includes node churn (10% drop per round) and partition (random subgraph isolation) running simultaneously.

Pre-registered criteria

  • STRICT: median final loss ≤ 1.1× clean baseline under combined churn + byzantine + partition.
  • LENIENT: ≤ 1.5× clean baseline.
  • CATASTROPHIC: divergence (NaN or unbounded loss) under any test condition.

The combined-stress framing is deliberate. The bet doesn't test byzantine in isolation — it tests byzantine plus node churn plus network partition, all at once. This is the realistic deployment scenario: a federation in the wild has unreliable nodes (churn), unreliable network (partitions), and occasional bad actors (byzantine), all happening simultaneously. If the aggregator is only robust to one of these in isolation, it's not robust enough for production.

Setup

  • 8 nodes in a simulated federation.
  • Each node trains a FractalMoE 30M model on a slice of a real-text corpus.
  • DiLoCo K=10 inner steps per node before each aggregation round.
  • 1 node is byzantine: at every aggregation round, sends gradients scaled by 1000× (random direction, large magnitude).
  • 10% chance per round per node of "dropping" (not contributing this round; rejoin next round). Simulates churn.
  • Random subgraph isolation: at each round, randomly partition the 8 nodes into 2 isolated groups; aggregation happens within each group only. Simulates partition.
  • 50 aggregation rounds total.
  • Two conditions: mean aggregation vs coordinate-wise median aggregation.
  • Clean baseline: same setup with no byzantine, no churn, no partition; mean aggregation works fine.
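
The setup can be miniaturised into a toy simulation. This is a hedged sketch, not the actual experiment script: a quadratic loss replaces the 30M FractalMoE, single SGD steps replace DiLoCo inner loops, and every name here is illustrative.

```python
import numpy as np

def simulate(aggregate, rounds=50, n_nodes=8, dim=64, seed=0):
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)        # "true" parameters
    params = np.zeros(dim)
    for _ in range(rounds):
        # Churn: each node independently drops out with probability 0.10.
        active = [i for i in range(n_nodes) if rng.random() > 0.10]
        # Partition: randomly split active nodes into two isolated groups
        # and follow the aggregation inside one group (kept at size >= 3
        # so the toy stays in the regime the bet tests).
        order = [active[j] for j in rng.permutation(len(active))]
        half = len(order) // 2
        group = order[:half] if half >= 3 else order
        if not group:
            continue
        grads = []
        for i in group:
            if i == 0:
                # Byzantine node: 1000x-scaled random-direction gradient.
                g = 1000.0 * rng.normal(size=dim)
            else:
                g = 2.0 * (params - target) + rng.normal(scale=0.1, size=dim)
            grads.append(g)
        params = params - 0.1 * aggregate(grads)
    return float(np.mean((params - target) ** 2))  # final loss

mean_loss = simulate(lambda gs: np.mean(gs, axis=0))
median_loss = simulate(lambda gs: np.median(np.stack(gs), axis=0))
```

Both conditions reuse the same seed, so churn, partition, and byzantine events are identical across the two aggregators; only the aggregation rule differs.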

The 1000× byzantine magnitude is deliberately extreme — the kind of byzantine signal that should be trivially detectable by any robust aggregator. If coordinate-wise median fails at this magnitude, it would suggest the aggregator has a more fundamental issue. If it succeeds, the aggregator is robust to the easy adversarial case (and a separate test would be needed for subtle adversaries).

Result — STRICT PASS

| Aggregator | Steps to NaN | Final loss vs clean baseline |
|---|---|---|
| Mean | 31 | (diverged) |
| Coordinate-wise median | n/a | 1.084× |

Mean aggregation diverged in 31 steps under a single byzantine actor. This is the expected behaviour: a 1000× scaled gradient overwhelms the average within a few rounds, and the model parameters explode.

Coordinate-wise median held within 1.084× of clean baseline under combined churn + byzantine + partition. The aggregator absorbed all three stress conditions simultaneously without diverging, and the final loss is only 8.4% worse than the no-stress baseline. STRICT passes (≤ 1.1×).

What this buys

The federation can accept gradient contributions from open participation without inspecting each contribution. The aggregator absorbs adversarial inputs as long as the byzantine fraction stays below ~1/2 (theoretical bound). The federation's gradient-layer trust model becomes:

  • Trust honest majority of participants. Standard distributed-systems assumption.
  • Don't trust any individual contribution. Aggregator handles the per-contribution adversarial case.
  • No permissioned member list required. The federation can be open in the gradient layer.

This is the property that makes the federation community-ownable in operation, not just community-owned in branding. Anyone with a laptop can contribute training compute; the aggregator filters out their contribution if it's adversarial without requiring a central authority to vet them.

What this does not buy

The bet validates coarse byzantine. Three flavours of adversarial behaviour are not addressed:

  • Subtle byzantine. A byzantine node sending gradients scaled only 1.05× relative to honest gradients (rather than 1000×) doesn't trigger median-based outlier rejection — the slightly-larger gradients can influence the median over many rounds, drifting the model in an adversarial direction. The bet validated 1000× scaling; subtle (1.05×, 1.5×, 2×) byzantine is a separate research problem. Mitigations may include trimmed-mean aggregators or norm-based clipping.

  • Collusion. If a coordinated group of byzantine nodes representing more than 1/2 of the federation cooperate, they can swing the median. The bet was 1/8 single-actor; collusion at 1/3 fraction is plausible in a public federation and would require a different defence (e.g. reputation systems, stake-weighted aggregation). Open work.

  • Poisoned data. A byzantine actor sending honest-looking gradients computed on poisoned training data is invisible to median aggregation — the gradients are statistically normal at the parameter level; the corruption is in the data semantics. Data validity is a separate layer of defence (e.g. content filters, source attribution, anomaly detection on the data side). Bet 44 doesn't address this; the federation's data-validation layer is a separate concern.
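
One of the mitigations mentioned above, the coordinate-wise trimmed mean, can be sketched as follows. This is an illustrative aggregator, not part of the bet; the trim count must be set at or above the expected number of byzantine nodes, and like the median it handles magnitude outliers, not subtle drift.

```python
import numpy as np

def trimmed_mean(grads, trim=1):
    # Sort the N contributions at each coordinate, drop the `trim`
    # largest and `trim` smallest values, and average the remainder.
    # Tolerates up to `trim` byzantine nodes of arbitrary magnitude.
    stacked = np.sort(np.stack(grads), axis=0)
    return stacked[trim:len(grads) - trim].mean(axis=0)

rng = np.random.default_rng(0)
honest = [rng.normal(size=1000) for _ in range(7)]
byzantine = [1000.0 * rng.normal(size=1000)]
agg = trimmed_mean(honest + byzantine, trim=1)  # 1000x values are dropped
```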

The takeaway: coordinate-wise median is robust to the easy adversarial case (large-magnitude byzantine), which is the most common failure mode and the most likely accidental failure (numerical instability producing huge gradients). Subtle and coordinated adversaries require additional layers of defence.

Composition with churn and partition

A subtle property the bet validates: coordinate-wise median composes correctly with node churn and network partition. When a node drops (churn), the median is computed over the remaining nodes; when the network partitions, the median is computed within each partition independently.

This composition matters because real federations don't have separable failure modes. A node that drops mid-round is also potentially a node whose previous contributions were byzantine. A partition that isolates a byzantine node from the rest of the federation is also a partition that prevents some honest aggregation. The aggregator has to handle all of these simultaneously.

The 1.084× final-loss number is the federation's tolerance budget for combined stress. If the federation runs in the wild with this stress profile, it converges to a final loss 8.4% worse than ideal. That's the price of robustness; the alternative (mean aggregation) is unbounded divergence.

Run command

PYTHONPATH=src python -m experiments.bets.44_byzantine

Output: experiments/bets/results/44_byzantine.json records the per-round losses for both aggregators under all stress conditions, the per-round byzantine activity log, the churn events, and the partition structure for each round.

Related bets

  • Bets 41-43: pipeline recovery, fault-tolerance protocols. The non-byzantine fault model.
  • Bet 45: throttle-invariance. A different fault model — slow workers — where the bet validates correctness rather than robustness.
  • Bet 50, 62: DiLoCo K-step training. The byzantine bet is about gradient aggregation; K-step is about the frequency of aggregation. They compose: K-step DiLoCo with coordinate-wise median is the federation's training default.

Why it matters

Without byzantine robustness, federation requires a permissioned member list — defeating the open-participation thesis. The federation would only accept gradient contributions from vetted participants, the vetting service would become a centralised authority, and the community-ownership story would collapse to "trust whoever runs the vetting service."

With coordinate-wise median, the federation can run under public participation as long as the threat model is "occasional malicious actor" rather than "coordinated state-level adversary." For the federation's deployment scenarios — community-owned LLM federation, Kerala IT@School with 215k contributing devices, citizen-contributed compute pools — the right threat model is the former. State-level adversaries are out of scope (and require defences beyond gradient aggregation: e.g. trusted hardware, formal verification, multi-party computation).

The bet's STRICT PASS at 1.084× under combined stress is what makes this story technically grounded rather than aspirational. The federation can absorb realistic adversarial input at the gradient layer, with a small but bounded loss penalty. Open-participation is operationally feasible.