Bet 80 — Specialist-fork detection (PESSIMIST)

The first bet in Wave A's security extension. Naive fork detection works (100% catch on noise-perturbed clones), but trojan-injected forks evade detection completely (0%) when the witness set is sampled from the trainer's claimed domain. Random sampling from the actual user-query distribution at N=100 closes the gap (100% detection).

The frame: a trainer registers a specialist with the federation. After registration, a malicious party — possibly the original trainer themselves — forks the specialist into a trojan-injected variant that behaves normally on the federation's witness probes while emitting attacker-controlled outputs on a chosen trigger pattern. The federation must catch this.

Detection mechanism: every specialist responds to a witness set (~100 prompts) at registration time; responses are committed into the manifest hash (Bet 64 dependency). Periodically and on suspicion, the federation re-runs the witness set and compares responses. Divergence above threshold τ = 2% flags the specialist as forked.

Pessimist hypothesis: a sophisticated attacker can fork a specialist to behave normally on the witness set (passing detection) while behaving malignly on a trigger NOT in the witness set.

The result: the pessimist is right — domain-witness alone is theatre. The fix is straightforward: random sample from real user-query distribution. The mandate is unambiguous.

Background — why this matters

Bet 68 already established that receipts must be client-signed. But that protects against server tampering with receipts. It does not protect against trainer tampering with the specialist itself. A trainer who registers an honest specialist on Monday could deploy a trojan-injected fork on Tuesday and continue collecting royalty payments. The federation would not notice unless it detects the fork.

The federation's defence is the witness set: a battery of test prompts the specialist must answer consistently. The question Bet 80 asks: what witness set design actually works?

Hypothesis

A federation-managed witness set of N=100 prompts can detect both naive forks (random output perturbation) and targeted-trigger forks (5% of outputs corrupted) at ≥ 95% rate, with zero false positives on honest re-runs.

Pre-registered criteria

STRICT: at N=100 with random-sampled witnesses, naive forks detected ≥ 99%, targeted-trigger forks detected ≥ 95%, false-positive rate = 0%.
LENIENT: targeted detection ≥ 80%, FPR < 1%.
CATASTROPHIC: targeted-trigger forks evade detection ≥ 50% (witness set is security theatre), OR FPR ≥ 5% (honest re-runs incorrectly flagged).

Setup

Specialist model: a deterministic SHA-based function prompt → output ∈ [0, 1000).
Trojan model: a fork that agrees with the base on all normal-* prompts but emits a perturbed output on any prompt prefixed with TROJ-. The 5% trigger fraction means 5% of real queries activate the trojan.
Two witness sampling strategies:
- Domain-witness: drawn from the trainer's claimed-domain distribution. No TROJ- prompts.
- Random-witness: drawn from the actual user-query distribution, which DOES include 5% TROJ- prompts (because attackers test their triggers against real federations).
Witness sizes: 25, 50, 100, 200.
τ = 0.02: 2% witness disagreement = flagged as fork.
200 trials per witness size × 3 seeds = 600 trials.

Result

| Witness size | Naive fork detection | Targeted (domain-witness) | Targeted (random-witness) | False-positive rate | |---|---|---|---|---| | 25 | 100% | 0% | 33.3% | 0% | | 50 | 100% | 0% | 66.7% | 0% | | 100 | 100% | 0% | 100% | 0% | | 200 | 100% | 0% | 100% | 0% |

The signs are unambiguous. A witness set drawn from the trainer's claimed domain is useless against trojan forks. A witness set drawn from real user-query distribution at N≥100 catches every trojan fork in our trials. False positives are zero across the board.

Why domain-witness fails

The attacker's strategy is simple: choose a trigger pattern that doesn't appear in the trainer's claimed domain. The federation's witness set, drawn from that same domain, will never include the trigger. The fork's outputs match the base on every witness prompt → divergence = 0% → not flagged.

The federation's mistake would be to assume that "checking the trainer behaves correctly on their own domain" is sufficient. It isn't. Trojans don't care about the trainer's domain; they care about the federation's user-query distribution.

Why random-witness works

A random-witness drawn from real user queries naturally includes the same trigger patterns that real attackers use. With 5% triggers in the user distribution, a 100-prompt random witness will contain ~5 trigger prompts on average. The trojan diverges on all 5 → witness divergence = 5% > τ = 2% → flagged.

The deeper insight: the federation's witness set must reflect what users actually send, not what the trainer claims to specialise in. This is a slightly counter-intuitive design choice but it's the right one.

What this validates

Naive fork detection is robust. Random output perturbation is caught at any witness size.
Random-witness sampling at N=100 is sufficient to catch targeted-trigger forks at the 5% trigger fraction tested.
False-positive rate is zero when comparing the specialist against itself (a true honest re-run).
The witness primitive composes with Bet 64. Witness responses are committed in the Merkle audit trail; tampering with them is detected by the existing audit primitive.

What this does not claim

Lower trigger fractions. The pessimist case is a 5% trigger; an attacker who hides their trojan to fire only on 0.1% of queries (10× rarer) would need a 1000+ witness sample to detect with high probability. Bet 80 doesn't sweep this; needs follow-up.
Multi-trigger trojans. A sophisticated attacker could embed multiple disjoint triggers, each rare. Bet 80 tests a single trigger pattern.
Cooperating trojan + collusion. A trojan that suppresses output on its own activation when it detects a "test environment" (canary detection) is harder. Open work.
Real LLM specialists. The simulation uses a deterministic hash function. Real specialists have stochastic outputs (temperature, sampling); witness comparison must be statistical, not exact-match. Bet 80's threshold τ accounts for this in principle but the magnitudes will shift on real models.
Adversarial witness construction. A federation that builds the witness set adversarially (using the federation's own LLM to generate diverse prompts) may further raise detection.
Periodic re-run cadence. Bet 80 tests detection at registration + a single check; in practice the federation runs witness checks periodically. Cadence-design is open work.

The mandate

RFC-0006 §5 must specify:

Witness set sampled from the actual user-query distribution, not the trainer's claimed domain. Maintain a live stream of representative queries (anonymised) for witness-set construction.
Witness size N ≥ 100. Smaller sizes have weaker detection (33-67% at N=25-50).
Periodic re-witness — at minimum at every weight update the trainer claims, and at random intervals chosen by the federation.
Witness divergence τ = 2% — empirically calibrated; smaller τ would catch rarer trojans at the cost of higher FPR.
Witness responses committed in Merkle chain (compose with Bet 64). The federation cannot retroactively change the witness criteria without auditable record.

Run command

PYTHONPATH=src python -m experiments.bets.80_fork_detection

Output: experiments/bets/results/80_fork_detection.json records detection rates per witness size and sampling strategy, plus FPR.

Bet 64: audit non-repudiation. Witness responses must be committed in the Merkle chain.
Bet 68: royalty correctness. Combines with Bet 80 — signed receipts protect against server tampering; witness sets protect against trainer tampering.
Bet 18: glass-box LLM. Per-token attribution exposes specialist identity, which is what makes per-specialist witness checks possible.

Why it matters

Federation-trust requires both receipt integrity (Bet 68) AND specialist integrity (Bet 80). Without Bet 80's mandate, a trainer who registers an honest specialist could ship a trojan'd fork the next week and the federation would not notice — collecting royalties on every malicious response indefinitely.

The lesson is small but load-bearing: the witness set must reflect what users send, not what trainers claim. The design fix is trivial (random sampling from a query stream), but without empirical confirmation the federation might have shipped a domain-only witness scheme that looks defensive but isn't.

Bet 80 closes one of Wave A's open security questions: trainer-side fork detection is achievable cheaply, with the right witness-set design.