Bet 77 — Adversarial debate quality lift (CATASTROPHIC, with surprise)
The third clean catastrophic falsification in the operating-layer batch. Naive K-of-N majority debate hurts accuracy by 11.88 percentage points compared to a single domain-routed specialist (68.5% vs 80.4%). Confidence-weighted debate recovers most of the loss but does not exceed the single-specialist baseline. The federation's alignment story is not "debate"; it is good routing.
The frame: alignment-via-diversity, not RLHF. Instead of one specialist answering, K specialists answer the same query independently; an adjudicator picks the best or aggregates. Bet 77 measures whether N-specialist debate raises factual accuracy on a hallucination-prone task, and whether confidence weighting (which is intuitive but vulnerable to confident hallucinations) is better or worse than majority vote.
The pessimist hypothesis going in had two parts:
- Debate produces a meaningful accuracy lift over single-specialist.
- Confidence-weighted vote is worse than majority vote because hallucinations are confident.
Both parts of the pessimist hypothesis are wrong, but in different ways. (1) is wrong because the federation's routing — picking the best specialist for the query — is so much better than random panel selection that no panel-aggregation scheme catches up. (2) is wrong because confident hallucinations are randomly distributed across wrong-answer choices, while in-domain correct answers are concentrated on the true answer; the weighting actually helps.
This is one of the catalogue's most counter-intuitive results. The implications for federation alignment design are large.
Background — the debate-vs-routing question
The disruptive frame for AGI alignment includes two competing schools:
- Alignment via routing. The federation has many specialists; each query is routed to the specialist most qualified to answer it. The alignment work happens at the routing layer (Bets 3, 8, 18, 72). A single specialist answers, and answers well.
- Alignment via debate. Multiple specialists answer; their disagreements expose hallucinations and weak reasoning; an adjudicator picks the best answer or aggregates. The alignment work happens at the adjudication layer.
The two are not mutually exclusive in principle. But Bet 77 asks: in a federation where routing is reliable, does debate add anything? The pessimist concern: debate amplifies confident hallucinations, possibly making things worse.
Hypothesis
K-of-N voting (K=5 of 10 specialists) provides a meaningful accuracy improvement (≥ 10 percentage points) over single-specialist domain-routed answers on a hard, hallucination-prone benchmark. Confidence-weighted vote may be vulnerable to confident hallucinations and produce strictly worse results than uniform majority.
Pre-registered criteria
- STRICT: K-of-N voting accuracy ≥ single-specialist + 10 pp.
- LENIENT: K-of-N voting ≥ single-specialist + 3 pp.
- CATASTROPHIC: K-of-N voting ≤ single-specialist (debate adds nothing) OR confidence-weighted < majority by ≥ 5 pp (the pessimist hypothesis on confidence weighting).
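The three tiers can be expressed as a small classifier. This is an illustrative helper, not part of the bet's harness; accuracies are fractions, and the CATASTROPHIC clause is checked first because either of its conditions overrides any lift:

```python
def classify(single, majority, weighted):
    """Map measured accuracies (as fractions) onto the pre-registered tiers.
    Hypothetical helper for illustration; CATASTROPHIC dominates the others."""
    if majority <= single or majority - weighted >= 0.05:
        return "CATASTROPHIC"
    if majority >= single + 0.10:
        return "STRICT"
    if majority >= single + 0.03:
        return "LENIENT"
    return "no tier met"
```

With Bet 77's measured values, `classify(0.8042, 0.6853, 0.7993)` returns `"CATASTROPHIC"` via the first clause (majority ≤ single).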
Setup
- 10 specialists, each home-domain assigned to one of 5 domains (2 per domain).
- Specialist accuracy as a function of distance from home domain:
- Distance 0 (home): 80% correct.
- Distance 1: 50%.
- Distance 2+: 35% — confident hallucination regime.
- 2,000 queries × 3 seeds = 6,000 query-trials. Each query has a true answer, a domain, and 4 answer choices.
- Three strategies:
- A. Single-best. Pick the specialist with highest accuracy in the query domain (the federation's routing baseline).
- B. Majority (5 of 10 random panel). Random 5 of the 10 specialists answer; majority vote.
- C. Confidence-weighted (5 of 10 random panel). Same panel; aggregate by confidence-weighted answer-sum.
- Confidence model:
- Correct answer: confidence ~ U(0.7, 0.95).
- In-domain wrong: confidence ~ U(0.3, 0.6) — humble.
- Out-of-domain wrong: confidence ~ U(0.6, 0.9) — confident hallucination.
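The setup above can be reproduced in miniature. This is a sketch under stated assumptions — the inter-domain distance metric (cyclic over the 5 domains) and the majority tie-breaking rule are my guesses, since the bet's harness does not specify them here:

```python
import random

DOMAINS, CHOICES = 5, 4
rng = random.Random(7)

def distance(a, b):
    # Assumption: the 5 domains sit on a ring, so distance is cyclic.
    d = abs(a - b)
    return min(d, DOMAINS - d)

def answer(home, query_domain, true_choice):
    """One specialist's (choice, confidence) on a 4-choice query."""
    dist = distance(home, query_domain)
    acc = 0.80 if dist == 0 else 0.50 if dist == 1 else 0.35
    if rng.random() < acc:
        return true_choice, rng.uniform(0.70, 0.95)   # correct: fairly confident
    wrong = rng.choice([c for c in range(CHOICES) if c != true_choice])
    if dist == 0:
        return wrong, rng.uniform(0.30, 0.60)         # in-domain wrong: humble
    return wrong, rng.uniform(0.60, 0.90)             # out-of-domain: confident hallucination

homes = [d for d in range(DOMAINS) for _ in range(2)]  # 10 specialists, 2 per domain
single = majority = weighted = 0
TRIALS = 4000
for _ in range(TRIALS):
    qd, truth = rng.randrange(DOMAINS), rng.randrange(CHOICES)
    # A. Single-best: route to a home-domain specialist.
    single += answer(qd, qd, truth)[0] == truth
    # B/C. Random 5-of-10 panel: majority vs confidence-weighted aggregation.
    votes, conf = [0.0] * CHOICES, [0.0] * CHOICES
    for home in rng.sample(homes, 5):
        c, p = answer(home, qd, truth)
        votes[c] += 1
        conf[c] += p
    majority += max(range(CHOICES), key=lambda c: (votes[c], rng.random())) == truth
    weighted += max(range(CHOICES), key=lambda c: conf[c]) == truth

print(f"single {single/TRIALS:.3f}  majority {majority/TRIALS:.3f}  weighted {weighted/TRIALS:.3f}")
```

Even this toy version reproduces the ordering reported below: routed single-best on top, naive majority far behind, confidence-weighting recovering most of the gap.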
Result — CATASTROPHIC: debate hurts
| Strategy | Accuracy | Lift vs single | Note |
|---|---|---|---|
| Single-best (routed) | 80.42% | — | baseline |
| Majority (5/10 random) | 68.53% | −11.88 pp | debate HURTS |
| Confidence-weighted (5/10) | 79.93% | −0.48 pp | recovers but doesn't exceed |
| Confidence-weighted vs Majority | — | +11.40 pp | confidence helps, opposite of pessimist hypothesis |
The signs are unambiguous and the magnitudes large. Naive majority debate destroys nearly 12 points of accuracy; confidence-weighting recovers almost all of it. The federation's routing baseline is the highest score in the entire bet — no panel-and-aggregation scheme exceeds it.
Why naive majority hurts
The math is simple. A 5-of-10 random panel pulls specialists from arbitrary domains, so most panelists are out-of-domain (distance 2+) at 35% accuracy. Five such specialists produce a correct strict majority only ~24% of the time, and even with generous plurality tie-breaking the panel stays far below the single-best specialist at 80%.
Even one in-domain specialist on the panel doesn't fully save things, because the panel's vote is dominated by the 4 out-of-domain specialists' wrong (often confident) answers. The wrong answers happen to coordinate on a wrong choice often enough that the in-domain correct vote is outvoted.
In practical terms: a federation that randomly samples panels is throwing away its routing capability. The whole point of the federation is to route well. Random panels are anti-routing.
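The strict-majority figure above is a one-line binomial sum. A sketch, assuming independent voters who all sit at 35% accuracy (the real panel mixes accuracies, and plurality tie-breaking shifts the number upward):

```python
from math import comb

def strict_majority_prob(n, k, p):
    """P(at least k of n independent voters pick the true answer)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(f"{strict_majority_prob(5, 3, 0.35):.3f}")  # ≈ 0.235 for an all-out-of-domain panel
```

By contrast, a hypothetical panel of five home-domain specialists (`p = 0.80`) reaches a correct strict majority over 94% of the time — the composition of the panel, not the voting, is what matters.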
Why confidence-weighting recovers
This is the surprising part. The pessimist concern was: confident hallucinations get amplified. But the math reveals a deeper structure:
- Correct answers concentrate on one choice (the true answer) at ~0.83 average confidence.
- Confident wrong answers scatter randomly across the other 3 choices at ~0.75 average confidence.
So if only 1 of 5 panelists is correct (in-domain), they push ~0.83 weight onto the true answer, while the 4 wrong panelists spread ~3.0 weight across 3 wrong choices; by pigeonhole some wrong choice collects at least two votes, so a lone correct vote is usually outweighed. Humble in-domain wrong votes (~0.45) soften this, but the recovery really comes from the common case:
If 2 of 5 panelists are correct, they push ~1.65 weight onto the true answer; the 3 wrong panelists spread ~2.25 weight across 3 wrong choices, ~0.75 each. The true answer loses only when the wrong votes happen to pile onto a single choice — panels with 2+ correct members go to the truth most of the time, and with 3+ they win outright.
The confidence-weighting mechanism turns out to be roughly equivalent to "trust the correct-clustered signal more than the random-distributed wrong signals". It's robust because hallucinations are random, not coordinated.
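The asymmetry can be checked directly. A Monte Carlo sketch under the worst-case assumption that every wrong panelist is a confident out-of-domain one (humble in-domain wrong votes would only help the true answer further):

```python
import random

rng = random.Random(1)

def weighted_win(n_correct, trials=20_000):
    """Fraction of 5-member panels where the true answer's summed confidence
    beats every wrong choice. Correct votes ~ U(0.7, 0.95) concentrate on the
    true answer; wrong votes ~ U(0.6, 0.9) scatter over the 3 wrong choices."""
    wins = 0
    for _ in range(trials):
        true_w = sum(rng.uniform(0.70, 0.95) for _ in range(n_correct))
        wrong_w = [0.0, 0.0, 0.0]
        for _ in range(5 - n_correct):
            wrong_w[rng.randrange(3)] += rng.uniform(0.60, 0.90)
        wins += true_w > max(wrong_w)
    return wins / trials
```

Under these assumptions one correct panelist always loses (some wrong choice collects at least two votes, i.e. ≥ ~1.2 weight, above the single correct vote's 0.95 ceiling), two correct panelists win most of the time, and three always win — concentration beats diffusion as soon as the correct cluster has any mass.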
What this validates
- The federation's routing-is-load-bearing claim. Routing well > debating. The federation should invest in routing infrastructure (Bet 3 / 8 / 72) before investing in debate infrastructure.
- Confidence-weighting has a subtle robustness. Confident hallucinations are diffuse; correct answers are concentrated. The asymmetry favours the truth.
- Confidence-weighted debate as a routing-failure backstop. When routing is uncertain (a query whose domain is unclear), the federation can fall back to confidence-weighted panel debate — at the cost of 5× the compute, with no accuracy gain over good routing but graceful degradation when routing fails.
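The backstop aggregator itself is tiny. A sketch — the function name and the `(choice, confidence)` vote format are illustrative, not the federation's actual API:

```python
def confidence_weighted_pick(votes):
    """votes: iterable of (choice, confidence) pairs from the panel.
    Returns the choice with the largest summed confidence."""
    totals = {}
    for choice, confidence in votes:
        totals[choice] = totals.get(choice, 0.0) + confidence
    return max(totals, key=totals.get)
```

For example, a lone in-domain correct vote can beat two scattered hallucinations: `confidence_weighted_pick([("A", 0.9), ("B", 0.7), ("C", 0.65)])` returns `"A"`, because the wrong weight is split while the correct weight is concentrated.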
What this does not claim
- Real specialists. The simulation models specialists as accuracy-distributions; real specialists have correlated errors, shared knowledge gaps, and emergent collaboration patterns. The directional finding (routing > debate) should be robust, but magnitudes will shift.
- Adjudicator quality. A meta-specialist or human adjudicator might do better than uniform majority or confidence-weighting. Bet 77 doesn't model this; needs a "smart adjudicator" extension.
- Iterated debate. Debate where specialists see each other's answers and refine could change the dynamic. Open work.
- Multi-step reasoning. Bet 77 models single-shot QA. Multi-step reasoning (where each specialist contributes a sub-step) is a different regime; debate may help there. Open work.
- Adversarial specialists. Bet 77 has no Byzantine specialists. A federation with adversarial specialists deliberately giving wrong answers is a different problem; Bets 64 / 68 cover the receipt-level adversarial surface but not the answer-content surface.
- Calibration of confidence. The simulation uses uniform-distribution confidences. Real specialists may be miscalibrated (over- or under-confident). Recalibration is a key dependency.
What it means for federation alignment
Three takeaways for RFC-0006 and the federation's alignment story:
- Don't build random-panel debate. It loses to routed single-specialist by 12 points. The federation should not publish "alignment via debate" as a default mechanism.
- Routing is alignment. Per-community endorsement (Bet 72) + glass-box transparency (Bet 18) + debate-only-when-routing-fails is the right architecture.
- If debate is offered, weight by confidence. Confidence-weighted aggregation is robust to confident hallucinations and recovers most of single-specialist quality. It's a reasonable backstop when routing is uncertain.
Run command
PYTHONPATH=src python -m experiments.bets.77_adversarial_debate
Output: experiments/bets/results/77_adversarial_debate.json records single / majority / confidence-weighted accuracy and lifts.
Related entries
- Bet 3 / 8: federation routing primitives. Bet 77 strengthens the case that these are load-bearing.
- Bet 18: glass-box transparency. Composes with single-best routing — the user sees which specialist answered.
- Bet 72: liquid democracy. The community-endorsed routing primitive.
- Bet 64 / 68: signed receipts. The cryptographic substrate that prevents specialists from being tampered with.
Why it matters
The disruptive frame's alignment story has been ambiguous. "Federation = many specialists = aligned by diversity" is a tempting narrative. Bet 77 cleanly falsifies the strong version of this narrative.
The federation aligns via good routing, not via debate. The cognitive diversity of the federation matters because each specialist serves their community well; the diversity is not a vote aggregator. It's a routing substrate.
This sharpens the federation's pitch and saves the federation from a class of architectural mistakes (building debate infrastructure that doesn't actually help). It also explains why the federation is more aligned than centralised RLHF: not because there are many voters per query, but because the voter for each query is the specialist genuinely qualified to answer it.
The methodological lesson: a counter-intuitive negative result is one of the highest-value catalogue outcomes. The pessimist hypothesis predicted confident hallucinations would break confidence-weighting; the result shows the opposite. The pessimist hypothesis predicted debate would help; the result shows it hurts. Two simultaneous surprises in opposite directions, both load-bearing for federation design. Without the pre-registered criteria forcing the question, the catalogue could have shipped a debate-as-alignment narrative that would have been quietly wrong.