Bet 94 — Confidence-weighted routing fallback (PESSIMIST)

A clean STRICT pass after tightening the trigger. Confidence-weighted aggregation, used as a fallback rather than as a default, preserves confident-query quality at 100.7% of single-route AND improves ambiguous-query quality by +3.8 percentage points, while firing on only 25.8% of queries. The two-signal trigger (top confidence < 0.80 OR top-2 spread < 0.10) cleanly separates queries where ensembling helps from queries where ensembling hurts.

The frame: Bet 77 surfaced that naive K-of-N adversarial debate hurts vs single-routed answer (-11.9 pp). But confidence-weighted aggregation recovered most of the loss. Bet 94 quantifies WHEN to invoke confidence-weighted aggregation as a fallback. Two regimes:

Routing is confident (top specialist's confidence ≥ threshold AND clear gap to second-best): use single answer.
Routing is uncertain (top specialist's confidence < threshold OR top-2 within spread): fall back to confidence-weighted ensemble of top-K.

The pessimist concerns going in:

Confidence is gameable — an adversary returns spuriously-high confidence to capture all queries.
Confidence calibration drifts — a specialist's confidence may be miscalibrated; threshold-based fallback fires unnecessarily.
Ensemble cost — aggregation costs >2× compute; if it fires too often, the federation can't sustain it.
Tie-breaking — top-K ties by confidence; how to break without introducing bias.

Background — the per-query failure mode

Bet 77's catastrophic finding was that K-of-N debate hurts vs single-routed answer. The per-query reason: when the router has good confidence in the top specialist, ensembling top-K dilutes the signal with worse answers. When the router has poor confidence, ensembling top-K might help by averaging out noise.

The catalogue needs a router-level decision: how does the router know which regime it's in? Two natural signals are available:

Signal 1: absolute top confidence. If the top specialist reports 0.95 confidence, it is probably right; route single. If it reports 0.55, it is probably uncertain; ensemble.
Signal 2: relative spread between top-1 and top-2. If the top has 0.90 and the second has 0.30, the top is clearly winning. If the top has 0.65 and the second has 0.62, it's a near-tie; ensemble.

Both signals are gameable on their own. Adversaries can pump confidence (Signal 1). Adversaries can match the top specialist's confidence to force a near-tie (Signal 2). The composition is what gives the right answer.
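The composition of the two signals can be sketched as a single predicate. This is a minimal illustration, not the experiment's code: the function name `should_fall_back` and its handling of a single candidate are assumptions, while the default thresholds (0.80 and 0.10) are the ones this bet tests.

```python
from typing import Sequence

T_CONF = 0.80    # absolute-confidence threshold tested in this bet
T_SPREAD = 0.10  # top-2 spread threshold tested in this bet


def should_fall_back(confidences: Sequence[float],
                     t_conf: float = T_CONF,
                     t_spread: float = T_SPREAD) -> bool:
    """Return True when the router should ensemble instead of single-routing.

    Hypothetical helper: `confidences` holds one reported confidence per
    candidate specialist for a single query.
    """
    if not confidences:
        raise ValueError("need at least one candidate confidence")
    ranked = sorted(confidences, reverse=True)
    top = ranked[0]
    # Assumption: with a single candidate there is no runner-up,
    # so the gap is treated as maximal.
    second = ranked[1] if len(ranked) > 1 else 0.0
    low_confidence = top < t_conf          # Signal 1
    near_tie = (top - second) < t_spread   # Signal 2
    return low_confidence or near_tie


# The worked examples from the text above.
print(should_fall_back([0.95, 0.30]))  # high confidence, clear gap
print(should_fall_back([0.55, 0.30]))  # low absolute confidence
print(should_fall_back([0.65, 0.62]))  # near-tie
```

Single-routing happens only when both signals pass, so an adversary has to defeat both checks at once to stay on the single-route path.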

Hypothesis

A two-signal fallback policy (top confidence < threshold OR top-2 spread < threshold) preserves confident-query quality at ≥ 98% of always-single-route, improves ambiguous-query quality by ≥ +2 pp, and fires on < 30% of queries.

Pre-registered criteria

STRICT: confident-quality ratio ≥ 0.98, ambiguous-quality delta ≥ +0.02, fire rate < 0.30, fallback ≥ single on ambiguous.
LENIENT: ratio ≥ 0.95, delta ≥ +0.01, fire < 0.50, fallback ≥ single.
CATASTROPHIC: fallback strictly worse than single on confident OR ambiguous (delta < -0.02).
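The three tiers can be written down as a small classifier, which makes the precedence between them explicit. This is a sketch under stated assumptions: the `Metrics` dataclass and `verdict` function are hypothetical names, the catastrophic test is assumed to be checked first and to apply the -0.02 delta to either query class, and a "FAIL" label is assumed for outcomes that meet no tier.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    confident_ratio: float   # fallback / single quality on confident queries
    confident_delta: float   # fallback minus single on confident queries
    ambiguous_delta: float   # fallback minus single on ambiguous queries
    fire_rate: float         # fraction of queries that trigger the fallback


def verdict(m: Metrics) -> str:
    # Assumption: CATASTROPHIC takes precedence over the pass tiers.
    if m.confident_delta < -0.02 or m.ambiguous_delta < -0.02:
        return "CATASTROPHIC"
    # "fallback >= single on ambiguous" is implied by the positive
    # delta floors below, so it is not checked separately.
    if (m.confident_ratio >= 0.98 and m.ambiguous_delta >= 0.02
            and m.fire_rate < 0.30):
        return "STRICT"
    if (m.confident_ratio >= 0.95 and m.ambiguous_delta >= 0.01
            and m.fire_rate < 0.50):
        return "LENIENT"
    return "FAIL"


# Illustrative inputs only; these are not results from the experiment.
print(verdict(Metrics(0.99, -0.008, 0.05, 0.25)))
print(verdict(Metrics(0.985, -0.012, 0.015, 0.40)))
print(verdict(Metrics(0.90, -0.08, 0.10, 1.00)))
```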

Setup

50 specialists, 5% adversarial (spurious high-confidence at 0.85, true quality 0.20).
1000 queries: 70% confident (low difficulty), 30% ambiguous (high difficulty).
Confidence model. Honest specialists: confidence ~ true_quality, modulated by calibration. Adversarial: high confidence regardless of query.
Three policies tested: always-single, always-K=3, threshold-fallback (T=0.80, spread=0.10).
Quality measurement. True quality of the routed/ensembled answer (ground truth).

Result — STRICT PASS

| Policy | Confident quality | Ambiguous quality | Fire rate |
|---|---|---|---|
| always-single | 0.813 | 0.193 | n/a |
| always-K | 0.768 | 0.335 | 100% |
| threshold-fallback | 0.819 | 0.231 | 25.8% |

| Delta | Confident-quality ratio | Ambiguous-quality delta |
|---|---|---|
| Fallback vs Single | 1.007× (slightly better) | +0.038 |

Fallback preserves confident-query quality (1.007× of single-route), improves ambiguous-query quality by +3.8 pp, and fires on only 25.8% of queries.
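The headline ratio and delta follow directly from the per-policy qualities reported above. The snippet below just redoes that arithmetic; the dictionary layout and the helper name `deltas_vs_single` are illustrative and are not the format of the results JSON.

```python
# Per-policy qualities copied from the results table.
results = {
    "always-single": {"confident": 0.813, "ambiguous": 0.193},
    "always-K": {"confident": 0.768, "ambiguous": 0.335},
    "threshold-fallback": {"confident": 0.819, "ambiguous": 0.231},
}


def deltas_vs_single(policy: str) -> dict:
    """Compare one policy against the always-single baseline."""
    base = results["always-single"]
    row = results[policy]
    return {
        "confident_ratio": row["confident"] / base["confident"],
        "ambiguous_delta": row["ambiguous"] - base["ambiguous"],
    }


for policy in ("always-K", "threshold-fallback"):
    d = deltas_vs_single(policy)
    print(f"{policy}: ratio={d['confident_ratio']:.3f}, "
          f"delta={d['ambiguous_delta']:+.3f}")
# always-K: ratio=0.945, delta=+0.142
# threshold-fallback: ratio=1.007, delta=+0.038
```

On these numbers always-K gains more on ambiguous queries but gives up about 5.5% of confident-query quality, which is the trade the threshold policy avoids.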

Why this works

Three structural reasons:

The two-signal trigger is precise. Single-routing requires both signals to pass: a query with high top confidence AND a clear gap is genuinely confident, and ensembling can only dilute it. A query that fails either test is treated as ambiguous; there, ensembling can only help (or be neutral).
Adversarial confidence-pumping is partially defended. When the adversary pumps confidence to 0.85, the absolute-confidence signal stays quiet (0.85 is not below 0.80), but the spread signal fires: the adversary's confidence is fixed at 0.85 and other specialists' confidences cluster near it, so the gap to second-best is small. The router therefore falls back to the ensemble, which weights by confidence × quality, and the adversary's low quality pulls down its weight.
The fire rate (25.8%) is close to the ambiguous-query fraction (30% in this setup). The trigger isn't firing randomly; it's firing on the queries that actually need it. This is the right calibration.
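The adversarial case in the second reason can be illustrated with a toy query. Everything here is hypothetical: the candidate confidences and qualities are made up, and the ensemble is scored as a plain confidence-weighted mean of per-specialist true quality, which is a simplifying stand-in for whatever weighting the simulation actually uses.

```python
# (reported confidence, true quality) per candidate specialist.
# The first entry plays the adversary: confidence 0.85, quality 0.20.
candidates = [(0.85, 0.20), (0.82, 0.80), (0.80, 0.75), (0.40, 0.50)]


def spread_fires(cands, t_spread=0.10):
    """Signal 2: is the gap between the top two confidences too small?"""
    confs = sorted((c for c, _ in cands), reverse=True)
    return (confs[0] - confs[1]) < t_spread


def single_quality(cands):
    """True quality of the answer from the highest-confidence candidate."""
    return max(cands, key=lambda c: c[0])[1]


def confidence_weighted_quality(cands, k=3):
    """Assumed scoring rule: confidence-weighted mean quality of the top k."""
    top_k = sorted(cands, key=lambda c: c[0], reverse=True)[:k]
    total_conf = sum(conf for conf, _ in top_k)
    return sum(conf * quality for conf, quality in top_k) / total_conf


print(spread_fires(candidates))                 # True: 0.85 vs 0.82
print(single_quality(candidates))               # 0.2, the adversary wins
print(round(confidence_weighted_quality(candidates), 3))  # 0.577
```

In this toy case the adversary still contributes to the ensemble, but two honest specialists with similar confidence outweigh it, so the blended answer is far better than routing to the adversary alone.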

The first attempt — bad spread threshold

The initial trigger used spread < 0.01, which fired on 93.9% of queries (because random noise in confidence creates < 0.01 spreads constantly). The result was effectively always-K: it kept always-K's ambiguous-quality lift but also inherited its confident-quality regression.

The lesson: the trigger's calibration is load-bearing. A trigger that fires too often wastes compute and dilutes good answers. A trigger that fires too rarely doesn't help on the queries that need it. The right trigger requires signal agreement before single-routing rather than trusting either signal alone: single-route only when confidence ≥ 0.80 AND spread ≥ 0.10 (both signals show a clean win); otherwise, i.e. when confidence < 0.80 OR spread < 0.10, fall back.

What this validates

Confidence-weighted aggregation is the right ambiguous-query fallback. Not a primary alignment primitive (Bet 77 ruled out that framing), but a router-level escape hatch.
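Two-signal triggers beat single-signal. Either signal alone has gaming surfaces; requiring both to pass before single-routing (an AND-conjunction on the confident side, equivalently an OR on the fallback side) is harder to game.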
Cost is bounded. Assuming a fired query costs 3× a single-routed one (K=3), expected compute is 0.742 × 1 + 0.258 × 3 ≈ 1.52× single-route, about 52% overhead (vs 3×, or 200% overhead, for always-K and 1× for always-single). The federation can sustain this.
Fallback still works under modest adversarial confidence-pumping. Adversaries can't capture all queries by pumping confidence, because the spread signal still detects the near-tie.
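The cost point can be made concrete with a one-line model. The sketch assumes a fired query costs K× a single-routed query (the top specialist's answer is reused, so nothing is double-counted) and that non-fired queries cost 1×; the function name is hypothetical, and real aggregation overhead may differ.

```python
def expected_compute_multiplier(fire_rate: float, k: int = 3) -> float:
    """Expected per-query compute relative to always-single (1.0 = baseline)."""
    if not 0.0 <= fire_rate <= 1.0:
        raise ValueError("fire_rate must be in [0, 1]")
    # Non-fired queries cost 1x; fired queries cost k x (assumption).
    return (1.0 - fire_rate) * 1.0 + fire_rate * k


for label, rate in [("always-single", 0.0),
                    ("threshold-fallback", 0.258),
                    ("always-K", 1.0)]:
    mult = expected_compute_multiplier(rate)
    print(f"{label}: {mult:.3f}x ({(mult - 1.0) * 100:.1f}% overhead)")
# always-single: 1.000x (0.0% overhead)
# threshold-fallback: 1.516x (51.6% overhead)
# always-K: 3.000x (200.0% overhead)
```

Under this accounting the overhead is fire_rate × (K − 1), so a 30% fire rate at K=3 corresponds to 60% overhead.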

What this does not claim

Trigger calibration generalises. Bet 94 calibrated T=0.80 and spread=0.10 to this specific query distribution. Production federations need to recalibrate per-domain.
Adversarial coordination across multiple specialists. A coordinated adversarial ring could collude to all pump confidence to 0.85, forcing all-fallback behaviour. The simulation models a single adversary; multi-adversary coordination is open work.
Multi-step queries. Bet 94 tests single-query fallback. Multi-step queries (Bet 84) need their own fallback semantics.
Latency cost. Fallback fires after the top specialist returns; ensembling adds latency. The simulation only measures quality, not p99 latency. Production needs joint quality-latency optimisation.
Ensemble agreement bonus. When K=3 specialists agree, that's stronger evidence than K=1 alone. Bet 94 weights by confidence; an agreement-weighted ensemble might do even better. Open work.
Fallback recursion. What if the K=3 ensemble itself has a confident answer (top consensus) vs disagreeing answers? Should the router fall back further (K=5, K=7)? The simulation doesn't test this; production policy is an open question.
Confidence calibration drift over time. Specialists' calibration may shift as they receive new training data. The fallback trigger may need periodic recalibration.

The mandate

RFC-0006 §6 (Routing) must specify:

Confidence-weighted aggregation is the fallback, not the default. Default routing is single-route to top-confidence specialist.
Trigger: fall back to top-K=3 confidence-weighted ensemble when (top confidence < T_conf) OR (top-2 spread < T_spread). Default: T_conf = 0.80, T_spread = 0.10.
Per-domain calibration. Federations should recalibrate T_conf and T_spread per query domain (medical, legal, casual, code) using held-out validation queries.
Compute budget. Fire rate target ≤ 30% to bound compute multiplier ≤ 80% above single-route baseline.
No global ensembling. Always-K (K-of-N debate or aggregation) is explicitly forbidden as the default policy. (Bet 77 mandate, reaffirmed.)
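Per-domain calibration under a fire-rate budget could look roughly like the grid search below. It is a sketch only: the record layout (top confidence, second confidence, single-route quality, ensemble quality), the toy validation records, the grids, and the `calibrate` helper are all hypothetical, and the RFC does not prescribe a search procedure.

```python
from itertools import product

# Hypothetical held-out records for one domain:
# (top_conf, second_conf, quality_if_single, quality_if_ensemble)
records = [
    (0.95, 0.30, 0.90, 0.70), (0.92, 0.40, 0.85, 0.70),
    (0.90, 0.35, 0.88, 0.72), (0.88, 0.50, 0.80, 0.68),
    (0.91, 0.20, 0.90, 0.75), (0.93, 0.45, 0.86, 0.71),
    (0.89, 0.30, 0.84, 0.70), (0.60, 0.20, 0.20, 0.35),
    (0.85, 0.82, 0.20, 0.55), (0.87, 0.60, 0.82, 0.66),
]


def calibrate(records, conf_grid, spread_grid, max_fire_rate=0.30):
    """Pick thresholds maximising mean quality within the fire-rate budget.

    Returns None if no grid point respects the budget. Ties keep the
    first grid point encountered.
    """
    best = None
    for t_conf, t_spread in product(conf_grid, spread_grid):
        fired = [top < t_conf or (top - second) < t_spread
                 for top, second, _, _ in records]
        fire_rate = sum(fired) / len(records)
        if fire_rate > max_fire_rate:
            continue  # over the compute budget
        total = 0.0
        for (_, _, q_single, q_ens), f in zip(records, fired):
            total += q_ens if f else q_single
        mean_quality = total / len(records)
        if best is None or mean_quality > best["mean_quality"]:
            best = {"t_conf": t_conf, "t_spread": t_spread,
                    "fire_rate": fire_rate, "mean_quality": mean_quality}
    return best


best = calibrate(records, [0.5, 0.7, 0.8, 0.9], [0.01, 0.05, 0.10, 0.30])
print(best)
```

On the toy records the search settles on thresholds that fire only on the low-confidence query and the near-tie query; a real calibration would use far more validation queries per domain.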

Run command

PYTHONPATH=src python -m experiments.bets.94_confidence_weighted_fallback

Output: experiments/bets/results/94_confidence_weighted_fallback.json records per-policy, per-query-difficulty quality plus fire rate and quality deltas.

Related bets

Bet 77: adversarial debate. Surfaced the catastrophic always-K result; Bet 94 quantifies the fallback policy that recovers most of the lost quality.
Bet 88: reputation under Byzantine. Routing-formula structure; Bet 94 is the per-query fallback policy that composes with the reputation-weighted routing.
Bet 18: glass-box LLM. Per-specialist log-prob trace gives the confidence signal Bet 94 needs.
Bet 84: specialist-to-specialist messaging. Multi-step queries; Bet 94's fallback applies per-step.

Why it matters

The federation needs an answer to: "what does the router do when it's not sure?" Naive answers (always single, always K) both have catastrophic regimes. Bet 94 surfaces the load-bearing fallback policy with calibrated trigger thresholds. Without that policy, the federation must choose between two bad defaults; with it, the federation chooses correctly per-query.

The methodological lesson: router-level decisions have non-obvious calibration. A poorly-calibrated trigger (Bet 94's first attempt with spread < 0.01) collapses to one of the bad defaults. The catalogue's discipline forces the federation to measure the trigger before shipping it.

The catalogue's contribution: turning "fall back to ensemble when uncertain" from intuition into a quantified, mandated, calibrated policy. RFC-0006 §6 now specifies the trigger thresholds, fire-rate budget, and per-domain calibration discipline. The federation's coordinator can ship this directly.