Bet 94 — Confidence-weighted routing fallback (PESSIMIST)

A clean STRICT pass after tightening the trigger. Confidence-weighted aggregation, used as a fallback rather than as a default, preserves confident-query quality at 100.7% of single-route AND improves ambiguous-query quality by +3.8 percentage points, while firing on only 25.8% of queries. The two-signal trigger (top confidence < 0.80 OR top-2 spread < 0.10) cleanly separates queries where ensembling helps from queries where ensembling hurts.

The frame: Bet 77 surfaced that naive K-of-N adversarial debate hurts vs single-routed answer (-11.9 pp). But confidence-weighted aggregation recovered most of the loss. Bet 94 quantifies WHEN to invoke confidence-weighted aggregation as a fallback. Two regimes:

  1. Routing is confident (top specialist's confidence ≥ threshold AND clear gap to second-best): use single answer.
  2. Routing is uncertain (top specialist's confidence < threshold OR top-2 within spread): fall back to confidence-weighted ensemble of top-K.

The pessimist concerns going in:

  1. Confidence is gameable — an adversary returns spuriously-high confidence to capture all queries.
  2. Confidence calibration drifts — a specialist's confidence may be miscalibrated; threshold-based fallback fires unnecessarily.
  3. Ensemble cost — aggregation costs >2× compute; if it fires too often, the federation can't sustain it.
  4. Tie-breaking — top-K ties by confidence; how to break without introducing bias.

Background — the per-query failure mode

Bet 77's catastrophic finding was that K-of-N debate hurts vs single-routed answer. The per-query reason: when the router has good confidence in the top specialist, ensembling top-K dilutes the signal with worse answers. When the router has poor confidence, ensembling top-K might help by averaging out noise.

The catalogue needs a router-level decision: how does the router know which regime it's in? Two natural signals are available:

  1. Absolute top confidence. If top specialist says "0.95 confidence," it's probably right; route single. If top says "0.55 confidence," it's probably uncertain; ensemble.
  2. Relative spread between top-1 and top-2. If top has 0.90 and second has 0.30, top is clearly winning. If top has 0.65 and second has 0.62, it's a near-tie; ensemble.

Both signals are gameable on their own. Adversaries can pump confidence (Signal 1). Adversaries can match the top specialist's confidence to force a near-tie (Signal 2). The composition is what gives the right answer.

Hypothesis

A two-signal fallback policy (top confidence < threshold OR top-2 spread < threshold) preserves confident-query quality at ≥ 98% of always-single-route, improves ambiguous-query quality by ≥ +2 pp, and fires on < 30% of queries.

Pre-registered criteria

  • STRICT: confident-quality ratio ≥ 0.98, ambiguous-quality delta ≥ +0.02, fire rate < 0.30, fallback ≥ single on ambiguous.
  • LENIENT: ratio ≥ 0.95, delta ≥ +0.01, fire < 0.50, fallback ≥ single.
  • CATASTROPHIC: fallback strictly worse than single on confident OR ambiguous (delta < -0.02).

Setup

  • 50 specialists, 5% adversarial (spurious high-confidence at 0.85, true quality 0.20).
  • 1000 queries: 70% confident (low difficulty), 30% ambiguous (high difficulty).
  • Confidence model. Honest specialists: confidence ~ true_quality, modulated by calibration. Adversarial: high confidence regardless of query.
  • Three policies tested: always-single, always-K=3, threshold-fallback (T=0.80, spread=0.10).
  • Quality measurement. True quality of the routed/ensembled answer (ground truth).

Result — STRICT PASS

| Policy | Confident quality | Ambiguous quality | Fire rate | |---|---|---|---| | always-single | 0.813 | 0.193 | — | | always-K | 0.768 | 0.335 | 100% | | threshold-fallback | 0.819 | 0.231 | 25.8% |

| Delta | Confident-quality ratio | Ambiguous-quality delta | |---|---|---| | Fallback vs Single | 1.007× (slightly better) | +0.038 |

Fallback preserves confident-query quality (1.007× of single-route), improves ambiguous-query quality by +3.8 pp, and fires on only 25.8% of queries.

Why this works

Three structural reasons:

  1. The two-signal AND-NOT trigger is precise. A query that has high top confidence AND clear gap is genuinely confident; ensembling can only dilute. A query that lacks either signal is ambiguous; ensembling can only help (or be neutral).

  2. Adversarial confidence-pumping is partially defended. When the adversary pumps confidence to 0.85, the trigger fires (top confidence < 0.80 is FALSE, but the spread to second-best is small because the adversary's confidence is fixed at 0.85 and other specialists' confidences cluster near it). So the router falls back to ensemble — which weights by confidence × quality, and the adversary's low quality pulls down its weight.

  3. The fire rate (25.8%) matches the difficult-query fraction (~30%). The trigger isn't firing randomly; it's firing on the queries that actually need it. This is the right calibration.

The first attempt — bad spread threshold

The initial trigger used spread < 0.01, which fired on 93.9% of queries (because random noise in confidence creates < 0.01 spreads constantly). The result was effectively always-K, with the catastrophic ambiguous-quality lift but ALSO the confident-quality regression.

The lesson: the trigger's calibration is load-bearing. A trigger that fires too often wastes compute and dilutes good answers. A trigger that fires too rarely doesn't help on the queries that need it. The right trigger is signal-agreement, not signal-individual: confidence < 0.80 AND spread < 0.10 (i.e., neither is winning cleanly) is the right composite.

What this validates

  • Confidence-weighted aggregation is the right ambiguous-query fallback. Not a primary alignment primitive (Bet 77 ruled out that framing), but a router-level escape hatch.
  • Two-signal triggers beat single-signal. Either signal alone has gaming surfaces; the AND-conjunction is harder to game.
  • Cost is bounded. 25.8% fire rate × 3× compute per fired query = 77% effective compute multiplier (vs 200% for always-K, 100% for always-single). The federation can sustain this.
  • Fallback still works under modest adversarial confidence-pumping. Adversaries can't capture all queries by pumping confidence, because the spread signal still detects the near-tie.

What this does not claim

  • Trigger calibration generalises. Bet 94 calibrated T=0.80 and spread=0.10 to this specific query distribution. Production federations need to recalibrate per-domain.
  • Adversarial coordination across multiple specialists. A coordinated adversarial ring could collude to all pump confidence to 0.85, forcing all-fallback behaviour. The simulation models a single adversary; multi-adversary coordination is open work.
  • Multi-step queries. Bet 94 tests single-query fallback. Multi-step queries (Bet 84) need their own fallback semantics.
  • Latency cost. Fallback fires after the top specialist returns; ensembling adds latency. The simulation only measures quality, not p99 latency. Production needs joint quality-latency optimisation.
  • Ensemble agreement bonus. When K=3 specialists agree, that's stronger evidence than K=1 alone. Bet 94 weights by confidence; an agreement-weighted ensemble might do even better. Open work.
  • Fallback recursion. What if the K=3 ensemble itself has a confident answer (top consensus) vs disagreeing answers? Should the router fall back further (K=5, K=7)? The simulation doesn't test this; production policy is an open question.
  • Confidence calibration drift over time. Specialists' calibration may shift as they receive new training data. The fallback trigger may need periodic recalibration.

The mandate

RFC-0006 §6 (Routing) must specify:

  1. Confidence-weighted aggregation is the fallback, not the default. Default routing is single-route to top-confidence specialist.
  2. Trigger: fall back to top-K=3 confidence-weighted ensemble when (top confidence < T_conf) OR (top-2 spread < T_spread). Default: T_conf = 0.80, T_spread = 0.10.
  3. Per-domain calibration. Federations should recalibrate T_conf and T_spread per query domain (medical, legal, casual, code) using held-out validation queries.
  4. Compute budget. Fire rate target ≤ 30% to bound compute multiplier ≤ 80% above single-route baseline.
  5. No global ensembling. Always-K (K-of-N debate or aggregation) is explicitly forbidden as the default policy. (Bet 77 mandate, reaffirmed.)

Run command

PYTHONPATH=src python -m experiments.bets.94_confidence_weighted_fallback

Output: experiments/bets/results/94_confidence_weighted_fallback.json records per-policy, per-query-difficulty quality plus fire rate and quality deltas.

  • Bet 77: adversarial debate. Surfaced the catastrophic always-K result; Bet 94 quantifies the fallback policy that recovers most of the lost quality.
  • Bet 88: reputation under Byzantine. Routing-formula structure; Bet 94 is the per-query fallback policy that composes with the reputation-weighted routing.
  • Bet 18: glass-box LLM. Per-specialist log-prob trace gives the confidence signal Bet 94 needs.
  • Bet 84: specialist-to-specialist messaging. Multi-step queries; Bet 94's fallback applies per-step.

Why it matters

The federation needs an answer to: "what does the router do when it's not sure?" Naive answers (always single, always K) both have catastrophic regimes. Bet 94 surfaces the load-bearing fallback policy with calibrated trigger thresholds. Without that policy, the federation must choose between two bad defaults; with it, the federation chooses correctly per-query.

The methodological lesson: router-level decisions have non-obvious calibration. A poorly-calibrated trigger (Bet 94's first attempt with spread < 0.01) collapses to one of the bad defaults. The catalogue's discipline forces the federation to measure the trigger before shipping it.

The catalogue's contribution: turning "fall back to ensemble when uncertain" from intuition into a quantified, mandated, calibrated policy. RFC-0006 §6 now specifies the trigger thresholds, fire-rate budget, and per-domain calibration discipline. The federation's coordinator can ship this directly.