On-device phone validation

The phone-class deployment story is one of the federation's strongest commitments — billions of phone-grade devices worldwide, each potentially a federation worker, each contributing computation and personalisation locally. The story has multiple gates. Two are closed by existing bets; the rest are open and require real-device measurement that the harness can't perform. Phone deployment is real future work, not a current capability.

The gap between "what the bets harness has measured" and "what phone deployment requires" is wide enough that the catalogue's discipline is to keep the question explicitly open. Bet 09 ran a "phone is the unit" sanity check via simulation; it suggested that per-token compute is feasible on phone-class hardware in principle. Bet 45 (throttle-invariance) confirms the scheduler tolerates slow workers. Bet 63 (numerical robustness) confirms fp16 math has three to four orders of magnitude of headroom. None of these ran on actual phones.

What's already established

Two preconditions for phone deployment have closed:

  • Bet 45 (throttle-invariance). The scheduler tolerates slow workers correctly. Bit-exact tokens across no-throttle, steady-throttle, ramp-throttle conditions over 100 inference runs. This validates the protocol behaviour under heterogeneous worker speeds — a phone running 10× slower than a desktop in the same federation produces the same output. It does not validate phone deployment itself; it only validates the scheduling primitive that phone deployment needs.

  • Bet 63 (numerical robustness). Hidden-state Gaussian noise up to σ_rel = 1e-1 doesn't destabilise personalisation. Phone-class fp16 math is numerically safe by three to four orders of magnitude relative to typical fp16 round-off (relative error around 1e-5 to 1e-4). Whatever precision phones operate at, the federation's numerical assumptions hold.
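The perturbation style Bet 63 describes can be sketched in a few lines. This is an illustrative reconstruction, not the harness's actual code; `perturb_hidden`, the seed, and the hidden-state size are assumptions:

```python
import numpy as np

def perturb_hidden(h, sigma_rel, rng=None):
    """Add Gaussian noise scaled to each element's own magnitude --
    the relative-noise model Bet 63 stresses up to sigma_rel = 1e-1."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return h + rng.normal(size=h.shape) * sigma_rel * np.abs(h)

# Compare the tolerated noise floor to what fp16 actually introduces.
h = np.random.default_rng(1).standard_normal(4096).astype(np.float32)
h16 = h.astype(np.float16).astype(np.float32)   # simulate an fp16 round-trip
rel_err = np.abs(h16 - h) / np.maximum(np.abs(h), 1e-8)
# Typical fp16 round-off is ~1e-4 relative -- well inside the 1e-1 margin.
```

The point the numbers make: the noise the federation provably tolerates is orders of magnitude larger than the noise fp16 arithmetic actually injects.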

These are positive results: they establish that the federation's protocol and numerical foundations are compatible with phone-class workers.

What's still open

The closed preconditions don't cover the operational challenges that actually matter for phone deployment:

  • Thermal-driven model eviction. Phones throttle CPU/GPU under sustained load. A phone running federation worker tasks for 15-30 minutes will hit thermal limits and either slow dramatically (the SoC clocks down by 50%+) or be killed by the OS to protect the device. The federation can tolerate slow workers (Bet 45) but the question of how slow they get under realistic thermal conditions, and whether the federation can keep them productive enough to be worth including, is unmeasured.

  • OS process kill behaviour. Android and iOS aggressively kill background processes under memory pressure. iOS in particular has very strict background-execution policies — apps are typically given seconds to minutes of background time before being killed. Android is more permissive but still kills background processes when foreground apps need RAM. The federation worker must survive being killed and restarted; whether it can rejoin the pipeline within latency tolerances is unmeasured.

  • Memory pressure under multitasking. A phone running a federation worker alongside the user's normal apps (browser, messaging, photos, video calls) operates with severe memory contention. A 1B-parameter model in 4-bit needs ~500 MB. Add KV-cache, adapter state, runtime, browser, the messaging app — and the phone is at the memory-pressure boundary. Whether the federation can hold its working set in memory under realistic multitasking is unmeasured.

  • Sustained-load hardware degradation. Phone hardware is not designed for hours of GPU-bound inference. Battery wear (lithium chemistry degrades faster under sustained high temperatures), thermal cycling (repeated hot-cold cycles stress the SoC packaging), and storage write-amplification (frequent writes wear out the flash) are all open concerns. A federation that wears out the user's phone is a federation that won't be tolerated even if technically functional.

  • User-perceived latency. Even if the federation is correct and the phone is willing, the latency of getting tokens through a phone-included pipeline may be unacceptable to the user. Interactive workloads have sub-second per-token expectations; a phone in the pipeline may push that to multi-second territory. Whether users tolerate this is a UX question, not just a technical one.

  • Network behaviour on cellular vs Wi-Fi. Phones spend substantial time on cellular networks with different latency, throughput, and reliability profiles than residential Wi-Fi. The federation may need to adapt its protocol parameters per-network or pause federation participation during cellular use. Unmeasured.
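The memory-pressure arithmetic above is easy to make concrete. The layer count, head geometry, and context length below are illustrative assumptions, not measured federation numbers; only the 500 MB weight figure comes from the text:

```python
def model_weight_bytes(params, bits_per_weight):
    """Quantised weight footprint, ignoring per-group scales and zeros."""
    return params * bits_per_weight // 8

def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    """K and V tensors for every layer at a given context length (fp16)."""
    return 2 * layers * seq_len * heads * head_dim * bytes_per_elem

weights = model_weight_bytes(1_000_000_000, 4)   # 500 MB, the figure in the text
kv = kv_cache_bytes(layers=16, heads=32, head_dim=64, seq_len=2048)  # ~268 MB
total_mb = (weights + kv) / 1e6
# Before counting the runtime, adapter state, and the user's own apps,
# the model alone is pushing toward a phone's practical working-set ceiling.
```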

Why the bets harness can't reach this

The bets harness runs on developer machines (M1 Max, M4 Max). It does not run on phones. To make progress here, the federation needs:

  1. A target device. An Android phone (Pixel 8, Galaxy S24, OnePlus 12) and an iOS device (iPhone 15 Pro or iPad Pro). Bet 09 ran a "phone is the unit" sanity check via simulation; real-device testing is open. The simulation predicts feasibility but doesn't catch the operational concerns above.

  2. A federation worker port. The Python-based worker doesn't run on iOS; a natively compiled Rust or C++ worker is the realistic path there. On Android, Termux + Python is feasible for prototyping but not for production deployment (Termux requires a manual install; few users will go through that). A native Android app embedding the worker is the production path. This is a substantial engineering project, not a research bet.

  3. A sustained-load test. Hours of inference under the federation, with thermal monitoring, OS-event logging (for process-kill events), and battery-drain measurement. A short test won't catch the thermal and battery effects that matter for deployment viability.

  4. A side-by-side baseline. What does the same workload look like on a non-federation client (e.g., the user just running a local model on their phone)? The federation has to be at least competitive in user-perceived terms, not just feasible. If running a smaller model fully on-device is faster and uses less battery than the federation, the federation's value proposition for phone users is unclear.

  5. A privacy / consent framework. Phones have intimate user data; running federation training or inference on them touches privacy-sensitive material. The federation needs an explicit consent UX, an explicit "what data leaves the phone" model, and probably an audit log the user can inspect. Bet 18's glass-box LLM provides the audit primitive; the user-facing version is open work.
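The sustained-load logging in item 3 doesn't need much machinery. Below is a sketch of the kind of field logger involved, assuming Linux/Android-style sysfs paths — these paths vary by device, may require permissions, and are illustrative rather than a tested harness:

```python
import json
import pathlib
import time

# Common on Linux-kernel phones, but device-dependent; treat as assumptions.
THERMAL = pathlib.Path("/sys/class/thermal/thermal_zone0/temp")
BATTERY = pathlib.Path("/sys/class/power_supply/battery/capacity")

def read_int(p):
    """Return the file's integer contents, or None if unreadable."""
    try:
        return int(p.read_text().strip())
    except (OSError, ValueError):
        return None

def sample():
    return {
        "t": time.time(),
        "temp_milli_c": read_int(THERMAL),   # millidegrees Celsius on most kernels
        "battery_pct": read_int(BATTERY),
    }

def log_run(duration_s=3600, interval_s=10, path="thermal_log.jsonl"):
    """Append one JSON line per sample for the duration of the test."""
    end = time.time() + duration_s
    with open(path, "a") as f:
        while time.time() < end:
            f.write(json.dumps(sample()) + "\n")
            f.flush()                         # survive an abrupt OS process kill
            time.sleep(interval_s)
```

Flushing every sample matters here: the process-kill events the test is hunting for would otherwise destroy the evidence of the conditions that preceded them.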

What the smallest useful demo looks like

A minimum viable phone-deployment demo that would tell us where the real obstacles are:

  • Single phone hosting one layer of a 1B model. Not a full federation — just one worker contributing to a multi-machine pipeline. The phone receives layer-forward requests over Wi-Fi from a desktop coordinator, runs the layer, returns activations.
  • Stays alive for one hour. Not "production-ready" — just "doesn't crash, doesn't get killed, doesn't overheat catastrophically."
  • Records every relevant event. Thermal throttle events, OS pressure signals, network errors, per-token latency.
  • Compared to baseline. Same workload run with desktop-only workers. The phone's contribution should be measurable — even if it's just "the phone handled 10% of the tokens" — to confirm the integration is real.

That's not federation, technically — it's a single-worker offload — but it would tell us what breaks first under real device conditions, and that's the starting point for the open work.
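The single-layer offload the demo describes is small enough to sketch. Below is a hypothetical prototype of the phone side: a loop that accepts length-prefixed float32 activation frames over TCP, applies one stand-in layer, and returns the result. The framing, port, hidden size, and layer itself are all assumptions for illustration, not the federation's wire protocol:

```python
import socket
import struct

import numpy as np

HIDDEN = 2048                        # hypothetical hidden size
rng = np.random.default_rng(0)
W = (rng.standard_normal((HIDDEN, HIDDEN)) * 0.02).astype(np.float32)

def forward(h):
    """Stand-in for one transformer layer: linear map + ReLU + residual."""
    return h + np.maximum(h @ W, 0.0)

def recv_exact(conn, n):
    """Read exactly n bytes or raise if the peer disconnects mid-frame."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        buf += chunk
    return buf

def serve(host="0.0.0.0", port=9999):
    """Handle one length-prefixed activation frame per connection."""
    with socket.create_server((host, port)) as srv:
        while True:
            conn, _ = srv.accept()
            with conn:
                (n,) = struct.unpack("!I", recv_exact(conn, 4))
                h = np.frombuffer(recv_exact(conn, n), dtype=np.float32)
                out = forward(h.reshape(-1, HIDDEN)).astype(np.float32).tobytes()
                conn.sendall(struct.pack("!I", len(out)) + out)
```

The coordinator side mirrors the framing: send a 4-byte big-endian length, then the float32 activations, then read back the same frame shape. A real worker would load an actual model shard and add the reconnect/retry logic the process-kill question above demands.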

What we'd learn

The phone-class deployment story is the strongest version of the community-owned thesis. If phones can contribute compute to a shared LLM federation, billions of devices become potential workers. The federation's hardware base shifts from "tens of millions of desktops/laptops" to "billions of phones," which changes the economics and the political case for the project entirely.

If phones can't contribute reliably, the federation's worker tier is desktops and laptops. Still a useful project — millions of devices, the per-user adapter story still applies — but with a smaller addressable hardware base than the maximalist version.

The Bet 09 simulation was encouraging; real-device measurement will settle the question. There are roughly three possible outcomes:

  • Phones are fine. Modern phones can sustain federation workloads at acceptable thermal, battery, and memory cost. The community-ownership thesis extends to billions of devices. This would be a major win.
  • Phones work for short bursts but not sustained. Useful for occasional contribution (e.g. when charging, on Wi-Fi, with the screen off) but not always-on. The federation can include phones as opportunistic workers but can't rely on them for the steady-state pipeline. Still useful — opportunistic contribution is real value — but the economics are different.
  • Phones are too constrained for useful contribution. Thermal limits, battery cost, or memory pressure make phone participation a bad deal for the user. The federation excludes phones and focuses on the desktop/laptop tier. The community-ownership thesis narrows.

How this connects to other open questions

Phone deployment depends on multiple other open questions clearing first:

  • Real-WAN throughput (open) is a precondition. Phones live on residential ISPs and cellular networks. Until the federation works on consumer internet at all, phone-class participation is moot.
  • 1B+ scale personalisation (open) is a precondition. Phones are useful workers only if they're contributing to inference on models that justify the deployment effort. Running a federation on a 30M model on phones is engineering theatre; running it on a 1B model is real value. The 1B scale has to clear first.
  • Tolerance for thermal / memory / process-kill conditions (this question) is the phone-specific gate.

All three need to clear before the phone-deployment story is real; none has yet. The catalogue's discipline is to make this dependency chain explicit.

Why this stays in the open-questions chapter

The bets harness can't measure on-device phone behaviour. The work is operationally distinct from a research bet — it requires hardware procurement, mobile development, and a different style of measurement (longitudinal field testing, not lab experiments). The catalogue keeps it explicit so readers don't extrapolate from the simulation results to "phones work."

When this question resolves, it will be one of the most operationally interesting updates to the catalogue. Until then, the federation's worker tier is reliably desktop/laptop class, with phone participation as a known-aspirational future direction. The catalogue's claims should match exactly this scope: "the federation works on developer-class hardware on LAN; phone-class deployment is open work, with the simulation-level groundwork done but no real-device validation."

That's a calibrated scope. The federation's deployment story is built on what's actually been measured, with explicit acknowledgment of what hasn't. Phone deployment is the largest single piece of "what hasn't been measured but matters for the maximalist version of the thesis." The open-questions chapter is the catalogue's commitment to keeping that distinction clear.