Bet 58 — Activation steering (1 KB last-layer vector)
A secondary primitive below norm-only. Add a trained additive vector to the residual stream at a single transformer block — usually the last block; nothing else is trainable. At the FractalMoE 30M scale, that's 256 parameters, ~1 KB on disk in fp16: an order of magnitude smaller than norm-only's 9 KB and five orders of magnitude smaller than full fine-tune's 155 MB.
The bet exists because the per-user adapter design space is bigger than just norm-only and LoRA-r4. Activation steering is a known primitive in the interpretability literature (Turner et al., 2023, the "activation addition" line of work), where steering vectors are typically extracted from contrastive prompts to manipulate model behaviour. The bet asks a different question: can we train a steering vector by gradient descent on a per-user corpus, and does that produce a usable per-user adapter?
The expected answer was somewhere between "competitive with norm-only" and "interesting curiosity, not production-grade." The actual answer is more interesting than either: steering vectors beat norm-only on one user out of three by a large margin, and lose on the other two. This makes steering vectors a secondary primitive — useful when the user's distribution is one where steering wins, less useful otherwise. The federation can ship both formats and route per-user, picking the right adapter for each. That option didn't exist before this bet.
What activation steering is, mechanically
A standard transformer block updates the residual stream by:
x_out = x_in + attention(LN(x_in)) + ffn(LN(x_in + attention(LN(x_in))))
Steering injects an additional additive term:
x_out = x_in + attention(LN(x_in)) + ffn(LN(x_in + ...)) + s
Where s is the trained steering vector — a single tensor of shape [hidden_dim], the same shape as a residual-stream slice. At 30M scale with hidden_dim=256, s has 256 parameters.
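For concreteness, the whole adapter can be expressed as a thin wrapper around one frozen block. A minimal PyTorch sketch; `SteeredBlock` and its attribute names are illustrative, not the repo's actual module:

```python
import torch
import torch.nn as nn

class SteeredBlock(nn.Module):
    """Illustrative wrapper: a frozen transformer block plus a trainable
    steering vector s added to its residual-stream output."""

    def __init__(self, block: nn.Module, hidden_dim: int = 256):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)  # base model stays frozen
        self.steer = nn.Parameter(torch.zeros(hidden_dim))  # the 256-param adapter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # s broadcasts over batch and sequence dims: [B, T, 256] + [256]
        return self.block(x) + self.steer
```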
Two design choices in the bet:
- Single-layer vs all-layer. A single-layer steer adds s only at one specific block. An all-layer steer adds a different s_i at each of the 4 blocks. Single-layer = 256 params, all-layer = 1024 params.
- Block placement. Among the 4 blocks, which one to steer at. We tried first, middle, and last. Last-block steering wins.
Setup
Three users (programmer / novelist / scientist), three steering configurations:
- Single-layer last-block — 256 params, ~1 KB. Adds s at the last transformer block's output.
- All-layer — 1024 params, ~2 KB. Adds a different s_i at each of the 4 blocks.
- Single-layer first-block — 256 params, ~1 KB. Adds s at the first transformer block's output. (Sanity check; shouldn't beat last-block.)
All configurations train by gradient descent with the same recipe as Bet 49: AdamW, lr 5e-5, 100 steps. Compared against norm-only (the production default).
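A minimal training loop under that recipe, with `model` (base plus steering adapter) and `user_batches` as stand-ins for the repo's actual objects:

```python
import torch
import torch.nn.functional as F

# Stand-ins: `model` is the frozen base with the steering adapter attached;
# `user_batches` yields (inputs, targets) token batches from the user corpus.
trainable = [p for p in model.parameters() if p.requires_grad]  # just s
opt = torch.optim.AdamW(trainable, lr=5e-5)

for step, (inputs, targets) in zip(range(100), user_batches):  # 100-step budget
    logits = model(inputs)  # [B, T, vocab]
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    opt.zero_grad()
    loss.backward()  # gradients reach only the steering vector
    opt.step()
```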
Pre-registered criteria
- STRICT: single-layer last-block steering ppl < norm-only ppl on ≥ 1 user AND within 1.5× on the others.
- LENIENT: within 1.5× of norm-only on all 3 users.
- CATASTROPHIC: all-layer steering version overshoots and breaks held-out general capability.
The STRICT bar requires steering to beat norm-only on at least one user. If it didn't beat norm-only anywhere, it wouldn't be a useful secondary primitive — the federation would just default to norm-only everywhere. The "within 1.5× on the others" clause prevents steering from being catastrophic on the users it doesn't help.
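The grading is mechanical enough to state as code. A sketch (the CATASTROPHIC check on held-out general capability is judged separately):

```python
def grade(steer_ppl: dict[str, float], norm_ppl: dict[str, float]) -> str:
    """Pre-registered grading for single-layer last-block steering."""
    users = list(norm_ppl)
    wins = any(steer_ppl[u] < norm_ppl[u] for u in users)
    within = all(steer_ppl[u] <= 1.5 * norm_ppl[u] for u in users)
    if wins and within:
        return "STRICT"   # beats norm-only somewhere, within 1.5x everywhere
    if within:
        return "LENIENT"  # within 1.5x of norm-only on all users
    return "FAIL"
```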
Result — STRICT PASS, with one big surprise
The headline numbers:
| User | Norm-only ppl | Single-layer (last) ppl | All-layer ppl |
|---|---|---|---|
| programmer | 6,922 | 4,716 | overshoots |
| novelist | 113 | 124 | overshoots |
| scientist | 130 | 138 | overshoots |
Single-layer last-block steering beats norm-only on the programmer user by 32% (6,922 → 4,716). That's the surprise. The other users are within 10% (within the LENIENT bound but worse than norm-only). The all-layer version is catastrophic on every user — too many degrees of freedom, overfits, breaks held-out behaviour.
The 32% improvement on the programmer user is large enough to justify a production-grade alternative. For programmer-style users (Python source code, technical documentation), single-layer last-block steering is better than norm-only. For prose-style users (novelist, scientist), norm-only remains the better default.
Why single-layer last-block wins for the programmer
The programmer user's text is structurally narrow — heavy on Python keywords, identifier conventions, structured tokens. The last transformer block's output is "close to the prediction" — the residual stream at that point has been shaped by all prior blocks and is one linear projection away from the logit vector. Adding a learned additive vector at the last block effectively biases the output distribution toward tokens the user prefers, without changing the upstream features.
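One way to see the "close to the prediction" point: ignoring the final LayerNorm, an additive vector at the last block is exactly a constant bias on the logits. A toy check with hypothetical shapes:

```python
import torch

vocab, d = 1000, 256
W_U = torch.randn(vocab, d) / d**0.5  # toy unembedding matrix
x = torch.randn(d)                    # last-block residual, one position
s = 0.1 * torch.randn(d)              # trained steering vector

# (x + s) @ W_U^T = x @ W_U^T + s @ W_U^T, so steering adds the fixed
# per-token bias W_U @ s to the logits at every position.
assert torch.allclose(W_U @ (x + s), W_U @ x + W_U @ s, atol=1e-5)
```

In practice the final LayerNorm makes the bias only approximately constant, but the intuition of shifting the output distribution survives.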
This is structurally different from what norm-only does. Norm-only scales the per-channel norms throughout the model — it's a multiplicative adjustment that propagates through all subsequent attention and FFN computations. Single-layer last-block steering is an additive adjustment at the very end, with no propagation. The two adapter formats are doing different things to the model.
For users whose distribution shift is last-step-localisable — the user's preferences manifest as which tokens are likely-or-unlikely given the upstream features that the base model already computes — single-layer steering captures the shift cheaply. The programmer user's distribution shift fits this description (narrow vocabulary, narrow structural patterns).
For users whose distribution shift requires upstream feature reweighting — the user prefers different features to be salient throughout the forward pass — norm-only's multiplicative scaling captures the shift better. Prose-style users' distribution shifts seem to fit this description.
This is a hypothesis, not a proof. The per-user-format-selection decision could in principle be made by some learned classifier. We haven't validated such a classifier; the bet establishes that at least one user benefits from a non-default adapter format, which is the entry condition for considering format selection as a problem worth solving.
Why all-layer overshoots
The all-layer version has 1024 trainable parameters. At a 5-minute / 100-step training budget, this is enough capacity to overfit. The held-out perplexity blows up by 5×–50× on every user. The overshoot is consistent with what we observed for full FT in Bet 37 — too many degrees of freedom against too little training data.
This rules out "more steering is better." Steering as a primitive needs to stay restricted — one or two layers, not all. The federation should not generalise from "single-layer steering helps for some users" to "more steering helps more"; the empirical pattern is the opposite.
Failed composition — Bet 59
We followed up by trying to jointly train norm + last-layer steer. The hypothesis was that the two are complementary (different mechanisms, different parameter sets), so training both together should give the union of their benefits.
The result was unambiguously negative. Joint training produces an adapter that's worse than either alone for every user — a 1.20× regression for the scientist (from 130 to 156 ppl), smaller regressions for programmer and novelist. The two primitives don't compose.
The likely cause: when both are trainable simultaneously, the gradient updates for the steering vector compete with the gradient updates for the norm gains. The optimiser ends up partially fitting each but fully fitting neither. The composition is destructive, not additive.
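For concreteness, the failed setup amounts to handing both parameter sets to one optimiser. Toy stand-ins only (norm-only's real gains live throughout the model, not in one tensor):

```python
import torch

d = 256
norm_gains = torch.nn.Parameter(torch.ones(d))  # stand-in for the norm gains
steer = torch.nn.Parameter(torch.zeros(d))      # last-block steering vector

# Bet 59: one optimiser updates both sets each step. Empirically the
# updates compete, and the result is worse than either adapter alone.
opt = torch.optim.AdamW([norm_gains, steer], lr=5e-5)
```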
The federation rule that follows: pick one or the other per user. Don't combine. Bet 59's writeup documents this in detail; the production decision is to expose the format choice (norm-only vs steering) per user but not to mix.
What this means for production
The federation's per-user adapter format becomes:
- Default: norm-only at 9 KB. Production-default for users where norm-only suffices (Bet 49 showed it Pareto-dominates LoRA-r4 and full FT).
- Alternative: single-layer last-block steering at 1 KB. For users where steering wins (currently identified: programmer-style users with narrow vocabulary). Smaller adapter than norm-only.
- Mutually exclusive — pick one per user. Joint training (Bet 59) is destructive; routing must select one format per user.
- Adapter format detection: open work. Currently no automated way to detect which users benefit from steering vs norm-only. Manual selection or heuristics (e.g., "users whose corpus is < 30% prose words") may suffice for early deployment; a learned format-selector is a future research direction.
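As illustration only, the prose-fraction heuristic mentioned in the last item could be as simple as the following; the threshold and the prose-word test are placeholders, not validated:

```python
import re

def choose_format(corpus: str, prose_threshold: float = 0.30) -> str:
    """Placeholder format selector: route low-prose corpora to steering,
    everything else to norm-only. Unvalidated heuristic."""
    tokens = corpus.split()
    prose = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z]+", t))
    fraction = prose / max(len(tokens), 1)
    return "steering" if fraction < prose_threshold else "norm-only"
```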
What this does not claim
- Steering generalises to all narrow-vocabulary users. We tested one programmer fixture. Other narrow-vocabulary users (medical notes, legal contracts, code-switched text) are untested.
- Steering at scale. The 32% margin is at 30M params on a small corpus. Whether it survives at 1B+ on real-user data is open.
- Steering composes with quantisation. Bet 52 validated norm-only + ternary; the equivalent for steering hasn't been tested.
- Steering survives the noise-floor control. Bet 60 ran the random-input control for norm-only; the equivalent for steering is open work. It's possible (likely?) that some of the 32% margin is regularisation rather than personalisation, mirroring the norm-only finding. Until the control runs, the steering claim is provisional.
Run command
```
PYTHONPATH=src python -m experiments.bets.58_activation_steering
```
Output: experiments/bets/results/58_activation_steering.json records all three steering configurations (single-layer last, single-layer first, all-layer) for all three users, with comparison against norm-only and the no-adapter base.
Related entries
- Bet 37, 46, 49: norm-only as the production default. Steering is a secondary primitive against this default.
- Bet 59: joint training of norm + steering. Falsifies composition.
- Bet 60: noise-floor control for norm-only. Equivalent control for steering is open work.
- Bet 61: confusion matrix for norm-only. Equivalent for steering is open.
- Bet 25: soft-prompt compression. Different parameter-efficient primitive, related family.
Why it matters
The result widens the federation's per-user adapter design space from "one default" to "default + alternative." Norm-only is the production default for most users; single-layer steering is the production alternative for some users. The federation can ship both, expose the format as a per-user policy decision, and pick the right adapter for each user.
This is a small but meaningful expansion. Without Bet 58, the federation would commit to norm-only globally, and any user whose distribution favoured steering would be served sub-optimally. With Bet 58, the federation has a second tool, the failure-mode (Bet 59's destructive composition) is clearly identified, and the design space is honestly mapped. The bet's 32% margin on the programmer user is large enough to justify the additional infrastructure (per-user format selection, dual-format adapter loaders, format detection) that supporting two formats requires.
The methodological note: the single-layer surprise — that one specific block placement wins decisively while other placements (first-block, all-layer) lose — is the kind of result that wouldn't have shown up if we'd tested only one steering configuration. The bet's structure (sweep block placement) is what produced the finding. Future per-user adapter bets should follow the same pattern: sweep the relevant design choice, don't pick one configuration arbitrarily.