# Agent Model Experimentation

Swap which model powers a given agent and measure the result.

## When to run

- Costs spiked. Suspect a cheaper model still meets quality.
- Quality dropped. Suspect advisor/full mode fixes it.
- New model released (e.g. `claude-haiku-4-5`).

## Steps

1. **Snapshot baseline** — `/automation/agents` → select agent → scroll to Recent Runs. Note last-10 success rate, avg cost, and avg latency.
2. **Change mode** — `/automation/settings` → Agents tab → Edit the agent → Mode dropdown: eco / standard / advisor / full.
3. **Flush cache** — no action needed. The provider resolver reads AgentConfig on every run, so the new mode applies to the next run.
4. **Run 10 trials** — trigger the agent 10 times with representative inputs.
   - For Lead Scorer: `bun run scripts/test-lead-scorer.ts` (10-run harness).
5. **Compare** — check Recent Runs; compute the new success rate, avg cost, and avg latency, then compare each against the baseline from step 1.
6. **Decide**:
   - Success rate dropped >5% vs. baseline: revert.
   - Cost cut ≥30% and success rate within 2% of baseline: keep.
   - Mixed: run another 10.
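
The compare and decide steps can be sketched in TypeScript as below. This is a minimal sketch, not part of the existing tooling. The `RunRecord` fields and the `summarize` and `decide` helpers are hypothetical names, and the Recent Runs data may be shaped differently. It also assumes the success-rate thresholds are percentage points and that an improved success rate counts as "within 2%".

```typescript
// Hypothetical shape of one run; field names are assumptions,
// not the actual Recent Runs schema.
interface RunRecord {
  success: boolean;
  costUsd: number;
  latencyMs: number;
}

interface RunSummary {
  successRate: number; // fraction in [0, 1]
  avgCostUsd: number;
  avgLatencyMs: number;
}

type Decision = "revert" | "keep" | "run-more";

// Collapse a batch of runs (e.g. the last 10) into the three tracked metrics.
function summarize(runs: RunRecord[]): RunSummary {
  if (runs.length === 0) throw new Error("need at least one run");
  const n = runs.length;
  const successes = runs.filter((r) => r.success).length;
  const totalCost = runs.reduce((sum, r) => sum + r.costUsd, 0);
  const totalLatency = runs.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    successRate: successes / n,
    avgCostUsd: totalCost / n,
    avgLatencyMs: totalLatency / n,
  };
}

// Apply the runbook thresholds in order: revert, keep, otherwise run more.
function decide(baseline: RunSummary, trial: RunSummary): Decision {
  // Assumption: success-rate thresholds are percentage points.
  const rateDropPts = (baseline.successRate - trial.successRate) * 100;
  // Cost cut is relative to the baseline average cost.
  const costCutPct =
    baseline.avgCostUsd > 0
      ? ((baseline.avgCostUsd - trial.avgCostUsd) / baseline.avgCostUsd) * 100
      : 0;

  if (rateDropPts > 5) return "revert";
  // Assumption: a higher success rate than baseline also counts as "within 2%".
  if (costCutPct >= 30 && rateDropPts <= 2) return "keep";
  return "run-more";
}
```

With only 10 trials, each failure moves the success rate by 10 points, so under this reading a single extra failure already lands in the revert branch.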

## Mode cheat sheet

| Mode | Model | Use when |
|------|-------|----------|
| eco | claude-haiku-4-5 | High volume, simple classification, deterministic tasks |
| standard | claude-sonnet-4-6 | Default. Drafts, emails, routing logic |
| advisor | claude-sonnet-4-6 | Strategic suggestions (orchestrator tick) |
| full | claude-opus-4-7 | Complex reasoning, multi-skill orchestration |
| custom | Provider override | Bring-your-own config |
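
For scripts that need the same mapping, the cheat sheet can be written as a typed lookup. This is only an illustration: `MODE_MODELS` and `modelForMode` are hypothetical names, and the real provider resolver may resolve modes differently.

```typescript
type Mode = "eco" | "standard" | "advisor" | "full" | "custom";

// Model per mode, copied from the cheat sheet above.
// "custom" has no fixed model, so it is excluded from the map.
const MODE_MODELS: Record<Exclude<Mode, "custom">, string> = {
  eco: "claude-haiku-4-5",
  standard: "claude-sonnet-4-6",
  advisor: "claude-sonnet-4-6",
  full: "claude-opus-4-7",
};

// Hypothetical helper: returns the model for a mode, or the caller-supplied
// override when the mode is "custom".
function modelForMode(mode: Mode, customModel?: string): string {
  if (mode === "custom") {
    if (!customModel) throw new Error("custom mode needs a provider override");
    return customModel;
  }
  return MODE_MODELS[mode];
}
```

Note that `standard` and `advisor` map to the same model, so switching between them changes the mode but not the model.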

## Rollback

Switch the mode back in `/automation/settings` (Agents tab → Edit the agent → Mode dropdown). The change takes effect on the next run; no cache flush is needed.