Observability: What Matters Beyond Tokens

Technology & Software • ~8–9 min read • Updated Feb 11, 2025 • By OneMind Strata Team

Context

Token spend is a vanity metric if your assistant can’t answer, blows its latency budget, or drifts. Effective observability measures answerability, time-to-decision, and drift—and ties them to SLOs with named owners. Spend is the fourth column, not the first.

Core Framework

  1. Define SLOs that reflect outcomes
    • Answerability SLO: % of tasks answered to rubric ≥ threshold (from goldens + spot checks).
    • Latency Budget: p95 end-to-end ≤ X ms per surface (inline/panel/agent).
    • Override Burden: % of actions requiring confirm/override ≤ Y%.
    • Drift Watch: embedding centroid shift, routing mix, refusal precision/recall within bands.
  2. Capture the right traces
    • Inputs, retrieval bundle (doc IDs + trust), prompt pack version, model version, tool calls (args & outcomes), guardrail decisions.
    • Link every trace to a task_id, surface, session_id, and tenant hash.
  3. Route-level visibility
    • Two-model routing (cheap → smart, with human escalation) with per-route SLOs and fallbacks.
    • Cache hit/miss, re-ranker on/off, and tool count per task—so you can cost the path, not just the prompt.
  4. Scoreboards that travel
    • Red/green status by task for answerability, p95 latency, override rate, and spend per accepted action.
    • Weekly “what changed” diff: model/prompt/vendor and their impact on SLOs.
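
As a concrete starting point, the trace and route fields from steps 2 and 3 above can be sketched as a single record type. This is a minimal sketch: field names like `prompt_pack_version` and `tenant_hash` come from the text, but the exact types and the `accepted` flag are assumptions about your logging pipeline, not a fixed schema.

```python
from dataclasses import dataclass, field

# Illustrative trace record; field names and types are assumptions,
# not a standard schema.
@dataclass
class Trace:
    task_id: str
    surface: str                 # "inline" | "panel" | "agent"
    session_id: str
    tenant_hash: str
    prompt_pack_version: str
    model_version: str
    route: str                   # "cheap" | "smart" | "human"
    retrieval_ids: list = field(default_factory=list)   # doc IDs + trust scores
    tool_calls: list = field(default_factory=list)      # (name, args, outcome)
    guardrail_events: list = field(default_factory=list)
    cache_hit: bool = False
    reranker_on: bool = False
    latency_ms: float = 0.0
    accepted: bool = False       # downstream acceptance, feeds cost-per-accepted-action

t = Trace(task_id="t-123", surface="panel", session_id="s-9",
          tenant_hash="a1b2", prompt_pack_version="pp-7",
          model_version="m-2025-02", route="cheap")
```

Keeping the route, cache, and re-ranker flags on the same record as the prompt and model versions is what lets you cost the path rather than the prompt, and attribute regressions in the weekly diff.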

Recommended Actions

  1. Establish four SLOs (answerability, latency p95, override burden, drift band) with named owners and on-call paging.
  2. Standardize trace schema (task, surface, prompt_pack_version, retrieval_ids, guardrail events, tool calls).
  3. Add route tags to logs (route=cheap|smart|human) and break out budgets by route.
  4. Adopt “cost per accepted action” as your north-star efficiency metric.
  5. Drift monitors: canary queries + centroid distance + routing mix shift; alert when thresholds trip.
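
The drift monitors in action 5 need nothing beyond the standard library. The sketch below assumes embeddings arrive as equal-length lists of floats; the band thresholds are placeholders you would tune per deployment, not recommended defaults.

```python
import math
from collections import Counter

def centroid(vectors):
    """Mean embedding of a batch; vectors is a list of equal-length float lists."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def centroid_shift(baseline, current):
    """Euclidean distance between the baseline and current centroids."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(centroid(baseline), centroid(current))))

def routing_mix_shift(baseline_routes, current_routes):
    """Total variation distance between the two route distributions (0..1)."""
    def dist(routes):
        counts = Counter(routes)
        total = len(routes)
        return {k: v / total for k, v in counts.items()}
    p, q = dist(baseline_routes), dist(current_routes)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def drift_alert(shift, mix_shift, centroid_band=0.25, mix_band=0.10):
    """Trip when either monitor leaves its band; thresholds are illustrative."""
    return shift > centroid_band or mix_shift > mix_band
```

Run the same canary queries on a schedule, compare their embeddings against the baseline batch with `centroid_shift`, and compare this week's route tags against last week's with `routing_mix_shift`; a 90/10 cheap/smart mix sliding to 50/50 yields a mix shift of 0.4 and trips the alert.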

Common Pitfalls

  • Spend-only dashboards: miss quality and latency regressions until users churn.
  • No versioning in traces: can’t attribute changes to model/prompt/vendor updates.
  • One latency number: p95 alone hides tail pain and differences between surfaces.
  • Opaque retrieval: no provenance → can’t debug misses or audit outputs.

Quick Win Checklist

  • Ship an answerability SLO from your existing golden set (≥ threshold) and display it per task.
  • Track p95 latency by surface and by route; add a tail alarm (p99 > 2× budget).
  • Log override and override_reason and review the top 3 reasons weekly.
  • Compute cost per accepted action and use it in release gates.
  • Turn on drift alerts for centroid shift and routing mix changes beyond bands.
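
The tail alarm and cost-per-accepted-action items above are small enough to sketch directly. This assumes traces are dicts with `cost_usd` and `accepted` keys (illustrative names, adapt to your schema) and uses simple nearest-rank percentiles rather than a streaming estimator.

```python
import math

def cost_per_accepted_action(traces):
    """Total spend divided by accepted actions; infinite when nothing is accepted."""
    total = sum(t["cost_usd"] for t in traces)
    accepted = sum(1 for t in traces if t["accepted"])
    return total / accepted if accepted else float("inf")

def percentile(latencies, p):
    """Nearest-rank percentile over a batch of latency samples (ms)."""
    s = sorted(latencies)
    idx = max(0, math.ceil(p * len(s)) - 1)
    return s[idx]

def tail_alarm(latencies, budget_ms):
    """Trip when p99 exceeds twice the budget, per the checklist rule."""
    return percentile(latencies, 0.99) > 2 * budget_ms
```

Compute `cost_per_accepted_action` per route tag and you get the release-gate number and the per-route budget breakout from the recommended actions in one pass over the same traces.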

Closing

Observe what users feel: answerability, speed, stability. With outcome SLOs, route-aware traces, and simple drift monitors, you’ll prevent regressions, cut costs where it’s safe, and move faster with confidence.