Observability: What Matters Beyond Tokens
Technology & Software • ~8–9 min read • Updated Feb 11, 2025 • By OneMind Strata Team
Context
Token spend is a vanity metric if your assistant can’t answer, violates latency budgets, or drifts. Effective observability measures answerability, time-to-decision, and drift, and ties each of them to SLOs with named owners. Spend is the fourth column, not the first.
Core Framework
- Define SLOs that reflect outcomes
  - Answerability SLO: % of tasks answered to rubric ≥ threshold (from goldens + spot checks).
  - Latency Budget: p95 end-to-end ≤ X ms per surface (inline/panel/agent).
  - Override Burden: % of actions requiring confirm/override ≤ Y%.
  - Drift Watch: embedding centroid shift, routing mix, refusal precision/recall within bands.
- Capture the right traces
  - Inputs, retrieval bundle (doc IDs + trust), prompt pack version, model version, tool calls (args & outcomes), guardrail decisions.
  - Link every trace to a task_id, surface, session_id, and tenant hash.
- Route-level visibility
  - Two-model routing (cheap → smart → human) with per-route SLOs and fallbacks.
  - Cache hit/miss, re-ranker on/off, and tool count per task, so you can cost the path, not just the prompt.
- Scoreboards that travel
  - Red/green (R/G) status by task for answerability, p95 latency, override rate, and spend per accepted action.
  - Weekly “what changed” diff: model/prompt/vendor and their impact on SLOs.
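As a minimal sketch of the framework above, the trace record and per-task scoreboard might look like the following. The identifiers task_id, surface, session_id, and the version fields come from the list above; rubric_score, overridden, accepted, cost_usd, and the 0.8 threshold are hypothetical names and values chosen for illustration, not a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Trace:
    task_id: str
    surface: str               # "inline", "panel", or "agent"
    session_id: str
    tenant_hash: str
    route: str                 # "cheap", "smart", or "human"
    model_version: str
    prompt_pack_version: str
    retrieval_ids: list = field(default_factory=list)
    rubric_score: float = 0.0  # hypothetical 0..1 grade from goldens or spot checks
    overridden: bool = False   # hypothetical: user had to confirm or override
    accepted: bool = False     # hypothetical: user accepted the resulting action
    cost_usd: float = 0.0      # hypothetical: spend attributed to this trace


def scoreboard(traces, answer_threshold=0.8):
    """Per-task answerability, override rate, and spend per accepted action."""
    by_task = defaultdict(list)
    for t in traces:
        by_task[t.task_id].append(t)
    rows = {}
    for task_id, group in by_task.items():
        n = len(group)
        accepted = sum(t.accepted for t in group)
        spend = sum(t.cost_usd for t in group)
        rows[task_id] = {
            "answerability": sum(t.rubric_score >= answer_threshold for t in group) / n,
            "override_rate": sum(t.overridden for t in group) / n,
            # None when nothing was accepted, to avoid dividing by zero.
            "spend_per_accepted": spend / accepted if accepted else None,
        }
    return rows


def make(score, overridden, accepted, cost):
    # Helper for the usage example only; all values are made up.
    return Trace("summarize", "panel", "s1", "t-abc", "cheap", "m1", "p1",
                 ["doc-1"], score, overridden, accepted, cost)


board = scoreboard([
    make(0.90, False, True, 0.50),
    make(0.85, False, True, 0.25),
    make(0.50, True, False, 0.25),
    make(0.95, False, True, 0.50),
])
print(board["summarize"])
```

Keeping spend in the same row as answerability and override rate is what lets spend sit as the last column rather than the first.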
Recommended Actions
- Establish 4 SLOs (answerability, latency p95, override burden, drift band) with owners and pager duty.
- Standardize trace schema (task, surface, prompt_pack_version, retrieval_ids, guardrail events, tool calls).
- Add route tags to logs (route=cheap|smart|human) and break out budgets by route.
- Adopt “cost per accepted action” as your north-star efficiency metric.
- Drift monitors: canary queries + centroid distance + routing mix shift; alert when thresholds trip.
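The drift monitors in the last action can be sketched roughly as below. This assumes embeddings are plain lists of floats with non-zero centroids, uses cosine distance for the centroid shift and total variation distance for the routing mix shift (both are illustrative choices, not the only options), and the thresholds 0.05 and 0.10 are placeholder bands you would tune yourself.

```python
import math
from collections import Counter


def centroid(vectors):
    """Component-wise mean of equally sized embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def routing_mix_shift(baseline_routes, current_routes):
    """Total variation distance between route shares (0 = identical, 1 = disjoint)."""
    base, cur = Counter(baseline_routes), Counter(current_routes)
    nb, nc = len(baseline_routes), len(current_routes)
    routes = set(base) | set(cur)
    return 0.5 * sum(abs(base[r] / nb - cur[r] / nc) for r in routes)


def drift_alerts(baseline_vecs, current_vecs, baseline_routes, current_routes,
                 max_centroid_dist=0.05, max_mix_shift=0.10):
    """Return (alert_name, value) pairs for every band that is exceeded."""
    alerts = []
    dist = cosine_distance(centroid(baseline_vecs), centroid(current_vecs))
    if dist > max_centroid_dist:
        alerts.append(("centroid_shift", dist))
    shift = routing_mix_shift(baseline_routes, current_routes)
    if shift > max_mix_shift:
        alerts.append(("routing_mix_shift", shift))
    return alerts


# Usage with made-up data: canary embeddings rotate and traffic moves to "smart".
alerts = drift_alerts(
    baseline_vecs=[[1.0, 0.0], [1.0, 0.0]],
    current_vecs=[[0.0, 1.0], [0.0, 1.0]],
    baseline_routes=["cheap"] * 8 + ["smart"] * 2,
    current_routes=["cheap"] * 5 + ["smart"] * 5,
)
print(alerts)
```

Running the same check on a fixed set of canary queries each day keeps the baseline stable, so a tripped band points at the system rather than at a change in user traffic.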
Common Pitfalls
- Spend-only dashboards: miss quality and latency regressions until users churn.
- No versioning in traces: can’t attribute changes to model/prompt/vendor updates.
- One latency number: p95 alone hides tail pain and differences between surfaces.
- Opaque retrieval: no provenance → can’t debug misses or audit outputs.
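To illustrate why versioned traces matter, here is a rough sketch that groups answerability by (model_version, prompt_pack_version) and reports what moved since the previous period. Traces are assumed to be plain dicts, and rubric_score, the 0.8 threshold, and the 0.05 minimum delta are hypothetical values for illustration.

```python
from collections import defaultdict


def answerability_by_version(traces, threshold=0.8):
    """Share of traces meeting the rubric threshold, per (model, prompt pack) pair."""
    buckets = defaultdict(list)
    for t in traces:
        key = (t["model_version"], t["prompt_pack_version"])
        buckets[key].append(t["rubric_score"] >= threshold)
    return {key: sum(hits) / len(hits) for key, hits in buckets.items()}


def what_changed(previous, current, min_delta=0.05):
    """List (version_key, old, new) for new versions and for shifts of at least min_delta."""
    changes = []
    for key, score in sorted(current.items()):
        old = previous.get(key)
        if old is None:
            changes.append((key, None, score))   # version not seen last period
        elif abs(score - old) >= min_delta:
            changes.append((key, old, score))    # same version, score moved
    return changes


# Usage with made-up data: prompt pack "p2" shipped this week.
last_week = {("m1", "p1"): 0.9}
this_week = answerability_by_version([
    {"model_version": "m1", "prompt_pack_version": "p1", "rubric_score": 0.90},
    {"model_version": "m1", "prompt_pack_version": "p1", "rubric_score": 0.90},
    {"model_version": "m1", "prompt_pack_version": "p2", "rubric_score": 0.90},
    {"model_version": "m1", "prompt_pack_version": "p2", "rubric_score": 0.50},
])
for change in what_changed(last_week, this_week):
    print(change)
```

Without the version fields on each trace, the drop for "p2" would be averaged into a single number and could not be attributed to the prompt pack update.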
Quick Win Checklist
- Ship an answerability SLO from your existing golden set (≥ threshold) and display it per task.
- Track p95 latency by surface and by route; add a tail alarm (p99 > 2× budget).
- Log override and override_reason, and review the top 3 reasons weekly.
- Compute cost per accepted action and use it in release gates.
- Turn on drift alerts for centroid shift and routing mix changes beyond bands.
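The latency and cost items in this checklist could be wired up along these lines. This is a sketch under stated assumptions: percentiles use the nearest-rank method (one of several common definitions), samples arrive as (surface, route, latency_ms) tuples, and the budget and gate values are invented for the example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples, budgets_ms):
    """samples: (surface, route, latency_ms) tuples; budgets_ms: p95 budget per surface."""
    groups = defaultdict(list)
    for surface, route, latency_ms in samples:
        groups[(surface, route)].append(latency_ms)
    report = {}
    for (surface, route), values in groups.items():
        p95, p99 = percentile(values, 95), percentile(values, 99)
        budget = budgets_ms[surface]
        report[(surface, route)] = {
            "p95_ms": p95,
            "p99_ms": p99,
            "p95_ok": p95 <= budget,
            "tail_alarm": p99 > 2 * budget,  # the p99 > 2x budget rule from the checklist
        }
    return report


def passes_release_gate(total_cost, accepted_actions, max_cost_per_accepted):
    """Gate on cost per accepted action; fail closed when nothing was accepted."""
    if accepted_actions == 0:
        return False
    return total_cost / accepted_actions <= max_cost_per_accepted


# Usage with made-up data: p95 is within budget, but three slow calls trip the tail alarm.
samples = [("inline", "cheap", 100)] * 97 + [("inline", "cheap", 900)] * 3
report = latency_report(samples, budgets_ms={"inline": 200})
print(report[("inline", "cheap")])
print(passes_release_gate(total_cost=12.0, accepted_actions=40, max_cost_per_accepted=0.35))
```

In the example the p95 of 100 ms looks healthy against a 200 ms budget while the p99 of 900 ms does not, which is the tail pain a single latency number hides.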
Closing
Observe what users feel: answerability, speed, stability. With outcome SLOs, route-aware traces, and simple drift monitors, you’ll prevent regressions, cut costs where it’s safe, and move faster with confidence.