Observability: What Matters Beyond Tokens

Technology & Software • ~8–9 min read • Updated Feb 11, 2025 • By OneMind Strata Team

Context

Token spend is a vanity metric if your assistant can’t answer, blows its latency budget, or drifts. Effective observability measures answerability, time-to-decision, and drift—and ties them to SLOs with named owners. Spend is the fourth column, not the first.

Core Framework

  1. Define SLOs that reflect outcomes
    • Answerability SLO: % of tasks answered to rubric ≥ threshold (from goldens + spot checks).
    • Latency Budget: p95 end-to-end ≤ X ms per surface (inline/panel/agent).
    • Override Burden: % of actions requiring confirm/override ≤ Y%.
    • Drift Watch: embedding centroid shift, routing mix, refusal precision/recall within bands.
  2. Capture the right traces
    • Inputs, retrieval bundle (doc IDs + trust), prompt pack version, model version, tool calls (args & outcomes), guardrail decisions.
    • Link every trace to a task_id, surface, session_id, and tenant hash.
  3. Route-level visibility
    • Two-model routing (cheap → smart, with human escalation) with per-route SLOs and fallbacks.
    • Cache hit/miss, re-ranker on/off, and tool count per task—so you can cost the path, not just the prompt.
  4. Scoreboards that travel
    • Red/green status by task for answerability, p95 latency, override rate, and spend per accepted action.
    • Weekly “what changed” diff: model/prompt/vendor and their impact on SLOs.
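
As a concrete starting point, the trace and route fields from steps 2 and 3 above can be sketched as a single record type. This is a minimal sketch: field names like `prompt_pack_version` and `tenant_hash` come from the text, but the exact types and the `accepted` flag are assumptions about your logging pipeline, not a fixed schema.

```python
from dataclasses import dataclass, field

# Illustrative trace record; field names and types are assumptions,
# not a standard schema.
@dataclass
class Trace:
    task_id: str
    surface: str                 # "inline" | "panel" | "agent"
    session_id: str
    tenant_hash: str
    prompt_pack_version: str
    model_version: str
    route: str                   # "cheap" | "smart" | "human"
    retrieval_ids: list = field(default_factory=list)   # doc IDs + trust scores
    tool_calls: list = field(default_factory=list)      # (name, args, outcome)
    guardrail_events: list = field(default_factory=list)
    cache_hit: bool = False
    reranker_on: bool = False
    latency_ms: float = 0.0
    accepted: bool = False       # downstream acceptance, feeds cost-per-accepted-action

t = Trace(task_id="t-123", surface="panel", session_id="s-9",
          tenant_hash="a1b2", prompt_pack_version="pp-7",
          model_version="m-2025-02", route="cheap")
```

Keeping the route, cache, and re-ranker flags on the same record as the prompt and model versions is what lets you cost the path rather than the prompt, and attribute regressions in the weekly diff.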

Recommended Actions

  1. Establish four SLOs (answerability, latency p95, override burden, drift band) with named owners and on-call paging.
  2. Standardize trace schema (task, surface, prompt_pack_version, retrieval_ids, guardrail events, tool calls).
  3. Add route tags to logs (route=cheap|smart|human) and break out budgets by route.
  4. Adopt “cost per accepted action” as your north-star efficiency metric.
  5. Drift monitors: canary queries + centroid distance + routing mix shift; alert when thresholds trip.
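
The drift monitors in action 5 need nothing beyond the standard library. The sketch below assumes embeddings arrive as equal-length lists of floats; the band thresholds are placeholders you would tune per deployment, not recommended defaults.

```python
import math
from collections import Counter

def centroid(vectors):
    """Mean embedding of a batch; vectors is a list of equal-length float lists."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def centroid_shift(baseline, current):
    """Euclidean distance between the baseline and current centroids."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(centroid(baseline), centroid(current))))

def routing_mix_shift(baseline_routes, current_routes):
    """Total variation distance between the two route distributions (0..1)."""
    def dist(routes):
        counts = Counter(routes)
        total = len(routes)
        return {k: v / total for k, v in counts.items()}
    p, q = dist(baseline_routes), dist(current_routes)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def drift_alert(shift, mix_shift, centroid_band=0.25, mix_band=0.10):
    """Trip when either monitor leaves its band; thresholds are illustrative."""
    return shift > centroid_band or mix_shift > mix_band
```

Run the same canary queries on a schedule, compare their embeddings against the baseline batch with `centroid_shift`, and compare this week's route tags against last week's with `routing_mix_shift`; a 90/10 cheap/smart mix sliding to 50/50 yields a mix shift of 0.4 and trips the alert.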

Common Pitfalls

  • Spend-only dashboards: miss quality and latency regressions until users churn.
  • No versioning in traces: can’t attribute changes to model/prompt/vendor updates.
  • One latency number: p95 alone hides tail pain and differences between surfaces.
  • Opaque retrieval: no provenance → can’t debug misses or audit outputs.

Quick Win Checklist

  • Ship an answerability SLO from your existing golden set (≥ threshold) and display it per task.
  • Track p95 latency by surface and by route; add a tail alarm (p99 > 2× budget).
  • Log override and override_reason and review the top 3 reasons weekly.
  • Compute cost per accepted action and use it in release gates.
  • Turn on drift alerts for centroid shift and routing mix changes beyond bands.
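
The tail alarm and cost-per-accepted-action items above are small enough to sketch directly. This assumes traces are dicts with `cost_usd` and `accepted` keys (illustrative names, adapt to your schema) and uses simple nearest-rank percentiles rather than a streaming estimator.

```python
import math

def cost_per_accepted_action(traces):
    """Total spend divided by accepted actions; infinite when nothing is accepted."""
    total = sum(t["cost_usd"] for t in traces)
    accepted = sum(1 for t in traces if t["accepted"])
    return total / accepted if accepted else float("inf")

def percentile(latencies, p):
    """Nearest-rank percentile over a batch of latency samples (ms)."""
    s = sorted(latencies)
    idx = max(0, math.ceil(p * len(s)) - 1)
    return s[idx]

def tail_alarm(latencies, budget_ms):
    """Trip when p99 exceeds twice the budget, per the checklist rule."""
    return percentile(latencies, 0.99) > 2 * budget_ms
```

Compute `cost_per_accepted_action` per route tag and you get the release-gate number and the per-route budget breakout from the recommended actions in one pass over the same traces.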

Closing

Observe what users feel: answerability, speed, stability. With outcome SLOs, route-aware traces, and simple drift monitors, you’ll prevent regressions, cut costs where it’s safe, and move faster with confidence.