Micro-telemetry: What to Log for Learning
Retail & Consumer • ~8–9 min read • Updated Jun 19, 2025 • By OneMind Strata Team
Context
Dashboards full of token counts won’t tell you why people trust—or abandon—your AI. The signals that move quality are small and behavioral: edits, reverts, abandonments, overrides, dwell-time. Instrument those well and you’ll know what to fix this week, not next quarter.
What to Log (the short list)
- surface_used (inline | panel | slash | agent), task_id, session_id
- suggestion_shown → applied | edited | reverted | abandoned
- edit_distance (0–1 normalized token/Levenshtein diff; a computation sketch follows this list)
- dwell_ms (show → action), time_to_first_action_ms
- override (boolean) + override_reason (enum)
- guardrail_event (refusal | unsafe_output | policy_violation) + classifier scores
- latency_ms, cost_usd, model_version, prompt_pack_version
- retrieval_ids (top-k IDs/facets—non-PII)
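Of these, edit_distance is the field teams most often compute inconsistently. A minimal sketch of the 0–1 normalization, assuming token-level Levenshtein with whitespace tokenization (the function name and tokenizer are illustrative, swap in your own):

def normalized_edit_distance(suggested: str, final: str) -> float:
    """Token-level Levenshtein distance, normalized to 0-1.

    0.0 = applied verbatim, 1.0 = fully rewritten. Whitespace
    tokenization is a placeholder; use your own tokenizer.
    """
    a, b = suggested.split(), final.split()
    if not a and not b:
        return 0.0
    # Classic dynamic-programming Levenshtein over tokens.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))

# Example: one token substituted out of four
# normalized_edit_distance("refund within 30 days", "refund within 14 days") -> 0.25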
Minimal Event Model
{
  "event_name": "suggestion_applied",
  "ts": "2025-06-19T14:22:05Z",
  "user_id_hash": "u_7f3c...",
  "tenant_id_hash": "t_12ab...",
  "session_id": "s_91c2...",
  "task_id": "order_refund_summary",
  "surface": "panel",
  "model_version": "omx-2025-06-01",
  "prompt_pack_version": "pp-14",
  "latency_ms": 412,
  "cost_usd": 0.0041,
  "dwell_ms": 5600,
  "edit_distance": 0.18,
  "override": false,
  "override_reason": null,
  "retrieval_ids": ["kb:returns#policy-2024", "kb:sop#refund-steps"],
  "guardrail_event": null
}
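A typed version of the same record keeps producers and the warehouse schema in agreement. A minimal sketch (the class name and Python types are assumptions; field names come from the JSON above):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SuggestionEvent:
    # Identity & context (hashed IDs only, never raw PII)
    event_name: str              # e.g. "suggestion_applied"
    ts: str                      # ISO-8601 UTC timestamp
    user_id_hash: str
    tenant_id_hash: str
    session_id: str
    task_id: str
    surface: str                 # inline | panel | slash | agent
    # Versioning for attribution
    model_version: str
    prompt_pack_version: str
    # Behavior, cost, safety
    latency_ms: int
    cost_usd: float
    dwell_ms: Optional[int] = None
    edit_distance: Optional[float] = None
    override: bool = False
    override_reason: Optional[str] = None
    retrieval_ids: list[str] = field(default_factory=list)
    guardrail_event: Optional[str] = None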
Derived Metrics that Matter
- Adoption rate: applied / shown (the ratios in this list are computed in the sketch after it)
- Edit distance p50/p90: lower is better
- Revert rate: reverted / applied
- Abandonment: abandoned / shown
- Time-to-decision: p50/p90 of dwell_ms
- Override rate and top reasons
- Guardrail precision/recall (incident-labeled)
- Cost-per-accepted-action & latency budget hit-rate
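These are simple counts and percentiles over the raw events. A minimal sketch, assuming events arrive as a list of dicts shaped like the event model above (the lifecycle values and field names come from that model; the helper itself is hypothetical):

from statistics import median, quantiles

def derived_metrics(events: list[dict]) -> dict:
    """Core ratios and percentiles from raw suggestion events."""
    def count(stage: str) -> int:
        return sum(1 for e in events if e["event_name"] == f"suggestion_{stage}")

    def p50_p90(values: list[float]) -> tuple:
        values = sorted(values)
        if not values:
            return None, None
        if len(values) == 1:
            return values[0], values[0]
        # quantiles(n=10) returns decile cut points; the last one is p90.
        return median(values), quantiles(values, n=10)[-1]

    shown, applied = count("shown"), count("applied")
    reverted, abandoned = count("reverted"), count("abandoned")
    edit_p50, edit_p90 = p50_p90([e["edit_distance"] for e in events
                                  if e.get("edit_distance") is not None])
    dwell_p50, dwell_p90 = p50_p90([e["dwell_ms"] for e in events
                                    if e.get("dwell_ms") is not None])
    accepted_cost = sum(e.get("cost_usd", 0.0) for e in events
                        if e["event_name"] == "suggestion_applied")

    def ratio(num, den):
        return round(num / den, 3) if den else None

    return {
        "adoption_rate": ratio(applied, shown),
        "revert_rate": ratio(reverted, applied),
        "abandonment_rate": ratio(abandoned, shown),
        "edit_distance_p50": edit_p50, "edit_distance_p90": edit_p90,
        "dwell_ms_p50": dwell_p50, "dwell_ms_p90": dwell_p90,
        "cost_per_accepted_action": ratio(accepted_cost, applied),
    }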
Dashboards that Travel Across Teams
- By surface: adoption, edit distance, abandonment (inline vs panel vs slash vs agent)
- By task: top 10 tasks by accepted actions and revert rate
- Quality × Cost: acceptance vs cost
- Guardrail view: refusals over time, FP/FN estimates, override heatmap
- Experiment lane: A/B of prompt_pack_version vs acceptance & edit distance
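Most of these views reduce to a group-by over the same event stream. A minimal sketch, assuming the events are loaded into a pandas DataFrame (pandas is an assumption; any warehouse SQL does the same job):

import pandas as pd

def adoption_by(df: pd.DataFrame, dimension: str) -> pd.DataFrame:
    """Adoption, abandonment, and median edit distance per slice.

    `dimension` is any event column, e.g. "surface", "task_id",
    or "prompt_pack_version" for the experiment lane.
    """
    shown = df[df.event_name == "suggestion_shown"].groupby(dimension).size()
    applied = df[df.event_name == "suggestion_applied"].groupby(dimension).size()
    abandoned = df[df.event_name == "suggestion_abandoned"].groupby(dimension).size()
    edit_p50 = (df[df.event_name == "suggestion_applied"]
                .groupby(dimension)["edit_distance"].median())
    return pd.DataFrame({
        "adoption_rate": (applied / shown).round(3),
        "abandonment_rate": (abandoned / shown).round(3),
        "edit_distance_p50": edit_p50.round(3),
    })

# Example: compare prompt packs in the experiment lane.
# adoption_by(events_df, "prompt_pack_version")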
Privacy & Governance (non-negotiables)
- Hash or pseudonymize IDs; avoid raw PII in events (see the hashing sketch after this list).
- Strip payloads; log structure, not full prompts/outputs.
- Retention: 90 days raw → 12 months aggregates; legal holds via tag.
- Role-based access: product vs risk vs support.
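Hashing IDs is a one-liner if you key it per tenant. A minimal sketch, assuming HMAC-SHA256 with a tenant-scoped secret pulled from your key store (the function name and truncation length are illustrative):

import hashlib
import hmac

def pseudonymize(raw_id: str, tenant_secret: bytes, prefix: str = "u_") -> str:
    """Deterministic, non-reversible ID for event payloads.

    HMAC-SHA256 keyed with a per-tenant secret: the same user hashes
    consistently within a tenant, but cannot be joined across tenants
    or reversed without the key. Truncated for readability.
    """
    digest = hmac.new(tenant_secret, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:16]}"

# Example (the secret comes from your KMS, never from source code):
# pseudonymize("customer-4812", b"tenant-12ab-secret")  -> "u_" + 16 hex chars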
Instrument in a Week (practical steps)
- SDK wrapper: one function per event; auto-attach session/model/prompt_pack/surface (sketch after this list).
- Event IDs & sampling: dedupe and throttle client-side.
- Warehouse tables: events_raw, events_daily, derived_metrics.
- Golden questions: daily replay to watch acceptance/latency drift.
- Governance sync: monthly guardrail metrics + top overrides.
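For the SDK wrapper, one thin function usually suffices. A minimal sketch, assuming a small context object carries the session/model/prompt-pack/surface fields and emit() stands in for whatever transport you already run (all names here are illustrative):

import random
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class TelemetryContext:
    """Attached to every event so nobody forgets the version fields."""
    session_id: str
    surface: str                 # inline | panel | slash | agent
    model_version: str
    prompt_pack_version: str
    user_id_hash: str
    tenant_id_hash: str

_seen: set[str] = set()          # naive client-side dedupe; cap/expire in production

def emit(event: dict) -> None:
    """Placeholder transport: swap for your queue, HTTP endpoint, or OTel exporter."""
    print(event)

def log_event(ctx: TelemetryContext, event_name: str, task_id: str,
              sample_rate: float = 1.0, **fields) -> None:
    """One call per lifecycle event; context fields auto-attach."""
    event_id = fields.pop("event_id", str(uuid.uuid4()))
    if event_id in _seen or random.random() > sample_rate:
        return                   # dedupe retries, throttle client-side
    _seen.add(event_id)
    emit({
        "event_id": event_id,
        "event_name": event_name,
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "task_id": task_id,
        **asdict(ctx),
        **fields,                # dwell_ms, edit_distance, latency_ms, ...
    })

# Usage:
# log_event(ctx, "suggestion_applied", "order_refund_summary",
#           dwell_ms=5600, edit_distance=0.18, latency_ms=412, cost_usd=0.0041)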
Common Pitfalls
- Logging everything, learning nothing: too many fields; no derived metrics.
- Raw text hoarding: increases risk; prefer IDs/facets/hashes.
- No versioning: can’t attribute change without model/prompt versions.
- Surface blindness: mixing surfaces hides where UX truly works.
Quick Win Checklist
- Ship shown/applied/edited/reverted/abandoned with edit_distance and dwell_ms.
- Add model_version + prompt_pack_version to every event.
- Publish a one-pager of derived metrics & definitions.
- Stand up a surface-by-task dashboard; review weekly.
Closing
Log less, learn more. A small, purpose-built telemetry set will show which surfaces to double down on, where to graduate to background agents, and which prompt packs to retire. Make learning a weekly habit.