Cost Postmortems That Actually Change Things
Public Sector • ~7–8 min read • Updated Oct 27, 2024
Context
“Costs are too high” helps no one. Effective teams treat cost spikes like incidents: gather traces, segment traffic, isolate the few flows that drive most spend, and ship specific fixes. The goal isn’t cheaper at any price; it’s maintaining quality while shrinking waste.
Core Framework: The Cost Postmortem
- Define the unit of value: Conversation, task, workflow, or document. Report cost per unit, not per token.
- Trace the path: Prompt → retrieval → model calls (route mix) → tools → retries. Capture inputs, params, and token counts.
- Segment traffic: By intent (Q&A vs. drafting), customer tier, language, channel, and length. A Pareto view often shows a small share of flows (roughly 20%) driving most of the spend (roughly 80%).
- Quality guardrails: Baseline answerability, safety, and latency; every optimization must keep all three within their thresholds.
- Fix-by-lever catalog: Caching, prompt trims, route mix, tool gating, batching, parameter tuning, response caps.
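The framework above can be sketched as a small report over logged traces: cost per unit of value, grouped by flow and sorted so the Pareto head is visible. This is a minimal sketch, not a reference implementation. The `Trace` fields, the `flow` label, and the per-token prices are all hypothetical placeholders; substitute whatever your tracing and billing data actually provide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    unit_id: str        # one conversation, task, workflow, or document
    flow: str           # segment label such as intent or channel (hypothetical)
    input_tokens: int
    output_tokens: int
    tool_cost: float    # non-model spend attributed to this call, in dollars


# Hypothetical prices in dollars per 1,000 tokens; replace with real rates.
PRICE_IN = 0.001
PRICE_OUT = 0.002


def trace_cost(t: Trace) -> float:
    """Dollar cost of one traced call: model tokens plus tool spend."""
    return (t.input_tokens * PRICE_IN + t.output_tokens * PRICE_OUT) / 1000 + t.tool_cost


def flow_report(traces):
    """Per-flow total cost, cost per unit, and cumulative share of spend."""
    total = defaultdict(float)
    units = defaultdict(set)
    for t in traces:
        total[t.flow] += trace_cost(t)
        units[t.flow].add(t.unit_id)
    grand_total = sum(total.values())
    rows, running = [], 0.0
    # Most expensive flows first, so the head of the list is the Pareto head.
    for flow, cost in sorted(total.items(), key=lambda kv: kv[1], reverse=True):
        running += cost
        rows.append({
            "flow": flow,
            "total_cost": cost,
            "cost_per_unit": cost / len(units[flow]),
            "cumulative_share": running / grand_total if grand_total else 0.0,
        })
    return rows
```

Reading `cumulative_share` from the top shows how few flows account for most of the spend, and those flows are the natural candidates for the lever catalog.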
Recommended Actions
- Instrument cost traces: Log input/output tokens, route, model, latency, cache hit, retries per unit.
- Ship a Pareto board: Top 10 flows by total cost and cost-per-unit with trend lines.
- Apply low-risk levers first:
  - Semantic/response caching for frequent prompts and stable answers.
  - Prompt trims: remove boilerplate and redundant citations; cap max_tokens.
  - Route mix: two-model pattern (cheap first, smart on escalation/uncertainty).
- Guard with evals: Run golden sets after each change; block merges if answerability or safety falls.
- Publish a change-log: Owner, lever, expected savings, measured impact after 7 days.
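As an illustration of the two-model pattern named in the list above, the sketch below tries a cheap model first and escalates only when that model reports low confidence. It assumes each model is a callable returning an answer and a confidence score between 0 and 1. That interface, the `route` function, and the 0.7 threshold are hypothetical; in practice the uncertainty signal might come from a classifier, log-probabilities, or a self-check step.

```python
from typing import Callable, Dict, Tuple

# Assumed interface: a model takes a prompt and returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(prompt: str, cheap: Model, smart: Model, threshold: float = 0.7) -> Dict:
    """Try the cheap model first; escalate when its confidence is below threshold."""
    answer, confidence = cheap(prompt)
    if confidence >= threshold:
        return {"answer": answer, "route": "cheap", "escalated": False}
    # Escalation pays for both calls, so the escalation rate belongs on the dashboard.
    answer, _ = smart(prompt)
    return {"answer": answer, "route": "smart", "escalated": True}
```

Logging the returned `route` and `escalated` fields per unit makes the route mix measurable, which in turn lets the golden-set evals confirm that the threshold is not trading quality for savings.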
Common Pitfalls
- Token absolutism: Optimizing tokens while cost-per-resolution stays flat.
- Cache and forget: No TTLs or invalidation → stale answers and trust erosion.
- Over-routing: Everything through the small model; quality tanks, overrides spike.
- Free tools aren’t free: Tool calls with large payloads can dwarf model costs.
- No rollback: Optimization shipped without flags or metrics → hard to revert.
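To make the "cache and forget" pitfall concrete, here is a minimal sketch of a response cache in which every entry carries a TTL and can be invalidated explicitly. It is an exact-match, in-memory cache, not a semantic one, and the `TTLCache` class and its method names are hypothetical rather than taken from any particular library.

```python
import time


class TTLCache:
    """Response cache with per-entry expiry and explicit invalidation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so expiry can be tested
        self._store = {}             # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            self.hits += 1
            return entry[1]
        self._store.pop(key, None)   # drop the expired entry, if any
        self.misses += 1
        return None

    def put(self, key, value, ttl_seconds):
        self._store[key] = (self._clock() + ttl_seconds, value)

    def invalidate(self, key):
        """Remove an entry when its source content changes."""
        self._store.pop(key, None)

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Passing `ttl_seconds` per entry lets higher-risk answers expire sooner, and `hit_ratio` gives the dashboard a number to watch alongside quality metrics.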
Quick Win Checklist
- Add route_hit_ratio, cache_hit_ratio, and cost_per_unit to dashboards.
- Cap max_tokens and temperature on low-stakes flows; freeze prompt boilerplate size.
- Enable response caching for top 20 prompts; set TTL by risk tier.
- Adopt the two-model pattern for drafts/summaries with uncertainty-based escalation.
- Feature-flag each change; compare 7-day pre/post on cost and quality.
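The pre/post comparison in the last item can be sketched as a function that takes two windows of per-unit records and reports whether the change saved money without a quality regression. The record shape (`cost`, `passed`), the `compare_windows` name, and the one-point tolerance on pass rate are assumptions for illustration; a real comparison would also need to account for traffic mix and sample size.

```python
def compare_windows(pre, post, max_quality_drop=0.01):
    """Compare two windows of per-unit records and flag a quality regression.

    Each record is a dict with 'cost' (dollars per unit) and
    'passed' (bool, whether the unit met the quality eval).
    """
    def summarize(records):
        n = len(records)
        if n == 0:
            raise ValueError("window has no records")
        return {
            "cost_per_unit": sum(r["cost"] for r in records) / n,
            "pass_rate": sum(1 for r in records if r["passed"]) / n,
        }

    before, after = summarize(pre), summarize(post)
    if before["cost_per_unit"] == 0:
        saving = 0.0  # nothing to save against a zero-cost baseline
    else:
        saving = 1 - after["cost_per_unit"] / before["cost_per_unit"]
    quality_drop = before["pass_rate"] - after["pass_rate"]
    return {
        "before": before,
        "after": after,
        "relative_saving": saving,
        # Keep the flag on only if cost fell and quality held within tolerance.
        "keep_change": saving > 0 and quality_drop <= max_quality_drop,
    }
```

Feeding it the seven days before and after a flag flip yields the measured impact for the change-log, and a false `keep_change` is the signal to turn the flag back off.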
Closing
Cost postmortems should end in fewer expensive calls, fewer retries, and the same (or better) quality. With good traces, clear segments, and guarded levers, savings stick and teams keep shipping.