The “Two-Model” Pattern for Cost & Reliability

Cross-Industry • ~8–9 min read • Updated Jan 11, 2025 • By OneMind Strata Team

Context

Running everything through your best model is easy but expensive. Production teams do better with a simple play: route most traffic to a cheap model and escalate only when signals indicate the request is hard or high-stakes. Done well, this cuts spend, protects latency budgets, and improves reliability without sacrificing outcomes.

Core Framework

  1. Three routes, not two: cheap → smart → human.
    • Cheap: default, fast responses for clear/easy cases.
    • Smart: bigger model or tool-augmented path for ambiguous/critical cases.
    • Human: confirm/override for high-risk or policy-triggered cases.
  2. Escalation signals: Decide what triggers handoff:
    • Confidence: self-score, logprob bands, or evaluator micro-model.
    • Policy: safety/PII classifiers, tenant/role constraints.
    • Task: category-specific rules (financial advice, medical claims, critical ops).
    • Structure: required fields missing; format not met.
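  3. Path SLOs: Assign an owner and budgets to each route: p95 latency, answerability (share of responses that pass the rubric), override burden (how often a human has to confirm or correct), and cost per accepted action.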
  4. Cache & reuse first: Put caching, rerankers, and snippets before escalation to avoid unnecessary “smart” calls.
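
One way to make the escalation signals above concrete is a single function that checks them in a fixed order and returns the first reason that fires. This is a minimal sketch, not a prescribed implementation: the `Signals` fields, the `CONFIDENCE_FLOOR` value, and the `HIGH_STAKES_TASKS` set are all hypothetical and would need tuning against your own golden set.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold and categories; tune against your own golden set.
CONFIDENCE_FLOOR = 0.7
HIGH_STAKES_TASKS = {"financial_advice", "medical_claims", "critical_ops"}


@dataclass
class Signals:
    confidence: float          # self-score or evaluator score, scaled to 0..1
    policy_flagged: bool       # safety/PII classifier or tenant/role rule fired
    task_category: str         # e.g. "faq", "financial_advice"
    missing_fields: List[str]  # required output fields that were absent


def escalation_reason(s: Signals) -> Optional[str]:
    """Return the first signal that triggers escalation, or None to accept."""
    if s.policy_flagged:
        return "policy"
    if s.task_category in HIGH_STAKES_TASKS:
        return "task"
    if s.missing_fields:
        return "structure"
    if s.confidence < CONFIDENCE_FLOOR:
        return "confidence"
    return None
```

Returning a named reason rather than a bare boolean means the same value can be logged as `route_reason` on the trace.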

Recommended Actions

  1. Define route criteria: A single page listing escalation signals and thresholds (cheap→smart, smart→human).
  2. Implement a route tag: Log route=cheap|smart|human on every trace with route_reason and threshold values.
  3. Wire a verifier: Use a tiny evaluator model or heuristics to judge “answerability to rubric.”
  4. Set budgets: p95 latency per route, target escalation rate (e.g., ≤ 18%), and max cost-per-accepted-action.
  5. Freeze a fallback: For timeouts or model errors, return a rule-based template or last-known-good draft.
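
Once every trace carries a route tag, the budget metrics above reduce to a few lines of arithmetic. The sketch below assumes a hypothetical `Trace` record with `route`, `route_reason`, `cost_usd`, and `accepted` fields; your tracing system will have its own schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    route: str         # "cheap" | "smart" | "human"
    route_reason: str  # e.g. "default", "confidence", "policy"
    cost_usd: float    # total model/tool cost attributed to this request
    accepted: bool     # whether the final action was accepted


def escalation_rate(traces: List[Trace]) -> float:
    """Share of requests that left the cheap route."""
    if not traces:
        return 0.0
    escalated = sum(1 for t in traces if t.route != "cheap")
    return escalated / len(traces)


def cost_per_accepted_action(traces: List[Trace]) -> float:
    """Total spend (including rejected requests) divided by accepted actions."""
    accepted = sum(1 for t in traces if t.accepted)
    if accepted == 0:
        return float("inf")
    return sum(t.cost_usd for t in traces) / accepted
```

For example, four cheap traces and one smart trace give an escalation rate of 20%, which would exceed the ≤ 18% example target and flag the thresholds for review.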

Reference Architecture

  • Ingress: request + task_id + surface + context.
  • Cheap path: small model → validator (format/policy) → accept if pass.
  • Escalator: if validation fails or confidence is low, send to the smart path (bigger model, retrieval + tools).
  • Human gate: if confidence is still low or a policy is triggered, open confirm/override with rationale & evidence.
  • Telemetry: traces include route, thresholds, model/prompt versions, guardrail events, latency, cost.
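
The flow above can be sketched as one routing function. This is an illustrative outline under stated assumptions: the models and validator are passed in as plain callables, each model returns a `(draft, confidence)` pair, and the `0.7` threshold and the `route_reason` strings are placeholders rather than recommended values.

```python
from typing import Callable, Dict, Tuple

Model = Callable[[str], Tuple[str, float]]  # returns (draft, confidence in 0..1)
Validator = Callable[[str], bool]           # format/policy check on a draft


def route_request(request: str, cheap: Model, smart: Model,
                  validate: Validator, threshold: float = 0.7) -> Dict[str, object]:
    """Try the cheap path, escalate to smart, then hand off to a human."""
    draft, confidence = cheap(request)
    valid = validate(draft)
    if valid and confidence >= threshold:
        return _result("cheap", "accepted", draft, confidence)

    # Record why the cheap path was rejected before escalating.
    reason = "low_confidence" if valid else "validator_failed"
    draft, confidence = smart(request)
    if validate(draft) and confidence >= threshold:
        return _result("smart", reason, draft, confidence)

    # Still unresolved: open the human gate with the best draft as evidence.
    return _result("human", "unresolved_after_smart", draft, confidence)


def _result(route: str, route_reason: str, answer: str,
            confidence: float) -> Dict[str, object]:
    return {"route": route, "route_reason": route_reason,
            "answer": answer, "confidence": confidence}
```

The returned dictionary doubles as the start of a trace record; model and prompt versions, latency, and cost would be added by the caller.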

Common Pitfalls

  • Over-escalation: thresholds too conservative → costs spike; tune against golden sets.
  • Unowned budgets: no SLO owner per path → drift goes unnoticed.
  • Opaque routing: no route_reason → you can’t debug escalations or reduce them.
  • Missing fallback: timeouts on smart path stall users; always return a safe alternative.
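
To illustrate the missing-fallback pitfall, here is a minimal sketch of a timeout wrapper built on the standard library's `concurrent.futures`. The `call_with_fallback` helper, the `FALLBACK_TEXT` template, and the pool size are hypothetical. Note that a timed-out worker thread keeps running in the background; this sketch only stops the caller from waiting on it.

```python
import concurrent.futures
from typing import Callable, Tuple

# Hypothetical rule-based template returned when the smart path cannot answer.
FALLBACK_TEXT = "We could not complete this request automatically. A summary will follow."

# Shared pool so a slow call does not block the caller while it finishes.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn: Callable[[str], str], request: str, timeout_s: float,
                       fallback: str = FALLBACK_TEXT) -> Tuple[str, str]:
    """Return (answer, status), where status is "ok", "timeout", or "error"."""
    future = _executor.submit(fn, request)
    try:
        return future.result(timeout=timeout_s), "ok"
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; an already-running thread is not interrupted
        return fallback, "timeout"
    except Exception:
        return fallback, "error"
```

Logging the returned status alongside `route_reason` makes fallback frequency visible in the same traces used for escalation tuning.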

Quick Win Checklist

  • Add route and route_reason to traces; graph escalation rate daily.
  • Stand up a tiny evaluator (rubric 0–3) and escalate on scores ≤ 1.
  • Set p95 latency budgets per route and block releases that regress.
  • Cache successful responses for frequent tasks; measure cache hit rate.
  • Publish a one-pager: thresholds, owners, and rollback plan.
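
As a rough illustration of the latency item above, a release gate can compare each route's p95 against its budget and fail the build on any regression. The sketch uses the nearest-rank definition of p95; the function names and the budget figures in the usage note are assumptions, not values from this article.

```python
from typing import Dict, List


def p95(latencies_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def regressed_routes(latencies_by_route: Dict[str, List[float]],
                     budgets_ms: Dict[str, float]) -> List[str]:
    """Routes whose p95 exceeds their budget; an empty list means the gate passes."""
    return sorted(route for route, samples in latencies_by_route.items()
                  if p95(samples) > budgets_ms[route])
```

In CI, a non-empty result from `regressed_routes(samples, {"cheap": 300, "smart": 2500})` would block the release and name the offending routes.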

Closing

The two-model pattern is really a three-route operating model: cheap by default, smart when justified, human when needed. With clear signals, route-aware SLOs, and solid fallbacks, you’ll ship faster, spend less, and raise trust over time.