Incident Review Template for AI Failures
Resources & Utilities • ~7–8 min read • Updated May 7, 2025
Context
AI systems fail in ways classic software doesn’t: probabilistic outputs, distribution shifts, vendor changes, and hidden policy interactions. Postmortems must therefore go beyond root-cause analysis to cover model and data behavior, guardrail efficacy, and decision impact. This template is blameless, lightweight, and designed to produce clear control changes.
Core Template
- Header & Triage
  - Incident ID: YYYYMMDD-domain-sequence
  - Severity: S1–S4 (customer harm, regulatory exposure, financial impact, operational disruption)
  - Discovery: Monitoring alert, user report, internal review, audit
  - Timeframe: First observed → contained → resolved
- Context Snapshot
  - Use Case: Purpose, decision stakes, HITL gates
  - Model & Version: Base/finetuned model, prompt pack version, policies
  - Data Inputs: Retrieval sources, freshness, PII/PHI handling
- Impact Summary
  - Affected: Users, transactions, processes
  - Blast Radius: Systems, teams, customers
  - Measured Impact: Cost, latency, error rate, SLA breaches
- Failure Characterization
  - Error Type: Hallucination, retrieval miss, policy misfire, routing error, drift, jailbreaking, data leak
  - Repro Steps: Minimal prompt/input to reproduce
  - Signals: Logs, eval results, human edits/overrides, anomaly alerts
- Contributing Factors
  - Model: Temperature, context length, update cadence
  - Retrieval: Index coverage, chunking, ranking config
  - Guardrails: Filter policy gaps, prompt hardening, HITL placement
  - Ops: Caching, timeouts, dependency health, vendor change
- Remediations & Control Changes
  - Immediate Fix: Patch applied; owner; ETA
  - Follow-up Tasks: Tracked as tickets with SLOs
  - Control Map: Which guardrails / tests / monitors were added or tightened
- Verification & Close
  - Regression Tests: Added to golden/eval sets
  - Post-Fix Metrics: Error rate, latency, override rate vs. baseline
  - Decision: Close / watchlist / hold for next release
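For teams that capture incidents as structured records rather than free-form docs, the fields above translate naturally into a small schema. The sketch below is one possible encoding in Python; the names (`Severity`, `ErrorType`, `IncidentRecord`) and field choices are illustrative assumptions, not a prescribed schema — adapt them to your own form or ticketing system.

```python
# A minimal sketch of the template as a structured record.
# All names and field choices here are illustrative, not prescriptive.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class Severity(Enum):
    S1 = "S1"  # severe customer harm / regulatory exposure
    S2 = "S2"
    S3 = "S3"
    S4 = "S4"  # minor operational disruption

class ErrorType(Enum):
    HALLUCINATION = "hallucination"
    RETRIEVAL_MISS = "retrieval_miss"
    POLICY_MISFIRE = "policy_misfire"
    ROUTING_ERROR = "routing_error"
    DRIFT = "drift"
    JAILBREAK = "jailbreak"
    DATA_LEAK = "data_leak"

@dataclass
class Remediation:
    description: str
    owner: str              # named owner (avoids unowned actions)
    due: datetime           # ETA / SLO date

@dataclass
class IncidentRecord:
    incident_id: str                  # e.g. "20250507-billing-003" (YYYYMMDD-domain-sequence)
    severity: Severity
    discovery: str                    # monitoring alert, user report, internal review, audit
    first_observed: datetime
    contained: Optional[datetime] = None
    resolved: Optional[datetime] = None
    # Context snapshot
    use_case: str = ""
    model_version: str = ""           # base/finetuned model + prompt pack version
    data_inputs: str = ""             # retrieval sources, freshness, PII/PHI handling
    # Failure characterization
    error_type: Optional[ErrorType] = None
    repro_steps: str = ""             # minimal prompt/input to reproduce
    # Remediations & control changes
    remediations: list[Remediation] = field(default_factory=list)
    controls_added: list[str] = field(default_factory=list)      # guardrails / tests / monitors
    golden_tests_added: list[str] = field(default_factory=list)  # test IDs added to eval suites
```

Storing incidents this way makes the monthly governance review a query rather than a document hunt, and gives the taxonomy (severities, error types) a single source of truth.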
Recommended Actions
- Adopt Severity & Taxonomy: Standardize S1–S4 and error types across teams.
- Wire to Tooling: Create a simple form (or doc template) & link it to your ticketing system.
- Golden Set First: Every incident adds at least one test to eval/golden suites (see the regression-test sketch after this list).
- Control Registry: Maintain a living list of guardrails, tests, and monitors with owners and SLOs.
- Monthly Review: Summarize incidents, patterns, and control effectiveness for governance.
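"Golden Set First" is easiest to enforce when the minimal repro from each incident becomes a test. Below is a hedged sketch: `run_pipeline` is a stand-in for your real inference entry point, the case data is invented, and pytest is shown only as one possible harness — swap in whatever eval runner you already use.

```python
# Sketch: turning an incident's minimal repro into a golden/regression test.
# `run_pipeline`, the incident ID, and the case contents are all illustrative.
import pytest

def run_pipeline(prompt: str) -> dict:
    """Placeholder for the real model/retrieval pipeline under test."""
    raise NotImplementedError

# Golden case captured from a hypothetical incident's minimal repro.
GOLDEN_CASES = [
    {
        "incident_id": "20250507-billing-003",
        "prompt": "What is the refund policy for plan X?",  # minimal repro input
        "must_not_contain": ["guaranteed refund"],          # hallucinated claim from the incident
        "must_cite_source": True,
    },
]

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["incident_id"])
def test_incident_regression(case):
    result = run_pipeline(case["prompt"])
    answer = result.get("answer", "")
    for phrase in case["must_not_contain"]:
        assert phrase.lower() not in answer.lower(), (
            f"Regression of {case['incident_id']}: reproduced the original failure"
        )
    if case["must_cite_source"]:
        assert result.get("citations"), "Answer returned without a supporting citation"
```

Keeping the incident ID in the test name means a future failure points straight back to the original review, not just to a red build.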
Common Pitfalls
- Blaming People: Focus on systems & controls; assume good intent.
- One-Off Fixes: Patch without strengthening guardrails or tests.
- No Repro: Closing incidents without a minimal reproducible case.
- Unowned Actions: Remediations without named owners and dates.
Quick Win Checklist
- Publish the template (copyable doc or form) with example incidents.
- Define S1–S4 and 6–8 error types your org will use.
- Require one golden/eval test per incident before closing.
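One way to make the "one golden/eval test per incident" rule stick is a small closure gate wherever incidents get marked closed (ticket workflow, CI, or a bot). A minimal sketch, reusing the hypothetical `IncidentRecord` fields from the earlier example:

```python
# Sketch: a closure gate that refuses to close incidents missing the basics.
# Assumes the IncidentRecord sketch shown earlier; adapt field names to your tooling.

def closure_blockers(incident) -> list[str]:
    """Return reasons this incident cannot be closed yet (empty list = OK to close)."""
    blockers = []
    if not incident.golden_tests_added:
        blockers.append("no golden/eval test added for this incident")
    if not incident.repro_steps:
        blockers.append("no minimal reproducible case recorded")
    for r in incident.remediations:
        if not r.owner:
            blockers.append(f"remediation without an owner: {r.description!r}")
    return blockers

def can_close(incident) -> bool:
    return not closure_blockers(incident)
```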
Closing
Great AI teams turn incidents into leverage. A blameless, structured review process—connected to evals, guardrails, and monitoring—reduces repeat failures and steadily raises quality without slowing delivery.