A practical incident runbook for AI agent systems, covering common failure modes and response actions that reduce production impact.
- Most agent incidents are predictable: tool misuse, context drift, and weak guardrails.
- Build a failure taxonomy and link each class to detection and recovery playbooks.
- Track MTTR and recurrence to continuously harden your agent platform.

Agent systems do not fail in one way. They fail across planning, context, tool invocation, and execution boundaries. Without a clear runbook, teams lose time arguing about symptoms instead of restoring service.
This guide provides an operating model you can implement immediately.
Prerequisites
- Incident severity model (SEV1, SEV2, SEV3).
- On-call owner for agent platform.
- Baseline observability for prompts, tool calls, and outcomes.
- Rollback path for model and policy configuration.
Failure taxonomy
1) Intent misclassification
The agent chooses the wrong plan for a valid request.
Signals: - Wrong workflow selected. - High user correction rate. - Repeated retries without convergence.
2) Tool misuse
The agent invokes tools with invalid or risky arguments.
Signals: - API 4xx spikes. - Unexpected write operations. - Policy denials increasing.
3) Context drift
Relevant context is lost, stale, or contradictory.
Signals: - Incoherent multi-step responses. - Contradictions across turns. - Duplicate or circular actions.
4) Safety and policy bypass attempts
The system fails to block unsafe instructions reliably.
Signals: - Prompt injection test failures. - Sensitive endpoint invocation attempts. - Abnormal escalation patterns.
5) Cost and latency runaway
Token or tool costs grow faster than value delivered.
Signals: - Sudden token consumption increases. - Timeout rates rising. - Queue depth and tail latency increase.
Steps: incident runbook
Step 1: Detect and classify
- Identify failure class from telemetry.
- Assign severity and incident owner.
- Freeze non-essential config changes.
Step 2: Contain blast radius
- Disable high-risk tools temporarily.
- Apply stricter policy mode.
- Route traffic to fallback workflow.
Step 3: Recover service
- Roll back model/prompt/policy changes.
- Re-enable only validated capabilities.
- Confirm key user journeys are restored.
Step 4: Verify integrity
- Check data side effects for incorrect writes.
- Validate no policy breaches occurred.
- Capture incident timeline and evidence.
Step 5: Prevent recurrence
- Add regression tests for this failure.
- Update policy and prompt guardrails.
- Publish post-incident improvements.
Implementation plan
Day 1
- Create a one-page taxonomy and ownership map.
- Add correlation IDs for user request -> agent -> tool chain.
Week 1
- Build dashboards for each failure class.
- Implement one-click kill switch for high-risk tools.
Month 1
- Introduce automated chaos scenarios for agent workflows.
- Add recurrence metrics to release gates.
Troubleshooting
Problem: Alerts are noisy and not actionable
- Alert on failed outcomes, not raw error volume.
- Add class-specific thresholds.
- Suppress duplicate alerts by correlation ID.
Problem: Recovery is slow because rollback is unclear
- Version prompts, policies, and model routing separately.
- Keep last-known-good bundles ready.
- Practise rollback drills monthly.
Problem: Teams disagree on incident root cause
- Require timeline + evidence in post-incident review.
- Separate proximate trigger from systemic gap.
- Track corrective actions to closure.
Common mistakes
- Treating every incident as a model-quality problem.
- Shipping new features before stabilising incident classes.
- Missing ownership for tool-level failures.
Reliable agent operations come from disciplined response loops, not perfect models.
Comments
Post a Comment