Incident Response Agent

by @pitchinnate

On-call incident response automation. Triage, notify, run playbooks, and draft post-mortems automatically.

# AGENTS.md — Incident Response Agent

## Severity Classification
| Severity | Definition | Response Time | Escalation |
|----------|-----------|---------------|-----------|
| P0 | Complete outage, data loss risk | Immediate | CTO + on-call |
| P1 | Major feature unavailable, >20% of users affected | 15 min | Engineering lead |
| P2 | Degraded performance or minor feature unavailable | 1 hour | On-call engineer |
| P3 | Cosmetic or low-impact issue | Next business day | Ticket queue |
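
As a rough illustration, the table above could be encoded as a lookup plus a small classifier. This is only a sketch: the input flags (`complete_outage`, `users_affected_pct`, and so on) are hypothetical names, and it assumes that either condition listed in a row is enough to trigger that severity, which the table does not state explicitly.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response_time: Optional[timedelta]  # None stands for "next business day"
    escalation: str


# Values copied from the severity table.
POLICIES = {
    "P0": SeverityPolicy(timedelta(0), "CTO + on-call"),
    "P1": SeverityPolicy(timedelta(minutes=15), "Engineering lead"),
    "P2": SeverityPolicy(timedelta(hours=1), "On-call engineer"),
    "P3": SeverityPolicy(None, "Ticket queue"),
}


def classify(
    complete_outage: bool = False,
    data_loss_risk: bool = False,
    major_feature_down: bool = False,
    users_affected_pct: float = 0.0,
    degraded_or_minor_feature_down: bool = False,
) -> str:
    """Return the highest severity whose definition matches the inputs.

    Assumption: within a row, any single listed condition is sufficient.
    """
    if complete_outage or data_loss_risk:
        return "P0"
    if major_feature_down or users_affected_pct > 20:
        return "P1"
    if degraded_or_minor_feature_down:
        return "P2"
    return "P3"
```

For example, `POLICIES[classify(users_affected_pct=35)].escalation` yields `"Engineering lead"`.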

## Triage Steps
1. Confirm the alert is not a false positive (check monitoring dashboard)
2. Classify severity using the table above
3. Page the appropriate responders via PagerDuty
4. Create a war room (Slack channel: `#incident-YYYYMMDD-NN`)
5. For P0/P1 incidents, post an initial update to the status page within 5 minutes of classification
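
Two of the steps above are mechanical enough to sketch in code: naming the war room channel and computing the status page deadline. The helper names are hypothetical, and the sketch assumes that `NN` is a two-digit per-day sequence number and that the 5-minute clock starts when the incident is classified; neither is spelled out above.

```python
from datetime import datetime, timedelta
from typing import Optional


def war_room_channel(opened_at: datetime, sequence: int) -> str:
    """Build a channel name in the #incident-YYYYMMDD-NN format.

    Assumption: NN is a zero-padded counter of incidents opened that day.
    """
    return f"#incident-{opened_at:%Y%m%d}-{sequence:02d}"


def status_update_deadline(severity: str, classified_at: datetime) -> Optional[datetime]:
    """Return when the first status page update is due, or None if not required."""
    if severity in ("P0", "P1"):
        return classified_at + timedelta(minutes=5)
    return None  # P2/P3: no status page deadline is given in the steps
```

Actually paging responders and creating the channel would go through the PagerDuty and Slack APIs, which are deliberately left out here.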

## Playbook Execution
For known incident types, execute the matching runbook automatically:
- High DB CPU → run `EXPLAIN ANALYZE` on the top 5 slowest queries and report the results
- Memory leak → collect heap dump, restart affected pods
- Certificate expiry → trigger cert renewal workflow, notify infra team
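
One way to wire up this mapping is a plain dispatch table from incident type to ordered steps. The sketch below is illustrative only: the incident type keys and the `run_step` callback are hypothetical, and each step is just a description string rather than a real command.

```python
from typing import Callable, Dict, List

# Incident type -> ordered runbook steps, taken from the list above.
RUNBOOKS: Dict[str, List[str]] = {
    "high_db_cpu": [
        "run EXPLAIN ANALYZE on top 5 slow queries",
        "report results",
    ],
    "memory_leak": [
        "collect heap dump",
        "restart affected pods",
    ],
    "certificate_expiry": [
        "trigger cert renewal workflow",
        "notify infra team",
    ],
}


def execute_runbook(incident_type: str, run_step: Callable[[str], None]) -> bool:
    """Run every step of the matching runbook, in order.

    Returns False without doing anything when the incident type is unknown,
    so the caller can hand the incident to a human instead.
    """
    steps = RUNBOOKS.get(incident_type)
    if steps is None:
        return False
    for step in steps:
        run_step(step)
    return True
```

Passing `print` as `run_step` gives a dry run that only lists the steps, which may be a sensible default before letting an agent restart pods on its own.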

## Post-Mortem Template
- **Incident timeline** (chronological, UTC timestamps)
- **Root cause** (specific, not vague like "human error")
- **Impact** (users affected, revenue impact, duration)
- **What went well**
- **What went wrong**
- **Action items** (owner, due date, priority)
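
To show how an agent might fill in this template, here is a minimal sketch that renders it from structured data. The `PostMortem` and `ActionItem` classes and the `render` function are hypothetical names, and the sketch assumes timeline timestamps are timezone-aware so they can be converted to UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class ActionItem:
    description: str
    owner: str
    due_date: str  # e.g. "2026-04-01"
    priority: str


@dataclass
class PostMortem:
    timeline: List[Tuple[datetime, str]]  # timezone-aware timestamps
    root_cause: str
    impact: str
    went_well: List[str]
    went_wrong: List[str]
    action_items: List[ActionItem]


def render(pm: PostMortem) -> str:
    """Render the post-mortem as a markdown list in template order."""
    lines = ["- **Incident timeline**"]
    # Sort chronologically and normalise every timestamp to UTC.
    for ts, event in sorted(pm.timeline, key=lambda entry: entry[0]):
        utc = ts.astimezone(timezone.utc)
        lines.append(f"  - {utc:%Y-%m-%d %H:%M} UTC: {event}")
    lines.append(f"- **Root cause**: {pm.root_cause}")
    lines.append(f"- **Impact**: {pm.impact}")
    lines.append("- **What went well**")
    lines += [f"  - {item}" for item in pm.went_well]
    lines.append("- **What went wrong**")
    lines += [f"  - {item}" for item in pm.went_wrong]
    lines.append("- **Action items**")
    for a in pm.action_items:
        lines.append(
            f"  - {a.description} (owner: {a.owner}, due: {a.due_date}, priority: {a.priority})"
        )
    return "\n".join(lines)
```
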
submitted March 15, 2026