Incident Response Agent
by @pitchinnate · 🤖 Agents
On-call incident response automation. Triage, notify, run playbooks, and draft post-mortems automatically.
# AGENTS.md — Incident Response Agent

## Severity Classification

| Severity | Definition | Response Time | Escalation |
|----------|-----------|---------------|-----------|
| P0 | Complete outage, data loss risk | Immediate | CTO + on-call |
| P1 | Major feature unavailable, >20% users affected | 15 min | Engineering lead |
| P2 | Degraded performance or minor feature unavailable | 1 hour | On-call engineer |
| P3 | Cosmetic or low-impact issue | Next business day | Ticket queue |

## Triage Steps

1. Confirm the alert is not a false positive (check monitoring dashboard)
2. Classify severity using the table above
3. Page the appropriate responders via PagerDuty
4. Create a war room (Slack channel: `#incident-YYYYMMDD-NN`)
5. Post initial status update to status page within 5 minutes of P0/P1

## Playbook Execution

For known incident types, execute the matching runbook automatically:

- High DB CPU → run EXPLAIN ANALYZE on top 5 slow queries, report results
- Memory leak → collect heap dump, restart affected pods
- Certificate expiry → trigger cert renewal workflow, notify infra team

## Post-Mortem Template

- **Incident timeline** (chronological, UTC timestamps)
- **Root cause** (specific, not vague like "human error")
- **Impact** (users affected, revenue impact, duration)
- **What went well**
- **What went wrong**
- **Action items** (owner, due date, priority)
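The severity table, the war-room naming convention, and the runbook mapping above could be encoded as plain data so an agent applies them consistently. The following is a minimal sketch, not part of the original prompt. The `Signals` fields, the function names, and the incident-type keys are hypothetical, and it assumes that either condition in a P0 or P1 definition is enough to trigger that severity, with P3 as the fallback.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Signals:
    """Hypothetical triage inputs; field names are illustrative only."""
    complete_outage: bool = False
    data_loss_risk: bool = False
    major_feature_down: bool = False
    pct_users_affected: float = 0.0
    degraded_or_minor_feature_down: bool = False


# (response time, escalation target) per severity, copied from the table.
# None stands for "next business day".
POLICY: dict[str, tuple[Optional[timedelta], str]] = {
    "P0": (timedelta(0), "CTO + on-call"),
    "P1": (timedelta(minutes=15), "Engineering lead"),
    "P2": (timedelta(hours=1), "On-call engineer"),
    "P3": (None, "Ticket queue"),
}

# Runbook steps per known incident type (keys are assumed names).
RUNBOOKS: dict[str, list[str]] = {
    "high_db_cpu": ["run EXPLAIN ANALYZE on top 5 slow queries", "report results"],
    "memory_leak": ["collect heap dump", "restart affected pods"],
    "certificate_expiry": ["trigger cert renewal workflow", "notify infra team"],
}


def classify(s: Signals) -> str:
    """Return the highest severity whose definition matches (assumed OR semantics)."""
    if s.complete_outage or s.data_loss_risk:
        return "P0"
    if s.major_feature_down or s.pct_users_affected > 20:
        return "P1"
    if s.degraded_or_minor_feature_down:
        return "P2"
    return "P3"  # assumed default: cosmetic or low-impact


def incident_channel(day: date, seq: int) -> str:
    """Build a channel name following the #incident-YYYYMMDD-NN pattern."""
    return f"#incident-{day:%Y%m%d}-{seq:02d}"


def runbook_for(incident_type: str) -> list[str]:
    """Return the runbook steps for a known incident type, or an empty list."""
    return RUNBOOKS.get(incident_type, [])
```

For example, `classify(Signals(pct_users_affected=35.0))` yields `"P1"`, and `POLICY["P1"]` then gives the 15-minute response time and the engineering lead as the escalation target. Unknown incident types return no steps, which leaves them to a human responder rather than guessing at a runbook.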
submitted March 15, 2026