Goal
Handle incidents calmly with a repeatable playbook.
The incident loop
- detect (alert / report)
- triage (scope + severity)
- mitigate (stop the bleeding)
- recover (restore service)
- learn (postmortem + fixes)
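One way to make the loop explicit is to model the stages as an ordered enum, so tooling (a chat bot, a ticket field) can record where an incident currently stands. This is a minimal sketch; the names `Stage`, `ORDER`, and `next_stage` are illustrative and not part of any existing tool.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    DETECT = "detect"      # alert / report
    TRIAGE = "triage"      # scope + severity
    MITIGATE = "mitigate"  # stop the bleeding
    RECOVER = "recover"    # restore service
    LEARN = "learn"        # postmortem + fixes


# Enum members iterate in definition order, which matches the loop above.
ORDER = list(Stage)


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None once LEARN is done."""
    idx = ORDER.index(current)
    return ORDER[idx + 1] if idx + 1 < len(ORDER) else None
```

For example, `next_stage(Stage.TRIAGE)` returns `Stage.MITIGATE`. Real incidents sometimes jump back a stage (a mitigation fails, triage restarts); this sketch deliberately models only the forward path.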
Triage checklist
- what changed recently?
- is it all users or one region?
- is the error rate rising, or only latency?
- is a dependency down?
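The checklist answers can be captured as four booleans and turned into a first guess at severity and next checks. The sketch below is an assumption throughout: the SEV1/SEV2/SEV3 labels and the rules that map answers to them are invented for illustration, and a team should substitute its own severity definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TriageAnswers:
    recent_change: bool    # what changed recently?
    all_users: bool        # all users (True) or one region (False)?
    errors_rising: bool    # error rate rising (True) or only latency (False)?
    dependency_down: bool  # is a dependency down?


def suggest_severity(a: TriageAnswers) -> str:
    """Hypothetical mapping from scope and impact to a severity label."""
    if a.all_users and a.errors_rising:
        return "SEV1"
    if a.all_users or a.errors_rising:
        return "SEV2"
    return "SEV3"


def first_checks(a: TriageAnswers) -> List[str]:
    """Suggest where to look first, based on the checklist answers."""
    checks = []
    if a.recent_change:
        checks.append("review or roll back the recent change")
    if a.dependency_down:
        checks.append("check the dependency's status and fallbacks")
    if not checks:
        checks.append("keep narrowing scope: region, endpoint, client version")
    return checks
```

The output is a starting point for the incident lead, not a decision: the lead can always override the suggested severity.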
Communication (simple rules)
- one incident lead
- update regularly (every 15-30 min)
- write what you know + what you are doing next
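A small formatter can keep updates consistent with these rules: it names the lead, states what is known and what happens next, and commits to the next update time. This is a sketch under assumptions; `format_update` and its message layout are hypothetical, and the 15-30 minute bound simply mirrors the rule above.

```python
from datetime import datetime, timedelta


def format_update(lead: str, known: str, doing_next: str,
                  now: datetime, interval_min: int = 30) -> str:
    """Build a status update that also promises the next update time."""
    if not 15 <= interval_min <= 30:
        raise ValueError("update interval should be 15-30 minutes")
    next_at = now + timedelta(minutes=interval_min)
    return (
        f"[{now:%H:%M}] Incident lead: {lead}\n"
        f"What we know: {known}\n"
        f"Doing next: {doing_next}\n"
        f"Next update by: {next_at:%H:%M}"
    )
```

Usage: `format_update("sam", "checkout errors in one region", "rolling back last deploy", datetime.now(), 15)` yields a four-line message ready to paste into the incident channel.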
Postmortem format
- timeline
- root cause
- contributing factors
- action items (owners + dates)
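The format maps naturally onto a small record type, which makes it easy to check that every action item really has an owner and a date. The sketch below is illustrative; the class and field names are assumptions, not an established schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


@dataclass
class Postmortem:
    timeline: List[str] = field(default_factory=list)
    root_cause: str = ""
    contributing_factors: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)


def incomplete_items(pm: Postmortem) -> List[str]:
    """Return descriptions of action items missing an owner or a due date."""
    return [a.description for a in pm.action_items
            if not a.owner or a.due is None]
```

Running `incomplete_items` before a postmortem is closed is one way to catch action items that would otherwise have nobody responsible for them.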
Next step
Turn the playbook into runbooks and automate the common fixes.
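A first step toward automation could be a registry that maps a known symptom to a scripted fix and falls back to the manual runbook otherwise. Everything here is hypothetical: the `runbook` decorator, the `bad_deploy` symptom name, and the placeholder fix stand in for whatever deploy tooling a team actually uses.

```python
from typing import Callable, Dict

# Maps a symptom name to the function that applies its automated fix.
RUNBOOKS: Dict[str, Callable[[], str]] = {}


def runbook(symptom: str):
    """Decorator that registers a fix function under a symptom name."""
    def register(fn: Callable[[], str]) -> Callable[[], str]:
        RUNBOOKS[symptom] = fn
        return fn
    return register


@runbook("bad_deploy")
def roll_back_deploy() -> str:
    # Placeholder: a real fix would call your deploy tooling here.
    return "rolled back to previous release"


def run_fix(symptom: str) -> str:
    """Run the automated fix for a symptom, if one is registered."""
    fix = RUNBOOKS.get(symptom)
    if fix is None:
        return "no automated fix; follow the manual runbook"
    return fix()
```

Keeping the fallback explicit means an unknown symptom never fails silently: responders are pointed back to the written runbook.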