Production incident triage (10 minutes)
Goal: Go from an unexplained outage to a clear, evidence-backed hypothesis.
- Check service status and recent restarts (systemd).
- Inspect the unit's logs and system-wide errors (journalctl, syslog).
- Check resource pressure: CPU, memory, disk (uptime, free, df).
- Check network listeners and upstream connectivity (ss, curl).
- Capture findings in notes and apply the smallest safe fix.
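
The checklist above can be sketched as one small script, so every step runs the same way and the output lands in a single place. This is a minimal sketch, not a definitive runbook: the helper names run_step and triage, the 15-minute window, the 50-line limit, and the 5-second curl timeout are all assumptions chosen for illustration. It is read-only and applies no fix.

```shell
#!/usr/bin/env bash
# Triage sketch. The unit name and URL passed to triage() are placeholders
# supplied by the caller; nothing here comes from a specific service.

# Print a header, then run the command if it is installed.
# A failing or missing command is reported and the run continues.
run_step() {
  local title=$1; shift
  printf '== %s ==\n' "$title"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || printf '(exit status %s)\n' "$?"
  else
    printf 'skipped: %s not found\n' "$1"
  fi
}

# Walk the checklist in order: status, logs, resources, network.
triage() {
  local unit=$1 url=$2   # hypothetical unit name and upstream URL
  run_step "status"        systemctl status "$unit" --no-pager
  run_step "restarts"      systemctl show "$unit" -p NRestarts -p ActiveEnterTimestamp
  run_step "unit logs"     journalctl -u "$unit" --since=-15min --no-pager -n 50
  run_step "system errors" journalctl -p err --since=-15min --no-pager -n 50
  run_step "load"          uptime
  run_step "memory"        free -m
  run_step "disk"          df -h
  run_step "listeners"     ss -tlnp
  run_step "upstream"      curl -sS -o /dev/null -m 5 -w '%{http_code}\n' "$url"
}
```

To keep the output as incident notes, one option is to pipe it through tee. Here myapp.service and the health URL are made-up examples:

```shell
triage myapp.service http://127.0.0.1:8080/health | tee "triage-$(date +%Y%m%dT%H%M%S).log"
```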