Goal
Know what to measure and how to use telemetry to make decisions.
The three signals
- metrics: numeric measurements aggregated over time (latency, error rate)
- logs: timestamped records of discrete events, with context
- traces: the path of a single request across services
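To make the three signals concrete, here is a minimal sketch of one request emitting one of each. The function name `handle_checkout`, the metric name, and the in-memory lists standing in for a telemetry backend are all hypothetical; a real service would more likely use a telemetry library.

```python
import json
import time
import uuid


def handle_checkout(metrics, logs, spans):
    """Emit one metric, one log, and one span for a single hypothetical request."""
    trace_id = uuid.uuid4().hex  # shared ID that ties the signals together
    start = time.monotonic()
    # ... the real work of the request would happen here ...
    duration_ms = (time.monotonic() - start) * 1000.0

    # Metric: a named number, meant to be aggregated over time.
    metrics.append({"name": "checkout.duration_ms", "value": duration_ms})
    # Log: a discrete event with context, here serialized as JSON.
    logs.append(json.dumps({"level": "info", "msg": "checkout ok", "trace_id": trace_id}))
    # Trace span: one step of the request, linked to other steps by trace_id.
    spans.append({"trace_id": trace_id, "name": "checkout", "duration_ms": duration_ms})
    return trace_id


# Hypothetical in-memory sinks, used only for illustration.
metrics, logs, spans = [], [], []
trace_id = handle_checkout(metrics, logs, spans)
```

Carrying the same `trace_id` in the log and the span is one common way to jump from a log line to the trace it belongs to.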
Start with RED (for services)
- Rate (requests/second)
- Errors (error rate)
- Duration (latency)
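The RED signals can be sketched as a small calculation over request records for one time window. The `Request` type, the `red_summary` function, and the choice to count only 5xx statuses as errors are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float    # seconds since the start of the window
    duration_ms: float  # how long the request took
    status: int         # HTTP status code


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_rate": 0.0, "mean_duration_ms": 0.0}
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": total / window_seconds,
        "error_rate": errors / total,
        "mean_duration_ms": sum(r.duration_ms for r in requests) / total,
    }


sample = [
    Request(0.0, 20.0, 200),
    Request(0.5, 40.0, 200),
    Request(1.0, 60.0, 500),
    Request(1.5, 80.0, 200),
]
summary = red_summary(sample, window_seconds=2.0)
```

The mean is used here only to keep the sketch short; for Duration, percentiles usually say more than an average because a few slow requests can hide behind a healthy mean.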
And USE (for infrastructure)
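- Utilization (share of time the resource is busy)
- Saturation (queued work the resource cannot service yet)
- Errors (count of error events)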
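As a rough sketch, the USE signals for one resource can be derived from a few raw samples. The function `use_summary` and its inputs (busy time, queue depth samples, an error counter) are hypothetical; where these numbers come from depends on the resource and the operating system.

```python
def use_summary(busy_seconds, interval_seconds, queue_depth_samples, error_count):
    """Compute Utilization, Saturation, and Errors for one resource over one interval."""
    # Utilization: fraction of the interval the resource spent doing work.
    utilization = busy_seconds / interval_seconds
    # Saturation: average amount of work waiting in the queue.
    if queue_depth_samples:
        saturation = sum(queue_depth_samples) / len(queue_depth_samples)
    else:
        saturation = 0.0
    return {"utilization": utilization, "saturation": saturation, "errors": error_count}


# Hypothetical samples for a disk over a 60-second interval.
disk = use_summary(
    busy_seconds=45.0,
    interval_seconds=60.0,
    queue_depth_samples=[0, 2, 4, 2],
    error_count=1,
)
```

High utilization with an empty queue is often fine; a queue that keeps growing suggests the resource is the bottleneck.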
Practical dashboard checklist
- latency p50/p95/p99
- error rate
- throughput
- CPU/memory utilization
- dependency health (error rate and latency of downstream calls)
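For the latency entries in the checklist, here is a minimal sketch of computing p50/p95/p99 from raw samples with the nearest-rank method. The `percentile` helper is written for this example and expects an integer percentile; monitoring systems often estimate percentiles from histogram buckets instead, so their numbers may differ slightly.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Plotting all three together shows whether a latency problem affects typical requests (p50) or only the slowest ones (p99).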
Next Step
Incident response: use these signals to triage and recover faster.