Monitoring with Prometheus and Grafana
Without metrics, every incident is a guess. Prometheus + Grafana + Alertmanager is the open-source default for collecting metrics, drawing them, and waking you up when they cross a threshold. It powers monitoring at thousands of organizations because the model is simple, the query language is powerful, and the ecosystem of exporters covers nearly every piece of software you might run.
This guide covers the architecture, the day-1 setup, the queries you will actually use, and the alerting hygiene that separates a useful pager from one everyone learns to ignore.
The architecture
The pieces you need:
- Exporters expose metrics over HTTP at /metrics in a simple text format. Examples: node_exporter for OS-level metrics, cAdvisor for containers, postgres_exporter for PostgreSQL, or your own application instrumented with a Prometheus client library.
- Prometheus scrapes those endpoints on a schedule and stores the samples in a local time-series database (TSDB). It also evaluates recording rules and alert rules.
- Grafana queries Prometheus (and other data sources) and renders dashboards.
- Alertmanager receives firing alerts from Prometheus, deduplicates and groups them, and routes notifications to Slack, PagerDuty, email, OpsGenie, etc.
Prometheus is pull-based: it reaches out and scrapes targets. This is unusual compared to push-based systems like StatsD, and it has consequences. Targets must be reachable from the Prometheus server. Short-lived jobs that may exit before the next scrape need a Pushgateway, and ephemeral targets need a service discovery mechanism (Kubernetes API, Consul, EC2 tags).
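To make the exporter side of this concrete, here is a minimal Python sketch of a /metrics endpoint that renders the text exposition format using only the standard library. The metric name app_requests_total, its labels, and port 8000 are hypothetical; a real application would normally use an official Prometheus client library rather than hand-rendering text.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory metrics: name -> (type, help text, {labels: value}).
METRICS = {
    "app_requests_total": (
        "counter",
        "Total HTTP requests served.",
        {(("method", "GET"), ("status", "200")): 1027},
    ),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (mtype, help_text, series) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        for labels, value in series.items():
            rendered = ",".join(f'{k}="{v}"' for k, v in labels)
            selector = f"{{{rendered}}}" if rendered else ""
            lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics at /metrics and 404 everywhere else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, waiting for Prometheus to pull from this process."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. Note that the process never contacts Prometheus itself; it only answers when scraped, which is the pull model in miniature.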
A minimal scrape config
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: app
    metrics_path: /metrics
    static_configs:
      - targets: ['web:3000']
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
The kubernetes-pods job is the workhorse on Kubernetes: any pod annotated with prometheus.io/scrape: "true" is automatically discovered and scraped. No manual target lists.
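The keep action in that relabel rule can be pictured as a filter over discovered targets. The Python sketch below is a simplified model, not Prometheus's implementation: the pod dictionaries are made-up examples, and the only behaviors it mirrors are that relabel regexes are fully anchored and that a missing label is treated as an empty string.

```python
import re

ANNOTATION_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"


def keep(targets, source_label, regex):
    """Keep only targets whose source label value fully matches `regex`."""
    pattern = re.compile(regex)
    return [
        target for target in targets
        if pattern.fullmatch(target.get(source_label, ""))
    ]


# Hypothetical discovered pods, reduced to the labels that matter here.
discovered = [
    {"pod": "web-1", ANNOTATION_LABEL: "true"},
    {"pod": "web-2", ANNOTATION_LABEL: "false"},
    {"pod": "batch-1"},                             # no annotation at all
    {"pod": "odd-1", ANNOTATION_LABEL: "truest"},   # rejected: full match only
]

scraped = keep(discovered, ANNOTATION_LABEL, "true")
```

Only web-1 survives the filter, which is why un-annotated pods are silently ignored rather than scraped.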
The four metric types
Prometheus client libraries offer four metric types, and choosing the right one matters:
- Counter: a monotonically increasing total. Use for "how many requests have we served." Wrap in rate() to get a per-second rate. A counter resets to zero when the process restarts; rate() detects the reset and compensates for it.
- Gauge: a current value that goes up and down. Use for "how many active connections," "memory usage," "queue depth."
- Histogram: bucketed observations. Use for latency. The bucket boundaries are predefined; histogram_quantile computes percentiles on the server side.
- Summary: quantiles computed on the client side. Cheaper to query, but quantiles cannot be aggregated across instances. Prefer histograms.
A common newbie mistake is using a Gauge for something that should be a Counter — you cannot compute a rate from a gauge that gets reset arbitrarily.
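The reset handling that makes counters safe to rate can be sketched in a few lines of Python. This is a simplified model that assumes any drop in value means the counter restarted from zero; it ignores the extrapolation to window boundaries that the real rate() and increase() functions perform.

```python
def increase(samples):
    """Total increase across counter samples, tolerating resets to zero."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr < prev:
            # The counter went backwards, so assume the process restarted
            # and everything counted since the restart is new.
            total += curr
        else:
            total += curr - prev
    return total


# A counter that restarts between the third and fourth scrape.
counter_samples = [100, 110, 120, 5, 15]

reset_aware = increase(counter_samples)                # 10 + 10 + 5 + 10
naive = counter_samples[-1] - counter_samples[0]       # last minus first
```

The reset-aware result is 35, while the naive difference is -85. The same assumption applied to a gauge would misread every ordinary decrease as a restart, which is why rates over gauges are meaningless.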
PromQL — the queries you will actually run
# Per-instance CPU usage (%)
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
# Memory pressure (used / total)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
# 95th percentile request latency, per route, last 5 minutes
histogram_quantile(0.95,
sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))
# Error rate (5xx) per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m]))
# Pods restarting in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0
# Disk predicted to run out of free space within 4 hours (linear fit over the last hour)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
The mental model: most queries are aggregation_function by (label) (rate(metric[range])). Internalize sum, avg, max, rate, increase, and histogram_quantile and you can write 90% of the queries you need.
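Of those functions, histogram_quantile is the least intuitive, because it estimates a percentile from bucket counts rather than from raw observations. The Python sketch below models the idea under simplifying assumptions: classic cumulative buckets, a lowest bucket that starts at 0, and linear interpolation inside the bucket that contains the target rank. The bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile `q` from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: the best available
                # answer is the highest finite bucket boundary.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: (le, cumulative count).
latency = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]

p90 = histogram_quantile(0.90, latency)
p995 = histogram_quantile(0.995, latency)
```

Here p90 lands about two thirds of the way through the 0.25 to 0.5 bucket, and p995 is capped at 1.0 because the rank falls in the +Inf bucket. This is also why the le label must survive aggregation, as in the sum by (le, route) query above, and why the estimate is only as precise as the bucket boundaries.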
A practical alert rule
groups:
  - name: web
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 10m
        labels:
          severity: page
          team: web
        annotations:
          summary: "5xx error rate >5% on {{ $labels.service }}"
          description: |
            Service {{ $labels.service }} is returning >5% 5xx responses.
            Check recent deploys and downstream dependencies.
          runbook: "https://runbooks.example.com/web/high-error-rate"
The for: 10m clause is critical — without it, every transient blip pages someone. Tune the threshold and duration based on your error budget, not someone's gut feeling.
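The effect of the for clause is easiest to see as a small state machine: an alert is pending from the first evaluation where the expression is true, and it only becomes firing once the expression has stayed true for the whole duration. The Python sketch below is an illustrative model with made-up error ratios, a 60-second evaluation interval, and a shortened 120-second hold so the example stays small.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return the alert state after each evaluation of the rule."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value <= threshold:
            # Any evaluation below the threshold clears the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_long_enough = now - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# Hypothetical error ratios, one per evaluation cycle.
error_ratios = [0.01, 0.08, 0.02, 0.07, 0.09, 0.06, 0.01]

states = alert_states(error_ratios, threshold=0.05,
                      for_seconds=120, interval_seconds=60)
```

The single spike in the second evaluation never gets past pending; only the sustained breach at the end reaches firing, and it clears as soon as the ratio recovers.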
Alertmanager routing
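route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-default'
  routes:
    # match still works but is deprecated in favor of matchers.
    - match: { severity: page }
      receiver: 'pagerduty'
      continue: true   # keep evaluating sibling routes after paging
    - match: { team: data }
      receiver: 'slack-data'
receivers:
  - name: 'slack-default'
    slack_configs:
      - channel: '#alerts'
        # Placeholder: Alertmanager does not expand environment variables,
        # so substitute the real value before loading this file.
        api_url: '${SLACK_WEBHOOK}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '${PD_KEY}'   # placeholder, see the note above
  - name: 'slack-data'
    slack_configs:
      - channel: '#alerts-data'   # relies on a global slack_api_url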
Grouping prevents an outage from firing 200 separate alerts; deduplication ensures the same alert from multiple Prometheus replicas fires once; routing sends the right alert to the right team.
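Grouping can be modeled as bucketing alerts by the values of the group_by labels, with one notification per bucket. The Python sketch below illustrates that idea with made-up alerts; it is not Alertmanager's code and it ignores timing (group_wait, group_interval).

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical outage: 200 instances of one service fail at once,
# plus one unrelated alert from another service.
alerts = [
    {"alertname": "HighErrorRate", "service": "web", "instance": f"web-{i}"}
    for i in range(200)
]
alerts.append({"alertname": "DiskFull", "service": "db", "instance": "db-0"})

groups = group_alerts(alerts, ["alertname", "service"])
```

With group_by set to alertname and service, 201 alerts collapse into two groups, so on-call receives two notifications instead of 201.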
SRE-style golden signals
For every user-facing service, track the four golden signals:
- Latency: request duration at p50/p95/p99.
- Traffic: requests per second.
- Errors: rate of failed requests (split 4xx from 5xx; 4xx is usually a client problem, 5xx is yours).
- Saturation: CPU, memory, queue depth, connection pool utilization.
Build one Grafana dashboard per service that always shows these four panels and you have already eliminated the most common monitoring gap: alerts that fire long after users are already in pain.
Recording rules — speed up expensive queries
Some queries are too expensive to run on every dashboard refresh. Precompute them with recording rules:
groups:
  - name: web-recordings
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
Dashboards now query a single short-named series instead of recomputing the heavy expression every refresh.
Production architecture: HA pair, long-term storage, Alertmanager cluster
A single Prometheus is fine for a development cluster. For anything you would page on, you want at least the topology in the diagram above.
The pattern has three layers:
Two Prometheus replicas with identical scrape configuration. They run on different nodes (ideally different availability zones), scrape the same targets, and produce nearly identical data. Either replica can be queried by Grafana via a load balancer. If one fails, the other keeps scraping and alerting; you fix the broken one without losing visibility. Brief sample-level differences between the two are fine — alerting is based on rates and ratios over windows, not on individual samples.
Long-term storage (Thanos, Mimir, or Cortex) sits behind both Prometheus instances. Local Prometheus retention is best at a few weeks of high-resolution data; anything beyond that — month-over-month trends, capacity planning, post-incident analysis — wants a horizontally scalable store backed by object storage (S3, GCS, Azure Blob). The sidecar pattern (Thanos sidecar attached to each Prometheus) ships completed TSDB blocks to object storage and exposes a global query API across all replicas and all retention windows.
An Alertmanager cluster of three or more instances receives alerts from both Prometheus replicas. Alertmanager uses gossip-based clustering to deduplicate: if both Prometheus A and B fire the same alert, the cluster normally sends a single notification (a network partition between instances can still produce an occasional duplicate). Routing, grouping, silencing, and inhibition all happen at this layer. A cluster of three lets you lose one node without losing the ability to notify on-call.
A few operational guardrails for this topology:
- Identical scrape configs on both Prometheus replicas, managed in git. Drift between A and B causes mysterious "alert from A but not from B" pages.
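- Identical external_labels on both replicas apart from a distinguishing replica label (replica: A vs replica: B), so the long-term storage query layer can identify and deduplicate equivalent series. Drop the replica label from outgoing alerts (for example with alert_relabel_configs) so Alertmanager sees both copies as the same alert.
- Monitor the monitoring. A separate small Prometheus (or Grafana Cloud free tier) watches the production Prometheus pair. If both replicas die, you still find out.
- Backups for Alertmanager silences. They live in the Alertmanager state and are easy to lose during cluster recreation. amtool can export and re-import them.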
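The role of the replica label in deduplication can be shown with a short Python sketch. It assumes that two alerts count as the same alert only when their label sets are identical, so the replica label has to be removed before comparison; the alert payloads and the fingerprint helper are hypothetical.

```python
def fingerprint(labels, drop=("replica",)):
    """Identity of an alert: its sorted label pairs, minus dropped labels."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in drop))


def deduplicate(alerts, drop=("replica",)):
    """Keep the first alert seen for each fingerprint."""
    seen = {}
    for labels in alerts:
        seen.setdefault(fingerprint(labels, drop), labels)
    return list(seen.values())


# The same alert, fired independently by both Prometheus replicas.
incoming = [
    {"alertname": "HighErrorRate", "service": "web", "replica": "A"},
    {"alertname": "HighErrorRate", "service": "web", "replica": "B"},
]

with_replica_dropped = deduplicate(incoming)
with_replica_kept = deduplicate(incoming, drop=())
```

With the replica label dropped the two alerts collapse into one; with it kept they look like two distinct alerts and both would notify.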
Capacity planning
Prometheus stores roughly 1–2 bytes per sample on disk. A back-of-envelope formula, using the conservative 2 bytes and a per-series sample rate of 1 / scrape_interval:
disk = active_series × (1 / scrape_interval_seconds) × 2 bytes × retention_seconds
A 100k-series instance scraping every 15s with 30 days of retention works out to roughly 35 GB (100,000 × 1/15 × 2 × 2,592,000 ≈ 34.6 GB). Plan accordingly, and remember that high-cardinality labels (user IDs, request IDs) are the single biggest cause of unexpected blowup.
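The same estimate as a small Python helper, in case you want to try other inputs. The 2 bytes per sample default is the conservative end of the range quoted above, and the function name is illustrative; real usage also depends on churn, compaction, and the write-ahead log, which this ignores.

```python
def disk_bytes(active_series, scrape_interval_s, retention_days,
               bytes_per_sample=2.0):
    """Back-of-envelope TSDB disk usage in bytes."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return samples_per_second * bytes_per_sample * retention_seconds


baseline_gb = disk_bytes(100_000, 15, 30) / 1e9
# Ten times the series count (for example after adding a
# high-cardinality label) costs ten times the disk.
blown_up_gb = disk_bytes(1_000_000, 15, 30) / 1e9
```

The baseline comes out at about 34.6 GB, and the estimate scales linearly with series count, scrape frequency, and retention.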
Common pitfalls
- Labels with high cardinality. Never label by user ID or request path with a UUID. Each unique label value creates a new series; millions of series will OOM Prometheus.
- Alerting on raw values. Alert on rates and ratios, not absolute counts. Absolute counts grow with traffic; you do not want to retune every threshold quarterly.
- No for clause on alerts. Every blip pages.
- Pushing instead of pulling for long-lived services. Pushgateway is for batch jobs, not for everything.
- Storing forever in local Prometheus. Long retention belongs in remote storage (Thanos, Mimir, Cortex) — Prometheus itself is best at the recent few weeks.
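The cardinality pitfall is multiplicative, which a quick Python calculation makes obvious. The label names and value counts below are invented, and the product is a worst-case upper bound, since not every combination of label values necessarily occurs in practice.

```python
import math


def series_upper_bound(label_cardinalities):
    """Worst-case series count for one metric: product of label value counts."""
    return math.prod(label_cardinalities.values())


# Hypothetical HTTP request counter with three bounded labels.
bounded = {"method": 5, "status": 10, "route": 40}

# The same metric after someone adds a user_id label.
unbounded = {**bounded, "user_id": 100_000}

before = series_upper_bound(bounded)
after = series_upper_bound(unbounded)
```

Three bounded labels give at most 2,000 series; one unbounded label raises the ceiling to 200 million for a single metric.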
SLIs, SLOs, and error budgets
Once you have metrics flowing, the next leverage point is turning them into Service Level Objectives — explicit reliability targets that align engineering and product priorities.
A Service Level Indicator (SLI) is a measurement of one aspect of service health. The classic four:
- Availability: fraction of successful requests.
- Latency: fraction of requests faster than X.
- Throughput: requests per second handled.
- Quality: fraction of full-fidelity (non-degraded) responses.
A Service Level Objective (SLO) is a target value for an SLI over a window: "99.9% of API requests succeed over a rolling 30 days." Subtract from 100% and you get the error budget: 0.1% × 30 days × 24h × 60m = 43.2 minutes of allowed unavailability per month.
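The error budget arithmetic generalizes to any target and window. A minimal Python sketch, with an illustrative function name:

```python
def error_budget_minutes(slo, window_days):
    """Minutes of full unavailability allowed by `slo` over the window."""
    return (1 - slo) * window_days * 24 * 60


three_nines = error_budget_minutes(0.999, 30)
four_nines = error_budget_minutes(0.9999, 30)
```

Three nines over 30 days allows about 43.2 minutes, while four nines allows about 4.3 minutes. Each additional nine cuts the budget by a factor of ten, which is worth knowing before committing to a target.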
Express SLIs in PromQL and you get free alerting and dashboards:
# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# Error budget exhausted: 30-day error ratio exceeds the 0.1% budget
1 - (
sum(rate(http_requests_total{status!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))
) > 0.001
Multi-window, multi-burn-rate alerts (Google's SRE workbook formulation) catch both fast outages and slow degradations without paging on every blip. They are the gold standard for SLO-based alerting.
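Burn rate is the observed error ratio divided by the budgeted error ratio (1 - SLO); a burn rate of 1 spends exactly the whole budget over the SLO window. The Python sketch below assumes a 99.9% SLO over 30 days and a paging threshold of 14.4 checked over a long window and a short window (for example 1h and 5m), which corresponds to spending 2% of the budget in one hour. Function names and the sample error ratios are illustrative.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def budget_consumed(rate, window_hours, slo_window_days=30):
    """Fraction of the total budget spent at `rate` for `window_hours`."""
    return rate * window_hours / (slo_window_days * 24)


def should_page(long_window_ratio, short_window_ratio, slo, threshold=14.4):
    """Page only if both windows are burning faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo) > threshold
            and burn_rate(short_window_ratio, slo) > threshold)


SLO = 0.999

ongoing_outage = should_page(0.02, 0.02, SLO)        # 2% errors in both windows
already_recovered = should_page(0.02, 0.0005, SLO)   # short window is healthy
slow_degradation = should_page(0.002, 0.002, SLO)    # burn rate of about 2
```

Only the ongoing outage pages. The recovered case stays quiet because the short window has dropped below the threshold, and the slow degradation is a candidate for a lower-threshold, longer-window rule that opens a ticket instead.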
The cultural value of SLOs is at least as important as the technical value: they replace "is it up?" arguments with "have we burned 60% of our budget this month?" — a shared, quantified question both engineers and product managers can reason about.
Alert hygiene — fewer, better pages
The fastest path to a useless on-call rotation is too many alerts. A few rules of thumb:
- Every alert needs a runbook. If you cannot write three concrete steps the responder should take, the alert is not actionable.
- Page only on user impact. "CPU is high" is not a page; "p99 latency >2s for 10 minutes" is.
- Tickets, not pages, for slow burns. Disk filling at the current rate in three days is a ticket. Disk full now is a page.
- Quarterly alert review. Delete every alert nobody acted on in the last quarter. They are training your team to ignore the pager.
- Track MTTA and MTTR per alert. If acknowledgment is fast but resolution is slow, the runbook is bad. If acknowledgment is slow, the alert is probably noise.
A team with twenty alerts they trust outperforms a team with two hundred alerts they ignore. Aggressive pruning is a feature, not a regression.
What good monitoring buys you
A well-tuned Prometheus stack tells you, within seconds, what is broken, where, and how badly — and routes that information to the human who can do something about it. It also tells you, just as importantly, when nothing is broken, so you can trust silence. That trust is what makes on-call sustainable.
The investment in dashboards and alert rules pays compounding interest. Every incident becomes a chance to add the panel or rule that would have caught it earlier — and the next time something similar happens, your monitoring catches it before users do.
A team that takes monitoring seriously will, over a year, accumulate a per-service playbook so detailed that almost any incident maps to a known dashboard panel and a known runbook. That is the goal — not a bigger pile of metrics, but a sharper, faster path from alert to root cause to fix. Prometheus and Grafana are just the substrate; the real work is the steady, deliberate curation of what you measure and how you respond.