Monitoring with Prometheus and Grafana

Monitoring stack: exporters scraped by Prometheus, visualized in Grafana, alerts via Alertmanager

Without metrics, every incident is a guess. Prometheus + Grafana + Alertmanager is the open-source default for collecting metrics, drawing them, and waking you up when they cross a threshold. It powers monitoring at thousands of organizations because the model is simple, the query language is powerful, and the ecosystem of exporters covers nearly every piece of software you might run.

This guide covers the architecture, the day-1 setup, the queries you will actually use, and the alerting hygiene that separates a useful pager from one everyone learns to ignore.

The architecture

The pieces you need:

Exporters expose metrics over HTTP at /metrics in a simple text format. Examples: node_exporter for OS-level metrics, cAdvisor for containers, postgres_exporter for PostgreSQL, your own application instrumented with a Prometheus client library.
Prometheus scrapes those endpoints on a schedule and stores the samples in a local time-series database (TSDB). It also evaluates recording rules and alert rules.
Grafana queries Prometheus (and other data sources) and renders dashboards.
Alertmanager receives firing alerts from Prometheus, deduplicates and groups them, and routes notifications to Slack, PagerDuty, email, OpsGenie, etc.
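
To make the exporter side of this concrete, here is a minimal sketch of a /metrics endpoint built with only the Python standard library. The metric name, the label values, and the REQUESTS dictionary are hypothetical, and label-value escaping is not handled; a real application would normally use an official Prometheus client library rather than hand-rendering the text format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter samples: {label pairs: current value}.
REQUESTS = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "POST"), ("status", "500")): 3,
}


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics at /metrics and returns 404 elsewhere."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_counter(
            "http_requests_total", "Total HTTP requests served.", REQUESTS
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics; call this from your own entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a scrape job at that port would be enough for Prometheus to start collecting the counter.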

Prometheus is pull-based: it reaches out and scrapes targets. This is unusual compared to push-based systems like StatsD, and it has consequences. Targets must be reachable from the Prometheus server; short-lived jobs that may exit before the next scrape need a Pushgateway; and ephemeral targets need a service discovery mechanism (Kubernetes API, Consul, EC2 tags).

A minimal scrape config

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: app
    metrics_path: /metrics
    static_configs:
      - targets: ['web:3000']

  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

The kubernetes-pods job is the workhorse on Kubernetes: any pod annotated with prometheus.io/scrape: "true" is automatically discovered and scraped. No manual target lists.
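
The keep rule is easier to reason about as a filter. The sketch below mimics it in Python under two assumptions that match Prometheus relabeling behavior: the regex is anchored at both ends, and a missing label reads as an empty string. The pod addresses are made up, and Python's re module stands in for the RE2 syntax Prometheus actually uses.

```python
import re

# The meta label that the relabel rule in the scrape config reads.
SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"


def keep_matching(targets, source_label=SCRAPE_LABEL, regex="true"):
    """Mimic `action: keep`: retain targets whose label value fully matches."""
    pattern = re.compile(regex)
    # A missing label is treated as an empty string, which "true" does not match.
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical discovered pods.
pods = [
    {"__address__": "10.0.0.4:3000", SCRAPE_LABEL: "true"},
    {"__address__": "10.0.0.5:3000", SCRAPE_LABEL: "false"},
    {"__address__": "10.0.0.6:3000"},  # no annotation at all
    {"__address__": "10.0.0.7:3000", SCRAPE_LABEL: "truethy"},  # anchored: no match
]
kept = keep_matching(pods)
```

Only the first pod survives the filter; everything else is dropped before Prometheus ever attempts a scrape.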

The four metric types

Prometheus client libraries offer four metric types, and choosing the right one matters:

Counter — monotonically increasing total. Use for "how many requests have we served." Wrap in rate() to get a per-second rate. A counter resets to zero when the process restarts; rate() and increase() detect the drop and compensate for it.
Gauge — current value that goes up and down. Use for "how many active connections," "memory usage," "queue depth."
Histogram — bucketed observations. Use for latency. The bucket boundaries are pre-defined; histogram_quantile estimates percentiles on the server side by interpolating within buckets, so accuracy depends on where the boundaries sit.
Summary — quantiles computed on the client side. Cheaper to query, but quantiles cannot be aggregated across instances. Prefer histograms.

A common newbie mistake is using a Gauge for something that should be a Counter — you cannot compute a rate from a gauge that gets reset arbitrarily.
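
The reason counters tolerate restarts is worth seeing in code. This is a simplified sketch of reset handling, not Prometheus's actual implementation: the real rate() also extrapolates to the edges of the range window, which is omitted here. The function name simple_rate and the sample values are illustrative.

```python
def simple_rate(samples):
    """Per-second rate over (timestamp_seconds, value) samples of a counter.

    A drop in value is treated as a counter reset: the counter is assumed to
    have restarted from zero, so the post-reset value counts as new increase.
    """
    if len(samples) < 2:
        return None  # not enough data in the window to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# The process restarts between t=15 and t=30, so the counter falls from 130 to 10.
window = [(0, 100), (15, 130), (30, 10), (45, 40)]
rate = simple_rate(window)  # (30 + 10 + 30) / 45 requests per second
```

A gauge gives no such guarantee: a drop might be a reset or a genuine decrease, so there is no safe way to apply the same correction.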

PromQL — the queries you will actually run

# Per-instance CPU usage (%)
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Memory pressure (used / total)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# 95th percentile request latency, per route, last 5 minutes
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))

# Error rate (5xx) per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))

# Pods restarting in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

# Disk predicted to run out of free space within 4 hours (linear fit over the last hour)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0

The mental model: most queries are aggregation_function by (label) (rate(metric[range])). Internalize sum, avg, max, rate, increase, and histogram_quantile and you can write 90% of the queries you need.
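
Of those functions, histogram_quantile is the least intuitive, because it estimates a percentile from bucket counts rather than from raw observations. The sketch below shows the general idea for classic cumulative buckets, assuming linear interpolation inside the bucket and a lower bound of 0 for the first bucket. The bucket values are made up, and edge cases that the real function handles (negative bounds, non-monotonic counts) are ignored.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket,
    mirroring the `le` label on *_bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical request-duration buckets (seconds, cumulative count).
latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency)
```

Here the 95th percentile lands halfway through the 0.5s to 1.0s bucket. The estimate can never be more precise than the bucket it falls into, which is why bucket boundaries should bracket the latencies you care about.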

A practical alert rule

groups:
  - name: web
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 10m
        labels:
          severity: page
          team: web
        annotations:
          summary: "5xx error rate >5% on {{ $labels.service }}"
          description: |
            Service {{ $labels.service }} is returning >5% 5xx responses.
            Check recent deploys and downstream dependencies.
          runbook: "https://runbooks.example.com/web/high-error-rate"

The for: 10m clause is critical — without it, every transient blip pages someone. Tune the threshold and duration based on your error budget, not someone's gut feeling.
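
A small simulation shows what the for clause does between evaluations. It is a sketch of the inactive, pending, firing progression, assuming the 15s evaluation interval and 10m duration used above; the function name and inputs are illustrative.

```python
def alert_states(condition_by_eval, eval_interval_s=15, for_s=600):
    """Return 'inactive', 'pending', or 'firing' for each rule evaluation."""
    states = []
    active_since = None  # time the condition first became true, if still true
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


blip = alert_states([False, True, True, False])
sustained = alert_states([True] * 41)
```

The two-evaluation blip never leaves pending, so nobody is paged. The sustained condition only starts firing at the 41st evaluation, ten minutes after it first became true.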

Alertmanager routing

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-default'
  routes:
    - match: { severity: page }
      receiver: 'pagerduty'
      continue: true
    - match: { team: data }
      receiver: 'slack-data'

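# Alertmanager does not expand environment variables in its config file.
# Treat ${SLACK_WEBHOOK} and ${PD_KEY} as placeholders to substitute
# before loading (for example with envsubst).
receivers:
  - name: 'slack-default'
    slack_configs:
      - channel: '#alerts'
        api_url: '${SLACK_WEBHOOK}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '${PD_KEY}'
  - name: 'slack-data'
    slack_configs:
      - channel: '#alerts-data'
        api_url: '${SLACK_WEBHOOK}'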

Grouping prevents an outage from firing 200 separate alerts; deduplication ensures the same alert from multiple Prometheus replicas fires once; routing sends the right alert to the right team.
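
Grouping is just bucketing by label values. The sketch below illustrates the effect of group_by on a flood of alerts; it models only the bucketing step, not the group_wait or group_interval timers, and the alert labels are invented for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "service")):
    """Bucket alerts (dicts of labels) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical outage: 200 instances of one service trip the same alert.
outage = [
    {"alertname": "HighErrorRate", "service": "checkout", "instance": f"web-{i}"}
    for i in range(200)
]
outage.append({"alertname": "HighErrorRate", "service": "search", "instance": "web-0"})
groups = group_alerts(outage)
```

Two hundred and one alerts collapse into two groups, and therefore two notifications, one per affected service.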

SRE-style golden signals

For every user-facing service, track the four golden signals:

Latency — request duration p50/p95/p99.
Traffic — requests per second.
Errors — non-2xx rate (split into 4xx vs 5xx — 4xx is usually a client problem, 5xx is yours).
Saturation — CPU, memory, queue depth, connection pool utilization.

Build one Grafana dashboard per service that always shows these four panels and you have already eliminated the most common monitoring gap: alerts that fire long after users are already in pain.

Recording rules — speed up expensive queries

Some queries are too expensive to run on every dashboard refresh. Precompute them with recording rules:

groups:
  - name: web-recordings
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))

Dashboards now query a single short-named series instead of recomputing the heavy expression every refresh.

Production architecture: HA pair, long-term storage, Alertmanager cluster

Two Prometheus replicas scrape the same targets, write to long-term storage, and feed an Alertmanager cluster that deduplicates notifications

A single Prometheus is fine for a development cluster. For anything you would page on, you want at least the topology in the diagram above.

The pattern has three layers:

Two Prometheus replicas with identical scrape configuration. They run on different nodes (ideally different availability zones), scrape the same targets, and produce nearly identical data. Either replica can be queried by Grafana via a load balancer. If one fails, the other keeps scraping and alerting; you fix the broken one without losing visibility. Brief sample-level differences between the two are fine — alerting is based on rates and ratios over windows, not on individual samples.

Long-term storage (Thanos, Mimir, or Cortex) sits behind both Prometheus instances. Local Prometheus retention is best at a few weeks of high-resolution data; anything beyond that (month-over-month trends, capacity planning, post-incident analysis) wants a horizontally scalable store backed by object storage (S3, GCS, Azure Blob). With Thanos, a sidecar attached to each Prometheus ships completed TSDB blocks to object storage, and the Thanos querier exposes a global query API across all replicas and all retention windows. Mimir and Cortex instead receive samples from each Prometheus via remote_write.

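An Alertmanager cluster of three or more instances receives alerts from both Prometheus replicas. Each Prometheus should be configured to send to every Alertmanager instance directly, not through a load balancer. The instances gossip notification state to deduplicate: if both Prometheus A and B fire the same alert, the cluster normally sends a single notification, although during a network partition it may send duplicates rather than risk sending none. Routing, grouping, silencing, and inhibition all happen at this layer. A cluster of three lets you lose one node without losing the ability to notify on-call.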

A few operational guardrails for this topology:

Identical scrape configs on both Prometheus replicas, managed in git. Drift between A and B causes mysterious "alert from A but not from B" pages.
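Same external_labels on both replicas, apart from a replica label (replica: A vs replica: B) that lets the long-term store identify and deduplicate equivalent series. Drop that label from outgoing alerts (alert_relabel_configs) so Alertmanager sees identical alerts from both replicas.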
Monitor the monitoring. A separate small Prometheus (or Grafana Cloud free tier) watches the production Prometheus pair. If both replicas die, you still find out.
Backups for Alertmanager silences. They live in the Alertmanager state and are easy to lose during cluster recreation. amtool can export and re-import them.
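
The external-labels guardrail matters because Alertmanager treats two alerts as the same only when their label sets are identical. The toy model below illustrates this, assuming the per-replica label is named replica; the functions are illustrative and are not Alertmanager's actual code.

```python
def identity(labels, drop=()):
    """Identify an alert by its sorted label pairs, ignoring labels in `drop`."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in drop))


def deduplicate(alerts, drop=("replica",)):
    """Collapse alerts that are identical once the `drop` labels are removed."""
    unique = {}
    for labels in alerts:
        unique.setdefault(identity(labels, drop), labels)
    return list(unique.values())


# The same alert as sent by two hypothetical Prometheus replicas.
from_both = [
    {"alertname": "HighErrorRate", "service": "checkout", "replica": "A"},
    {"alertname": "HighErrorRate", "service": "checkout", "replica": "B"},
]
with_label_dropped = deduplicate(from_both)
with_label_kept = deduplicate(from_both, drop=())
```

With the replica label removed the two alerts collapse into one; with it left in place they look like two distinct alerts and both would notify.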

Capacity planning

Prometheus stores ~1–2 bytes per sample on disk. A back-of-envelope formula:

disk = active_series × (1 / scrape_interval_seconds) × 2 bytes × retention_seconds

A 100k-series instance scraping every 15s with 30 days retention works out to roughly 35GB. Plan accordingly, and remember that high-cardinality labels (user IDs, request IDs) are the single biggest cause of unexpected blowup.
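
The same arithmetic as a small helper, so you can plug in your own numbers. It assumes the 2 bytes per sample upper estimate from above and ignores WAL and compaction overhead; the function name is illustrative.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Back-of-envelope TSDB disk usage for a single Prometheus instance."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return samples_per_second * bytes_per_sample * retention_seconds


# 100k series, 15s scrape interval, 30 days of retention.
estimate_gb = estimate_disk_bytes(100_000, 15, 30) / 1e9
```

For the example above this gives about 34.6 GB. Doubling the series count or halving the scrape interval doubles the estimate, which is why cardinality is the first thing to check when disk usage surprises you.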

Common pitfalls

Labels with high cardinality. Never label by user ID or request path with a UUID. Each unique label value creates a new series; millions of series will OOM Prometheus.
Alerting on raw values. Alert on rates and ratios, not absolute counts. Absolute counts grow with traffic; you do not want to retune every threshold quarterly.
No for clause on alerts. Every blip pages.
Pushing instead of pulling for long-lived services. Pushgateway is for batch jobs, not for everything.
Storing forever in local Prometheus. Long retention belongs in remote storage (Thanos, Mimir, Cortex) — Prometheus itself is best at the recent few weeks.

SLIs, SLOs, and error budgets

Once you have metrics flowing, the next leverage point is turning them into Service Level Objectives — explicit reliability targets that align engineering and product priorities.

A Service Level Indicator (SLI) is a measurement of one aspect of service health. The classic four:

Availability: fraction of successful requests.
Latency: fraction of requests faster than X.
Throughput: requests per second handled.
Quality: fraction of full-fidelity (non-degraded) responses.

A Service Level Objective (SLO) is a target value for an SLI over a window: "99.9% of API requests succeed over a rolling 30 days." Subtract from 100% and you get the error budget: 0.1% × 30 days × 24h × 60m = 43.2 minutes of allowed unavailability per month.
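
The budget arithmetic generalizes to any target and window. A minimal sketch follows, with illustrative function names; budget_used_fraction assumes a request-based SLI, that is, failed requests divided by total requests.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability allowed by an SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_used_fraction(failed_requests, total_requests, slo):
    """Share of the error budget consumed so far, for a request-based SLI."""
    error_ratio = failed_requests / total_requests
    return error_ratio / (1 - slo)


three_nines = error_budget_minutes(0.999)  # about 43.2 minutes per 30 days
used = budget_used_fraction(600, 1_000_000, 0.999)  # about 0.6, i.e. 60% of budget
```

Tightening the target to 99.99% shrinks the budget to roughly 4.3 minutes per 30 days, which is a useful sanity check before committing to an extra nine.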

Express SLIs in PromQL and you get free alerting and dashboards:

# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Error budget exhausted: 30-day error ratio above the 0.1% a 99.9% SLO allows
1 - (
  sum(rate(http_requests_total{status!~"5.."}[30d]))
    / sum(rate(http_requests_total[30d]))
) > 0.001

Multi-window, multi-burn-rate alerts (Google's SRE workbook formulation) catch both fast outages and slow degradations without paging on every blip. They are the gold standard for SLO-based alerting.
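
The idea behind burn-rate alerting fits in a few lines. Burn rate is the observed error ratio divided by the ratio the SLO allows, so a burn rate of 1 spends exactly the whole budget over the SLO period. The sketch below assumes a 30-day period and uses 14.4 as the paging threshold, which is the SRE workbook's suggested starting point for a 1-hour long window paired with a 5-minute short window; treat both as assumptions to tune, and the function names as illustrative.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo)


def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the period's budget spent by burning at `rate` for the window."""
    return rate * window_hours / period_hours


def should_page(long_window_ratio, short_window_ratio, slo, threshold=14.4):
    """Page only if both the long and the short window exceed the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo) > threshold
            and burn_rate(short_window_ratio, slo) > threshold)


spent_in_one_hour = budget_consumed(14.4, window_hours=1)  # about 2% of the budget
ongoing = should_page(0.02, 0.02, slo=0.999)      # burning in both windows
recovered = should_page(0.02, 0.0005, slo=0.999)  # short window already healthy
```

A slower pair of windows with a lower threshold, routed to a ticket queue instead of a pager, covers the gradual degradations that the fast pair misses.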

The cultural value of SLOs is at least as important as the technical value: they replace "is it up?" arguments with "have we burned 60% of our budget this month?" — a shared, quantified question both engineers and product managers can reason about.

Alert hygiene — fewer, better pages

The fastest path to a useless on-call rotation is too many alerts. A few rules of thumb:

Every alert needs a runbook. If you cannot write three concrete steps the responder should take, the alert is not actionable.
Page only on user impact. "CPU is high" is not a page; "p99 latency >2s for 10 minutes" is.
Tickets, not pages, for slow burns. Disk filling at the current rate in three days is a ticket. Disk full now is a page.
Quarterly alert review. Delete every alert nobody acted on in the last quarter. They are training your team to ignore the pager.
Track MTTA and MTTR per alert. If acknowledgment is fast but resolution is slow, the runbook is bad. If acknowledgment is slow, the alert is probably noise.
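
The ticket-versus-page rule for disks can be sketched with the same idea as the predict_linear query shown earlier: fit a line to recent free-space samples and extrapolate. This is a simplified model that predicts relative to the last sample rather than the query evaluation time, and classify_disk is a hypothetical policy function, not part of any tool.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp_s, value) samples and extrapolate."""
    n = len(samples)
    if n < 2:
        return None  # a line needs at least two points
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


def classify_disk(samples, free_now):
    """Hypothetical policy: page when full, ticket when full within three days."""
    if free_now <= 0:
        return "page"
    predicted = predict_linear(samples, 3 * 24 * 3600)
    if predicted is not None and predicted < 0:
        return "ticket"
    return "ok"


# Free bytes sampled hourly; the second disk loses 2 GB per hour.
slow = [(0, 100e9), (3600, 99e9), (7200, 98e9)]
fast = [(0, 100e9), (3600, 98e9), (7200, 96e9)]
```

The slowly filling disk stays quiet, the faster one opens a ticket, and only a disk with no free space left pages someone.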

A team with twenty alerts they trust outperforms a team with two hundred alerts they ignore. Aggressive pruning is a feature, not a regression.

What good monitoring buys you

A well-tuned Prometheus stack tells you, within seconds, what is broken, where, and how badly — and routes that information to the human who can do something about it. It also tells you, just as importantly, when nothing is broken, so you can trust silence. That trust is what makes on-call sustainable.

The investment in dashboards and alert rules pays compounding interest. Every incident becomes a chance to add the panel or rule that would have caught it earlier — and the next time something similar happens, your monitoring catches it before users do.

A team that takes monitoring seriously will, over a year, accumulate a per-service playbook so detailed that almost any incident maps to a known dashboard panel and a known runbook. That is the goal — not a bigger pile of metrics, but a sharper, faster path from alert to root cause to fix. Prometheus and Grafana are just the substrate; the real work is the steady, deliberate curation of what you measure and how you respond.