Monitoring & Observability

Prometheus cheat sheet

Prometheus scrapes time-series metrics from targets over HTTP, stores them, and evaluates alerting rules against them.

Level: Advanced

On this page

Quick workflow, Workflows, Key concepts, Quick start, Common commands, Snippets, Troubleshooting, Pitfalls, Mini lab, Interview prompts, Official docs

Use this page for fast recall. Use Full documentation when you want the complete end-to-end path.

Quick workflow

A simple 5-step flow you can follow when using Prometheus in real work.

1) Setup

Run Prometheus and confirm the version (`prometheus --version`). Create a minimal `prometheus.yml` with a single scrape job.

2) Small change

Do one small action end-to-end to prove the workflow.

3) Validate

Check the Prometheus logs, the target status page, and the output of `promtool check config` so you catch mistakes early.

4) Automate

Convert it into a repeatable script or pipeline step.

5) Productionize

Add safeguards: secrets management, a rollback plan, monitoring of Prometheus itself, and documentation.

Workflows you will actually reuse

These are practical sequences you can copy into your own checklist or runbook.

Alerting that is not noise

Goal: Create alerts that map to user impact and are actionable.

- Start with SLO signals (latency, errors, saturation).

- Create recording rules for expensive queries.

- Add labels (for example `severity`) and a runbook link to every alert.

- Test alerts with controlled failures.

- Review alert volume and iterate.
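
One way to keep the "labels and runbooks" step honest is to check it mechanically. The sketch below takes the decoded JSON from the `/api/v1/rules` endpoint and reports alerting rules that lack a label or annotation you require. The function name `rules_missing_metadata`, the `severity` label, and the `runbook_url` annotation are assumptions chosen for illustration; substitute whatever convention your team uses.

```python
def rules_missing_metadata(payload, required_label="severity",
                           required_annotation="runbook_url"):
    """Return names of alerting rules that lack the required label or annotation.

    `payload` is the decoded JSON body of GET /api/v1/rules.
    The label and annotation names are assumed conventions, not Prometheus defaults.
    """
    missing = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") != "alerting":
                continue  # recording rules carry no runbook
            labels = rule.get("labels") or {}
            annotations = rule.get("annotations") or {}
            if required_label not in labels or required_annotation not in annotations:
                missing.append(rule["name"])
    return missing


# Illustrative payload, trimmed to the fields the function reads.
sample = {
    "status": "success",
    "data": {"groups": [{"name": "api", "rules": [
        {"type": "alerting", "name": "HighErrorRate",
         "labels": {"severity": "page"},
         "annotations": {"runbook_url": "https://runbooks.example.com/high-error-rate"}},
        {"type": "alerting", "name": "HighLatency",
         "labels": {},
         "annotations": {"summary": "p95 latency is high"}},
        {"type": "recording", "name": "job:http_requests:rate5m"},
    ]}]},
}

print(rules_missing_metadata(sample))  # ['HighLatency']
```

Running a check like this in a pipeline step makes missing runbooks visible before an alert ever fires.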

Key Concepts

- Scraping

- PromQL

- Alert rules

Learning path (high-level):

- Metric collection

- Query writing

- Alert design

Quick Start

Run Prometheus

Command

docker run -p 9090:9090 prom/prometheus

Open UI

URL

http://localhost:9090

Run query

Query

up
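
The same `up` query can be run outside the UI through the HTTP API at `/api/v1/query`. The sketch below is a minimal example, assuming Prometheus is listening on `http://localhost:9090` as in the Docker command above; the helper names `build_query_url`, `target_status`, and `fetch_up` are hypothetical, and the sample payload is illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def target_status(payload):
    """Map each `instance` label to True (up) or False (down).

    `payload` is the decoded JSON of an instant query for `up`.
    Simplification: instances that appear under several jobs overwrite each other.
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    status = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "<unknown>")
        _timestamp, value = series["value"]  # the sample value arrives as a string
        status[instance] = float(value) == 1.0
    return status


def fetch_up(base_url="http://localhost:9090"):
    """Run the `up` query against a live server (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, "up"), timeout=5) as resp:
        return target_status(json.load(resp))


# Illustrative response shape for an instant vector.
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"__name__": "up", "instance": "localhost:9090", "job": "prometheus"},
     "value": [1700000000.0, "1"]},
    {"metric": {"__name__": "up", "instance": "app:8080", "job": "app"},
     "value": [1700000000.0, "0"]},
]}}

print(target_status(sample))  # {'localhost:9090': True, 'app:8080': False}
```

Keeping the parsing in `target_status` separate from the network call makes the logic easy to test without a running server.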

Common Commands

Each command is followed by a short description of its practical intent.

Ops

promtool check config prometheus.yml

Validate Prometheus config.

Ops

promtool check rules alerts.yml

Validate alert rule files.

Ops

curl http://localhost:9090/-/ready

Check readiness endpoint.

Ops

curl http://localhost:9090/-/healthy

Check health endpoint.

Debug

curl http://localhost:9090/api/v1/targets

List scrape targets via API.

Debug

curl http://localhost:9090/api/v1/rules

List active rules via API.

Debug

curl http://localhost:9090/api/v1/alerts

List active alerts via API.

PromQL

up

PromQL: target is up (1) or down (0).

PromQL

rate(http_requests_total[5m])

PromQL: per-second request rate, averaged over the last 5 minutes.

PromQL

sum(rate(http_requests_total[5m])) by (status)

PromQL: per-second request rate, summed per `status` label value.

PromQL

histogram_quantile(0.95, sum(rate(request_duration_bucket[5m])) by (le))

PromQL: estimated p95 latency from histogram buckets.
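
To see why the p95 from a histogram is an estimate, it helps to sketch what `histogram_quantile` does with the `le` buckets: it finds the bucket that contains the requested rank and interpolates linearly inside it. The Python below is a simplified illustration, not the Prometheus implementation; it assumes positive bucket bounds and cumulative counts, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs and must
    include the +Inf bucket. Simplified sketch: assumes all finite bounds
    are positive, so the first bucket starts at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last with le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev_count = buckets[i - 1][1] if i > 0 else 0.0
    if count == prev_count:
        return upper  # empty bucket, nothing to interpolate
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Made-up buckets: 50 requests under 0.1s, 90 under 0.5s, 98 under 1s, 100 total.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))  # about 0.81, inside the 0.5s to 1s bucket
```

Because the result is interpolated within one bucket, its accuracy depends on how finely the bucket boundaries cover the latencies you care about.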

Copyable snippets

Small blocks you can drop into your terminal, config, or runbook.

Quickly list failing scrape targets (API)

bash

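# Requires jq; prints only the targets whose health is not "up".
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {scrapeUrl, lastError}'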
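
When you would rather script against the targets API than read raw JSON, the filtering can be sketched in Python. This is a minimal example, assuming Prometheus on `http://localhost:9090`; `failing_targets` and `fetch_failing_targets` are hypothetical helper names, and the sample payload is trimmed to the fields the code reads.

```python
import json
import urllib.request


def failing_targets(payload):
    """Return a summary of every active target whose health is not "up".

    `payload` is the decoded JSON body of GET /api/v1/targets.
    """
    failing = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":  # health is "up", "down", or "unknown"
            failing.append({
                "job": target.get("labels", {}).get("job"),
                "scrapeUrl": target.get("scrapeUrl"),
                "lastError": target.get("lastError"),
            })
    return failing


def fetch_failing_targets(base_url="http://localhost:9090"):
    """Query a live server for failing targets (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return failing_targets(json.load(resp))


# Illustrative payload with one healthy and one failing target.
sample = {"status": "success", "data": {"activeTargets": [
    {"labels": {"job": "prometheus"}, "scrapeUrl": "http://localhost:9090/metrics",
     "lastError": "", "health": "up"},
    {"labels": {"job": "app"}, "scrapeUrl": "http://app:8080/metrics",
     "lastError": "connection refused", "health": "down"},
], "droppedTargets": []}}

for target in failing_targets(sample):
    print(target["job"], target["scrapeUrl"], target["lastError"])
```

The `lastError` field is usually the fastest clue to whether the problem is DNS, network, or the `/metrics` endpoint itself.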

Troubleshooting checklist

When things break, follow this order to stay calm and move fast.

- If targets are down: check service discovery, network, and scrape endpoint `/metrics`.

- If queries are slow: use recording rules and reduce label cardinality.

- If alerts fire constantly: tune thresholds, add a `for` duration so brief spikes do not fire, and use multi-window burn-rate alerts.
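
The multi-window burn-rate idea from the last item can be sketched with a little arithmetic. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a page fires only when both a long and a short window exceed the threshold, so a problem that has already recovered stops paging quickly. The function names, the 99.9% SLO, and the 14.4 threshold (a common choice for a 1-hour window against a 30-day budget) are assumptions for illustration.

```python
def burn_rate(error_ratio, slo_target):
    """Return how many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target=0.999, threshold=14.4):
    """Page only when both windows burn budget faster than the threshold.

    The long window (for example 1h) shows the problem is significant;
    the short window (for example 5m) shows it is still happening.
    Default values are assumed, not prescribed by Prometheus.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# 2% errors in both windows against a 99.9% SLO: roughly a 20x burn, so page.
print(should_page(0.02, 0.02))   # True
# The long window is still elevated, but the short window has recovered.
print(should_page(0.02, 0.001))  # False
```

In Prometheus itself, the two error ratios would typically come from recording rules over different range windows, combined with `and` in the alert expression.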

Pitfalls

The common mistakes that slow people down when using Prometheus.

- Copy-pasting commands without understanding inputs/outputs and side effects.

- Not documenting defaults (ports, paths, credentials) and then getting stuck in prod.

- Skipping logs and metrics when troubleshooting; always collect evidence first.

Mini lab (practice)

Do these tasks in order. You will get hands-on experience with Prometheus instead of just reading about it.

- Run Prometheus locally (for example in Docker) and verify it works by running the `up` query in the UI.

- Create a minimal `prometheus.yml` and run `promtool check config` three times, making a small change each time.

- Break something on purpose and document how you debugged it in your notes.

Interview prompts

Use these to test if you truly understand the basics (and can explain them clearly).

- Explain the role Prometheus plays in a real delivery pipeline, from commit to production.

- Describe the most common failure you’ve seen with Prometheus and how you fixed it.

- What would you monitor and alert on for Prometheus itself in production?

Official Docs

https://prometheus.io/docs/introduction/overview/