What Is Prometheus?
Prometheus is an advanced-level DevOps tool for monitoring: it scrapes metrics from your services, stores them as time series, and evaluates alert rules against them. It helps teams standardize workflows and reduce manual effort.
Monitoring & Observability
Prometheus collects metrics and powers alerting in DevOps systems.
Level: Advanced
Teams use Prometheus to improve speed, reliability, and consistency. It reduces repetitive manual work, lowers failure risk, and makes collaboration easier across development and operations.
It closes the feedback loop in production by showing system behavior through metrics. Logs and traces come from complementary tools and are usually viewed alongside Prometheus data.
Start with core Prometheus concepts and basic setup so you can use it safely in day-to-day work.
- Understand Prometheus fundamentals
- Set up local/dev environment
- Run first working example
Integrate Prometheus into real team practices with repeatable conventions and collaboration patterns.
- Adopt standards and naming conventions
- Integrate with repositories and CI/CD
- Create reusable templates
Use Prometheus in production with observability, security, and rollback plans.
- Monitor behavior and failures
- Secure access and secrets
- Define incident and rollback flow
Continuously improve reliability, performance, and cost while standardizing usage across services.
- Improve performance and cost
- Automate compliance checks
- Document best practices for the team
- Scraping
- PromQL
- Alert rules
- Metric collection
- Query writing
- Alert design
- Incident detection and response
- Performance and reliability monitoring
- Root-cause analysis
- Read the Prometheus basics and terminology
- Run at least one hands-on mini project
- Break and fix a small setup to build confidence
- Document your first repeatable workflow
- Integrate Prometheus with your full delivery pipeline
- Add security and policy checks
- Add observability and incident playbooks
- Define reusable standards for multiple services
- Using defaults in production without security hardening
- Skipping monitoring and post-deployment validation
- No rollback strategy for failed changes
- Over-complex setup before mastering fundamentals
- Access control and least privilege applied
- Secrets managed securely
- Monitoring and alerting enabled
- Rollback and recovery process tested
- Documentation updated for team onboarding
Install Prometheus on host with practical commands and verification steps.
Install Prometheus package
sudo apt update && sudo apt install -y prometheus
Enable and start Prometheus
sudo systemctl enable --now prometheus
Verify target page
sudo systemctl status prometheus
curl -I http://localhost:9090
Run Prometheus
docker run -p 9090:9090 prom/prometheus
Open UI
http://localhost:9090
Run query
up
Simple command list with short descriptions.
promtool check config prometheus.yml: Validate Prometheus config.
promtool check rules alerts.yml: Validate alert rule files.
curl http://localhost:9090/-/ready: Check readiness endpoint.
curl http://localhost:9090/-/healthy: Check health endpoint.
curl http://localhost:9090/api/v1/targets: List scrape targets via API.
curl http://localhost:9090/api/v1/rules: List active rules via API.
curl http://localhost:9090/api/v1/alerts: List active alerts via API.
up: PromQL: target is up (1) or down (0).
rate(http_requests_total[5m]): PromQL: request rate over 5m.
sum(rate(http_requests_total[5m])) by (status): PromQL: rate by status.
histogram_quantile(0.95, sum(rate(request_duration_bucket[5m])) by (le)): PromQL: p95 latency from histogram.
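The targets endpoint in the list above returns JSON, so its output is easy to summarize in a script. The following is a minimal sketch, assuming Prometheus listens on localhost:9090 as in the commands above. The function names and the trimmed SAMPLE payload are hypothetical and exist only so the sketch runs without a live server.

```python
import json
import urllib.request
from collections import Counter

# Assumed address, matching the localhost examples above.
PROMETHEUS_URL = "http://localhost:9090"


def fetch_targets(base_url=PROMETHEUS_URL):
    """GET /api/v1/targets and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Count active targets by their health value (for example "up" or "down")."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")
    active = payload["data"]["activeTargets"]
    return dict(Counter(t["health"] for t in active))


def unhealthy_targets(payload):
    """Return (scrapeUrl, lastError) for every active target that is not up."""
    return [
        (t["scrapeUrl"], t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
        if t["health"] != "up"
    ]


# Hypothetical, trimmed sample shaped like the API response (real responses
# carry more fields per target), so the sketch can run offline.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://localhost:9090/metrics", "health": "up", "lastError": ""},
            {"scrapeUrl": "http://localhost:8000/metrics", "health": "up", "lastError": ""},
            {"scrapeUrl": "http://localhost:9100/metrics", "health": "down",
             "lastError": "connection refused"},
        ],
        "droppedTargets": [],
    },
}

if __name__ == "__main__":
    print(summarize_targets(SAMPLE))
    print(unhealthy_targets(SAMPLE))
```

Against a running server, pass the result of fetch_targets() instead of SAMPLE.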
Official documentation:
https://prometheus.io/docs/introduction/overview/
A full, structured guide for this tool (with commands, diagrams, best practices, and learning path).
A complete DevOpsLabX guide for Prometheus: what it is, why we use it, key concepts, commands, best practices, and how to learn it.
Prometheus Workflow (diagram): a visual mental model of how Prometheus fits into a typical workflow. This diagram is a practical mental model, not vendor-specific.
Production Reference Flow (diagram): a production-oriented view of guardrails, checks, and the parts that matter when it breaks. This diagram is a practical mental model, not vendor-specific.
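Scraping is how Prometheus collects data: at a regular interval it pulls metrics over HTTP from each configured target. It is a core idea you'll use repeatedly while working with Prometheus.
Why it matters: Understanding scraping helps you design safer workflows and troubleshoot issues faster, because a target that is not scraped produces no data to query or alert on.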
Practice:
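To see what a scrape actually fetches, it helps to produce the payload yourself. The sketch below renders one counter in the Prometheus text exposition format and wraps it in a tiny /metrics handler built on the Python standard library. The metric values, the port, and the names render_metrics and serve are assumptions for illustration, and label values are not escaped, so it is a teaching aid rather than a replacement for a client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter value.
    Assumes label values contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical in-memory counter values.
REQUESTS = {
    (("method", "get"), ("status", "200")): 42,
    (("method", "get"), ("status", "500")): 3,
}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; anything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(
            "http_requests_total", "Total HTTP requests.", REQUESTS
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("localhost", port), MetricsHandler).serve_forever()


if __name__ == "__main__":
    # Print the payload instead of starting the server; call serve() to expose it.
    print(render_metrics("http_requests_total", "Total HTTP requests.", REQUESTS))
```

Calling serve() and adding localhost:8000 as a scrape target in prometheus.yml should make the counter queryable as http_requests_total.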
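PromQL is the Prometheus query language: you use it to select, filter, and aggregate time series for graphs and alert rules. It is a core idea you'll use repeatedly while working with Prometheus.
Why it matters: Understanding PromQL helps you design safer workflows and troubleshoot issues faster, because every dashboard panel and alert condition is a PromQL expression.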
Practice:
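One way to build intuition for queries such as rate(http_requests_total[5m]) is to compute a simplified version by hand. The sketch below is not Prometheus's implementation: it handles counter resets but skips the extrapolation to the window edges that the real rate() performs, so results can differ slightly. The function name simple_rate is hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    Simplified relative to PromQL's rate(): counter resets are handled,
    but there is no extrapolation to the edges of the time window.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example, a process restart),
        # so the new value is counted as an increase from zero.
        increase += (cur - prev) if cur >= prev else cur
    duration = samples[-1][0] - samples[0][0]
    return increase / duration if duration > 0 else None


if __name__ == "__main__":
    # Samples every 15 seconds; the counter resets between t=30 and t=45.
    window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
    print(simple_rate(window))
```

In the sample window the counter grows by 30 + 30 + 10 + 30 = 100 over 60 seconds, so the sketch reports roughly 1.67 requests per second despite the reset.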
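Alert rules are PromQL expressions that Prometheus evaluates on a schedule; a rule becomes active when its expression returns results. They are a core idea you'll use repeatedly while working with Prometheus.
Why it matters: Understanding alert rules helps you design safer workflows and troubleshoot issues faster, because a well-scoped rule is what turns a metric into a timely, actionable notification.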
Practice:
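An alert rule can carry a "for" duration, which keeps the alert pending until its condition has held continuously for that long. The sketch below models that behavior as a small state machine. It is a simplified illustration rather than Prometheus's evaluator, and the function name alert_states and the sample timeline are hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true) in time order.
    for_seconds: how long the condition must hold before the alert fires.
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for ts, condition in evaluations:
        if not condition:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # One evaluation per minute; the condition flaps once at t=120.
    timeline = [(0, True), (60, True), (120, False),
                (180, True), (240, True), (300, True)]
    print(alert_states(timeline, for_seconds=120))
```

The brief recovery at t=120 resets the timer, so the alert only fires at t=300 after the condition has held for the full 120 seconds. This is why a "for" duration filters out short spikes.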
A tutorial-style sequence (like a handbook). Do these in order to build skill from beginner to production.
Goal: Create signals that help you debug incidents faster.
Goal: Make debugging cross-service requests simpler.
Use these templates to make your docs feel like real production documentation.
Too many alerts and the team ignores them
Likely cause: Alerting on causes not symptoms; thresholds too sensitive
Fix steps:
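One possible fix is to alert on a user-visible symptom, such as the share of failing requests, and to ignore periods with too little traffic to judge. The sketch below assumes per-status request rates like those returned by the "rate by status" query. The 5% threshold, the 1 request-per-second traffic floor, and the function names are hypothetical values to tune for your service.

```python
def error_ratio(rates_by_status):
    """Share of requests with a 5xx status.

    rates_by_status maps a status code string to requests per second.
    """
    total = sum(rates_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(
        rate for status, rate in rates_by_status.items() if status.startswith("5")
    )
    return errors / total


def should_page(rates_by_status, threshold=0.05, min_rps=1.0):
    """Page only when traffic is meaningful and the error share is too high."""
    total = sum(rates_by_status.values())
    return total >= min_rps and error_ratio(rates_by_status) > threshold


if __name__ == "__main__":
    print(should_page({"200": 90.0, "500": 10.0}))  # busy service, many errors
    print(should_page({"200": 0.1, "500": 0.1}))    # almost no traffic
```

Combining a symptom-based condition like this with a hold duration in the alert rule tends to reduce noisy pages, although the right thresholds depend on the service.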
Prometheus is used to collect metrics, query them with PromQL, and alert on them, so teams can detect problems early and ship faster and more reliably.
You can get productive in days with fundamentals, but production mastery comes from building workflows, debugging failures, and operating it over time.
Learn basic Linux + Git first, then follow the prerequisites section. Fundamentals make every advanced topic easier.
Add guardrails: least privilege, validation before apply/deploy, monitoring, and a tested rollback plan.