Kubernetes for DevOps: Pods, Deployments, Services, and Day-2 Operations

Understand the core objects you will touch every day on a real cluster, plus the operational gotchas no tutorial mentions.

By Admin · 5/28/2026

Figure: Kubernetes cluster with a control plane (apiserver, etcd, scheduler, controller-manager) and worker nodes running pods.

Kubernetes is a container orchestrator. It schedules workloads across a fleet of nodes, restarts crashed processes, exposes them on the network, and rolls out new versions — all declaratively. You describe the desired state in YAML, the platform converges to it, and it keeps converging when nodes fail or load shifts.

The downside is conceptual surface area. There are dozens of object types, three or four ways to do most things, and a long tail of "you only learn this when something breaks" details. This guide covers the core objects you will use daily, and the day-2 operational patterns that separate a stable cluster from a noisy one.

Architecture in one paragraph

A Kubernetes cluster has a control plane (kube-apiserver, etcd, scheduler, controller-manager) and worker nodes running kubelet and a container runtime (containerd or CRI-O). The apiserver is the only component that talks to etcd. Controllers watch the apiserver for changes to objects and reconcile state. The scheduler decides which node a new pod should land on. kubelet on each node ensures the pods assigned to it are actually running.

Everything you will do as a user — kubectl apply, kubectl logs, kubectl exec — is mediated by the apiserver. There is no shortcut around it.

Pods — the unit of scheduling

A Pod is one or more containers that share a network namespace and volumes. They are co-scheduled and co-located. In practice, you almost never create raw Pods; you let a higher-level controller (a Deployment, StatefulSet, DaemonSet, or Job) manage them for you. Raw pods do not get rescheduled when a node dies.

Multi-container pods are useful for tightly coupled helpers — a sidecar that ships logs, an init container that downloads a model file before the main container starts. Resist using them as a substitute for a real service mesh or message bus.
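
As a rough illustration of the helpers described above, here is a sketch of a pod with one init container and one sidecar. The image names, the download URL, and the paths are all hypothetical, and a real log shipper would need its own configuration. It is shown as a bare Pod for brevity; in practice this would be the template of a Deployment.

```yaml
# Sketch only: images, URL, and paths are placeholders, not real artifacts.
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  volumes:
    - name: models
      emptyDir: {}            # scratch volume shared by the containers in this pod
    - name: logs
      emptyDir: {}
  initContainers:
    - name: fetch-model       # must exit 0 before the app containers are started
      image: registry.example.com/tools/curl:1.0
      command: ["curl", "-fsSL", "-o", "/models/model.bin", "https://models.example.com/model.bin"]
      volumeMounts:
        - { name: models, mountPath: /models }
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts:
        - { name: models, mountPath: /models, readOnly: true }
        - { name: logs, mountPath: /var/log/app }
    - name: log-shipper       # sidecar: reads what the app writes to the shared volume
      image: registry.example.com/tools/log-shipper:1.0
      volumeMounts:
        - { name: logs, mountPath: /var/log/app, readOnly: true }
```

All containers in the pod share one IP address, so the sidecar could also reach the app on localhost.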

Deployments — declarative rollouts

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxSurge: 1, maxUnavailable: 0 }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: myapp:1.0
          ports: [{ containerPort: 3000 }]
          resources:
            requests: { cpu: "100m", memory: "128Mi" }
            limits:   { cpu: "500m", memory: "512Mi" }
          livenessProbe:
            httpGet: { path: /healthz, port: 3000 }
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /ready, port: 3000 }
            periodSeconds: 5

Run kubectl apply -f web.yaml and the Deployment controller will keep three healthy replicas running. Push a new image tag, re-apply, and it performs a rolling update bounded by the maxSurge and maxUnavailable settings. kubectl rollout undo deployment/web rolls back to the previous revision (the pod template of the prior ReplicaSet), a feature many teams discover only during incidents.

Anatomy of a rolling update

Figure: Rolling update progression. Surge a new pod, mark it Ready, terminate an old pod, and repeat until all replicas are on the new ReplicaSet.

The diagram above shows what actually happens behind kubectl set image or a re-apply that bumps an image tag.

Internally, every Deployment owns one or more ReplicaSets. The active ReplicaSet keeps the number of healthy pods at the value of spec.replicas. When you change the pod template (a new image tag, a different env var), the Deployment controller creates a new ReplicaSet for the new template and starts a controlled handoff between the two.

Walking through the timeline in the diagram with replicas: 4, maxSurge: 1, maxUnavailable: 0:

  1. t0. Four v1 pods are Ready; the Service routes traffic to all four.
  2. t1. The new ReplicaSet creates one v2 pod (this is the "surge" — we now have 5 pods total). Traffic still goes only to the four v1 pods. Once the new pod's readiness probe succeeds, the Service starts sending it traffic.
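  3. t2. With one v2 pod Ready, the old ReplicaSet is scaled down by one and a v1 pod is signaled to terminate. That frees a slot under the surge ceiling of 5, so the new ReplicaSet creates the next v2 pod; the controller does not wait for the terminating v1 pod to exit.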
  4. t3. The handoff repeats until all four pods are v2. The old ReplicaSet is scaled to 0 but kept around for rollback.
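
The bounds in this walkthrough come straight from the strategy block. A minimal sketch of the relevant fragment, with the arithmetic as comments (replicas: 4 matches the timeline here, not the earlier web manifest):

```yaml
# Fragment of a Deployment spec; values match the walkthrough above.
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # ceiling: 4 + 1 = 5 pods may exist at any moment
      maxUnavailable: 0   # floor:   4 - 0 = 4 pods must stay available
```

Both fields also accept percentages of replicas. maxSurge rounds up, maxUnavailable rounds down, and both default to 25% when the strategy block is omitted.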

Two subtleties matter for production:

  • Readiness probes are the heartbeat of a safe rollout. Without one, a pod counts as Ready as soon as its containers start, so a pod that is still loading caches or opening DB pools will receive requests and answer them with 5xx errors. With maxUnavailable: 0, the controller will not terminate an old pod until a new one is Ready, but that guarantee only means something if you actually configured the probe.
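  • PreStop hooks and termination grace. When a pod is signaled to terminate, it is removed from the Service's endpoints, but that removal propagates asynchronously: for a short window the container may already have received SIGTERM while nodes still route new requests to it, and in-flight requests need time to complete. Configure a preStop sleep (5–10 seconds) and a terminationGracePeriodSeconds long enough to cover that sleep plus your slowest in-flight request; otherwise users see connection resets at every rollout.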
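
The preStop and grace-period advice might look like the fragment below. This is a sketch: the 10-second sleep and 45-second grace period are assumed values, and the exec form assumes the image ships a sleep binary.

```yaml
# Fragment of a Deployment's pod template (spec.template.spec).
spec:
  terminationGracePeriodSeconds: 45   # assumed: 10s preStop sleep + ~30s slowest request + margin
  containers:
    - name: web
      image: myapp:1.0
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent to the container, keeping it serving
            # while the endpoint removal propagates through the cluster.
            command: ["sleep", "10"]
```

The grace period clock starts when termination begins, so the preStop sleep counts against it. Once the period expires, the container is killed with SIGKILL.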

For risky changes you can also use kubectl rollout pause deployment/web to halt midway, observe metrics, then kubectl rollout resume (or undo) based on what you see. Combined with a metrics dashboard scoped to the new ReplicaSet's labels, this is a poor-man's progressive delivery without bringing in Argo Rollouts or Flagger.

spec.revisionHistoryLimit on the Deployment controls how many old ReplicaSets are retained for rollback. The default of 10 is usually fine; lower it on clusters with thousands of Deployments to reduce the number of objects stored in etcd.

Services — stable network identity

Pods come and go; their IPs change. A Service gives a set of pods a stable virtual IP (the ClusterIP) and a DNS name inside the cluster:

apiVersion: v1
kind: Service
metadata: { name: web }
spec:
  selector: { app: web }
  ports: [{ port: 80, targetPort: 3000 }]

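Inside the cluster, any pod in the same namespace can hit http://web and reach a Ready replica; from another namespace, use http://web.<namespace>. Behind the scenes, kube-proxy programs iptables or IPVS rules on each node (some CNI plugins, such as Cilium, replace it with an eBPF dataplane) so traffic to the ClusterIP is load-balanced to the pods that match the selector.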

For external traffic you typically use an Ingress (HTTP/HTTPS routing handled by an ingress controller like NGINX or Traefik) or, in cloud environments, a Service of type LoadBalancer that the cloud provider wires up to a cloud load balancer.
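
A sketch of both options for the web Service, assuming an NGINX ingress controller is installed under the class name nginx. The hostname and the web-public name are hypothetical.

```yaml
# Option 1: HTTP routing through an ingress controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx          # assumed class name of the installed controller
  rules:
    - host: web.example.com        # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web          # the ClusterIP Service shown earlier
                port:
                  number: 80
---
# Option 2: let the cloud provider provision a load balancer.
apiVersion: v1
kind: Service
metadata: { name: web-public }
spec:
  type: LoadBalancer
  selector: { app: web }
  ports: [{ port: 80, targetPort: 3000 }]
```

An Ingress lets many services share one load balancer and one TLS termination point. A LoadBalancer Service gives each service its own, which is simpler but usually costs more.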

Configuration — ConfigMaps and Secrets

Bake nothing environment-specific into your image. Inject configuration at runtime with ConfigMaps and Secrets:

apiVersion: v1
kind: ConfigMap
metadata: { name: web-config }
data:
  LOG_LEVEL: "info"
  FEATURE_FLAGS: "search,beta-checkout"
---
apiVersion: v1
kind: Secret
metadata: { name: web-secrets }
type: Opaque
stringData:
  DATABASE_URL: "postgres://app:****@db/app"

Mount them as environment variables or files in the pod spec. Secrets are only base64-encoded and are not encrypted at rest by default. If you handle real production credentials, enable an encryption provider in the apiserver, or keep the credentials in an external secret manager and sync them in with a tool like External Secrets Operator.
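
One way to wire the two objects above into the web container, sketched as a pod template fragment. The /etc/web mount path is an assumption.

```yaml
# Fragment of a pod template (spec.template.spec).
containers:
  - name: web
    image: myapp:1.0
    envFrom:
      - configMapRef: { name: web-config }    # every key becomes an env var
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef: { name: web-secrets, key: DATABASE_URL }
    volumeMounts:
      - { name: config, mountPath: /etc/web, readOnly: true }
volumes:
  - name: config
    configMap: { name: web-config }           # each key becomes a file under /etc/web
```

Environment variables are read once at container start, so changing the ConfigMap requires a pod restart to take effect. Mounted files are refreshed in place after a short delay.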

Storage: PVs, PVCs, and StorageClasses

Most workloads are stateless, but the ones that are not (databases, caches with persistence, stateful queues) need durable storage. Kubernetes's storage model has three pieces:

  • A PersistentVolume (PV) is a piece of storage in the cluster — an EBS volume, a GCE persistent disk, an NFS export. PVs can be provisioned manually or dynamically.
  • A PersistentVolumeClaim (PVC) is a request for storage from a pod. The PVC specifies size, access mode (ReadWriteOnce, ReadOnlyMany, ReadWriteMany), and optionally a StorageClass.
  • A StorageClass describes how to dynamically provision new PVs of a particular flavor (gp3 SSD on AWS, balanced disk on GCE). The cluster's storage controller fulfills PVCs by creating new PVs from the requested StorageClass.

A typical pattern: a StatefulSet declares an entry under volumeClaimTemplates. When pod db-0 is created, the cluster creates a PVC for it, which triggers dynamic provisioning of a PV via the StorageClass. The PV stays bound to that PVC for the lifetime of the pod's identity: even if the pod is rescheduled to a different node, it remounts the same volume. Delete the StatefulSet and the PVCs (and underlying PVs) remain by default, so you do not accidentally wipe a database by misclicking.
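
A minimal sketch of that pattern follows. The image, the mount path, the fast-ssd StorageClass name, and the headless Service named db are all assumptions made for illustration.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db                    # assumed: a headless Service named "db" exists
  replicas: 1
  selector:
    matchLabels: { app: db }
  template:
    metadata:
      labels: { app: db }
    spec:
      containers:
        - name: db
          image: mydb:1.0            # hypothetical database image
          volumeMounts:
            - { name: data, mountPath: /var/lib/data }
  volumeClaimTemplates:
    - metadata: { name: data }       # yields a PVC named data-db-0 for pod db-0
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd   # hypothetical StorageClass name
        resources:
          requests: { storage: 20Gi }
```

The PVC name is derived from the template name and the pod name, which is how a recreated db-0 finds the same claim again.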

For production data, set the StorageClass reclaimPolicy: Retain so deleted PVCs leave the underlying volume in place for human recovery. Snapshots and backups still belong to your application, not to Kubernetes — the platform keeps the disk alive, but a poisoned schema migration is your problem to roll back.
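
A StorageClass with that policy could look like the sketch below. It assumes the AWS EBS CSI driver is installed; the class name is hypothetical, and other clouds use a different provisioner and parameters.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain                          # hypothetical name
provisioner: ebs.csi.aws.com                # assumed: AWS EBS CSI driver
parameters:
  type: gp3
reclaimPolicy: Retain                       # keep the backing volume when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer     # provision in the zone where the pod is scheduled
allowVolumeExpansion: true
```

Dynamically provisioned volumes default to reclaimPolicy: Delete, so the Retain setting has to be explicit. Retained volumes also need manual cleanup once the data is no longer needed.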

Day-2 essentials

Getting a service running is the easy part. Keeping it healthy under load and during failures is where Kubernetes earns its keep, but only if you actually configure the relevant features.

  • Liveness and readiness probes: let the platform restart unhealthy pods and stop sending traffic to ones that are still warming up. An overly aggressive liveness probe will kill pods during legitimate slow paths; tune the thresholds.
  • Resource requests and limits: requests drive scheduling; limits cap what a container can use at runtime (CPU above the limit is throttled, memory above the limit gets the container OOM-killed). A pod with no requests can be scheduled anywhere and starve neighbors. A pod with no memory limit can exhaust the node's memory and get unrelated pods evicted or OOM-killed.
  • HorizontalPodAutoscaler: scale deployments on CPU, memory, or custom metrics (for example, Prometheus metrics exposed through a metrics adapter).
  • PodDisruptionBudgets: guarantee a minimum number of replicas stay available during voluntary disruptions like node drains.
  • NetworkPolicies: by default every pod can reach every other pod. NetworkPolicies let you implement least privilege at the network layer, provided your CNI plugin enforces them.
  • Namespaces: use them to isolate teams or environments and apply ResourceQuotas and LimitRanges per namespace.
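
Three of the items above, sketched for the web Deployment. The replica bounds, the 70% target, and the ingress-nginx namespace are assumed values; adjust them to your workload.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: web }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, so requests must be set
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: web }
spec:
  minAvailable: 2                  # a node drain may not take web below 2 available pods
  selector:
    matchLabels: { app: web }
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: web-allow-ingress }
spec:
  podSelector:
    matchLabels: { app: web }
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumed controller namespace
      ports:
        - { protocol: TCP, port: 3000 }
```

Once a NetworkPolicy selects a pod, all ingress not explicitly allowed is dropped, so the third object limits web to traffic from the ingress controller's namespace.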

Observability

A Kubernetes cluster without metrics is a black box. The minimum viable stack:

  • kube-state-metrics and metrics-server for cluster-state metrics and pod resource usage.
  • A Prometheus instance scraping both the cluster metrics and your apps' /metrics endpoints.
  • A log aggregation pipeline (Loki, Fluent Bit → Elasticsearch, Datadog, etc.) collecting container stdout.
  • A Grafana instance with dashboards for cluster health, namespace utilization, and per-service golden signals.

Without these, debugging a slow service means kubectl logs whack-a-mole.
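
For the Prometheus piece, a scrape job might look like this fragment of prometheus.yml. It assumes a self-managed Prometheus running in the cluster with permission to list pods, and it relies on the prometheus.io/scrape annotation, which is a widely used convention rather than anything built into Kubernetes.

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                  # discover every pod through the apiserver
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy namespace and pod name onto every scraped series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

If you run the Prometheus Operator instead, the same intent is usually expressed with ServiceMonitor or PodMonitor objects.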

Common pitfalls

  • No resource requests on workloads: pods with no requests or limits land in the BestEffort QoS class and are evicted first under node pressure.
  • Using latest image tags: rollouts become non-reproducible, and kubectl rollout undo does not work as expected because the previous revision points at the same mutable tag.
  • Putting state in a Deployment instead of a StatefulSet: pod renames and missing stable network identity will bite you.
  • Sharing one ServiceAccount across workloads (or leaving everything on the namespace's default ServiceAccount): least privilege evaporates.
  • Manual edits to live resources with kubectl edit: your YAML in git is no longer the source of truth.

Debugging when things go wrong

When a pod is unhappy, work top-down through these commands:

kubectl get pods -n <ns>                       # status, restarts, age
kubectl describe pod <pod> -n <ns>             # events, scheduling, probe results
kubectl logs <pod> -n <ns> --previous          # logs from the last crashed container
kubectl exec -it <pod> -n <ns> -- sh           # shell inside the container
kubectl get events -n <ns> --sort-by=.lastTimestamp
kubectl top pod -n <ns>                        # live CPU/mem (needs metrics-server)

describe and events solve most "why is this pod pending" or "why did it restart" questions before you ever need to look at logs. A pending pod with a "0/3 nodes available" event usually means resource requests cannot be satisfied; the rest of the message names the actual cause (insufficient CPU or memory, an untolerated taint, an unmatched node selector or affinity rule). A CrashLoopBackOff with a non-zero exit code under Last State in the describe output is your application crashing on startup, usually because of a misconfigured environment variable or a missing dependency; the --previous logs show what it printed before it died.

A simple operational rhythm

  1. Define every workload in YAML in a git repo.
  2. Apply via a GitOps tool (Argo CD, Flux) so the cluster always matches git.
  3. Promote between environments by changing image tags in a values file, not by re-running scripts.
  4. Watch the four golden signals (latency, traffic, errors, saturation) per service.
  5. Run a monthly chaos drill — drain a node, kill a pod, fail a control plane component — and make sure your alerts fire and your runbooks work.
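
Step 2 could be expressed with an Argo CD Application like the sketch below. The repository URL, path, and target namespace are hypothetical, and it assumes Argo CD is installed in the argocd namespace.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd                  # assumed: Argo CD's own namespace
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git   # hypothetical repo
    targetRevision: main
    path: apps/web/overlays/prod     # hypothetical path inside the repo
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: web
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from git
      selfHeal: true   # revert drift, such as changes made with kubectl edit
```

With selfHeal enabled, a manual change to a live object is overwritten on the next sync, which keeps git as the source of truth.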

These five primitives — Pods, Deployments, Services, ConfigMaps, and probes — plus a disciplined GitOps workflow cover roughly 80% of what platform engineers configure on a typical cluster. The remaining 20% is the long tail you learn one incident at a time.

If you are starting from zero, resist the urge to learn everything at once. Build a small Deployment, expose it with a Service, add probes, add resource requests, then move to ConfigMaps and Secrets. Each step builds on the previous one and gives you a working system you can prod. The "learn the entire object catalog before you ship anything" approach almost always stalls.
