Kubernetes Deployment Rollout Stuck / ProgressDeadlineExceeded

Why this matters

A stuck rollout means the new version of your application never becomes healthy. If you don't catch it quickly, traffic may be served by a half-updated fleet, or the rollout may stall silently during an incident, leaving the fix you meant to ship undeployed.

Warning: Do not delete pods manually to unblock a rollout; fix the Deployment spec or roll back instead.

Symptoms

  • kubectl rollout status deployment/<name> hangs, or exits with an error saying the Deployment exceeded its progress deadline.
  • Some pods for the Deployment are Pending, CrashLoopBackOff, or failing probes.
  • New pods are never marked Ready, or the number of Ready replicas never reaches the desired count.
  • Services or ingress still route part of the traffic to old replicas, and any HPA is scaling a mixed old/new fleet.
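
A quick CLI cross-check for these symptoms (names and numbers below are illustrative; replace the angle-bracket placeholders):

kubectl get deployment <name> -n <namespace>
# NAME   READY   UP-TO-DATE   AVAILABLE   AGE
# web    2/5     3            2           42m
# READY stuck below the desired count, and UP-TO-DATE short of it, both point at a stalled rollout.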

Common root causes

  • Broken image (application fails to start, missing dependency).
  • Liveness/readiness probes misconfigured for the new version.
  • Resource requests too high for available nodes, leaving pods unschedulable.
  • PodDisruptionBudget, affinity/anti-affinity, or node selectors preventing enough replicas from running.
  • Networking/DNS issues preventing the app from reaching required backends.
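
The deadline that trips this state is the Deployment's progressDeadlineSeconds field (600 seconds by default). If a legitimately slow-starting app is the root cause, you can inspect and raise it; the 900 below is only an example, and spec changes are best made in source-controlled manifests:

# Read the current deadline (defaults to 600 seconds):
kubectl get deployment <name> -n <namespace> -o jsonpath='{.spec.progressDeadlineSeconds}'

# Raise it for a slow-starting app:
kubectl patch deployment <name> -n <namespace> -p '{"spec":{"progressDeadlineSeconds":900}}'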

How KubeGraf helps

  • Shows rollout status visually: desired vs updated vs available replicas.
  • Highlights pods that are blocking progress (probe failures, CrashLoopBackOff, Pending with scheduling errors).
  • Exposes Events attached to pods and the Deployment in one place.
  • Lets you inspect the spec diff between the previous ReplicaSet and the new one.

Step-by-step using KubeGraf UI

1. Confirm the rollout is stuck

kubectl rollout status deployment/<name> -n <namespace>

Note any message like error: deployment "<name>" exceeded its progress deadline. Then:

  • Open KubeGraf and select the correct cluster and namespace.
  • Verify the Deployment name matches what you checked with kubectl.
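
To confirm the controller has actually given up rather than just being slow, check the Deployment's conditions; the output below is illustrative:

kubectl describe deployment <name> -n <namespace> | grep -A 6 Conditions
# Conditions:
#   Type           Status  Reason
#   ----           ------  ------
#   Available      False   MinimumReplicasUnavailable
#   Progressing    False   ProgressDeadlineExceeded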

2. Open the Deployment view

  • In KubeGraf, go to Deployments and select the affected Deployment.
  • Check the summary: desired, updated, and available/Ready replicas.
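
The same counters are visible per ReplicaSet from the CLI, which makes the blocker obvious. This assumes your pods carry an app label; adjust the selector to match, and note the names and numbers are illustrative:

kubectl get rs -n <namespace> -l app=<label>
# NAME            DESIRED   CURRENT   READY   AGE
# web-7d4b9c8f6   3         3         3       2d     <- previous, still serving
# web-5f9d7b6c4   2         2         0       15m    <- new, never becomes Ready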

3. Identify blocking pods

  • From the Deployment details, open the linked ReplicaSets and Pods.
  • Look for pods in Pending, CrashLoopBackOff, or with failing readiness probes.
  • Use filters to narrow the list to non-Ready pods.
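
A CLI cross-check, under the same app-label assumption as above (newest pods sort last):

kubectl get pods -n <namespace> -l app=<label> --sort-by=.metadata.creationTimestamp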

4. Inspect Events for scheduling or probe issues

  • On a problematic pod, open Events.
  • Look for messages such as (exact wording varies by Kubernetes version):
      0/3 nodes are available: 3 Insufficient cpu.
      Readiness probe failed: HTTP probe failed with statuscode: 500
      FailedScheduling: 0/3 nodes are available: 3 node(s) had untolerated taint ...
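
The same Events are available via kubectl if you need to paste them into an incident channel:

# Events for a single pod appear at the end of the describe output:
kubectl describe pod <pod-name> -n <namespace>

# Or list all recent warnings in the namespace, oldest first:
kubectl get events -n <namespace> --field-selector type=Warning --sort-by=.lastTimestamp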

5. Compare new vs previous ReplicaSet spec

  • In the Deployment view, open the history / ReplicaSets panel.
  • Compare the new ReplicaSet to the previous one:
      • Image tag.
      • Resource requests/limits.
      • Probes (paths, ports, thresholds).
      • Env vars and config references.
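
If you prefer to diff outside the UI, dump both ReplicaSets and compare them; <old-rs> and <new-rs> are the ReplicaSet names shown in the panel (or by kubectl get rs):

kubectl get rs <old-rs> -n <namespace> -o yaml > old-rs.yaml
kubectl get rs <new-rs> -n <namespace> -o yaml > new-rs.yaml
# Focus on spec.template: image, resources, probes, env, and config references.
diff old-rs.yaml new-rs.yaml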

6. Decide: roll back vs fix forward

  • If the change is clearly broken and you need fast recovery:
      kubectl rollout undo deployment/<name> -n <namespace>
      kubectl rollout status deployment/<name> -n <namespace>
  • Watch in KubeGraf as pods for the previous ReplicaSet return to Ready.
  • If you can fix forward (e.g. adjust probe or config), update the spec via code and apply through CI/GitOps.
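
Before undoing, you can list revisions to confirm where you will land (revision numbers are illustrative):

kubectl rollout history deployment/<name> -n <namespace>
# REVISION  CHANGE-CAUSE
# 2         <none>
# 3         <none>

# Undo to a specific revision rather than just the previous one:
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=2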

7. Verify impact at the service level

  • In KubeGraf, move to the Topology or Services view.
  • Confirm all Service endpoints backing the Deployment are Ready, with no backends stuck in NotReady.
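
The CLI equivalent, where <service> is whatever Service fronts this Deployment:

# Addresses listed under ENDPOINTS are Ready backends; NotReady pods are excluded:
kubectl get endpoints <service> -n <namespace>

# The full object also lists notReadyAddresses explicitly:
kubectl get endpoints <service> -n <namespace> -o yaml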

What to check next

  • Are other Deployments rolling out at the same time on the same nodes (resource contention)?
  • Is there a cluster-wide issue (node pressure, CNI problems) reflected in Events?
  • Are there PDBs or policies that restrict how many pods can be unavailable during rollout?
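
Quick checks for the questions above (kubectl top requires metrics-server to be installed):

# PodDisruptionBudgets that may limit how many pods can be down at once:
kubectl get pdb -n <namespace>

# Node pressure and capacity at a glance:
kubectl top nodes
kubectl describe nodes | grep -A 7 "Allocated resources"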

Common mistakes

  • Focusing only on the Deployment object and ignoring pod-level Events.
  • Forgetting to check node-level constraints when pods are Pending.
  • Rolling back image but leaving an incompatible probe or config in place.
  • Manually deleting pods to "unstick" a rollout instead of fixing the spec.

Related issues

Expected outcome

After following this playbook you should:

  • Understand why the rollout is stalled (scheduling, crash, probes, or config).
  • Either roll back safely or apply a corrected spec that converges to the desired Ready replicas.
  • Be able to monitor future rollouts of this Deployment in KubeGraf with clear visibility into progress.
[ TODO: screenshot showing KubeGraf Deployment view with rollout status and blocking pods highlighted. ]