Kubernetes Deployment Rollout Stuck / ProgressDeadlineExceeded

Why this matters

A stuck rollout means the new version of your application never becomes healthy. If you don't catch it quickly, traffic may be served by a half-updated fleet, or the rollout may stall silently during an incident, leaving the fix you meant to ship undeployed.

Warning: Do not delete pods manually to unblock a rollout; fix the Deployment spec or roll back instead.

Symptoms

  • kubectl rollout status deployment/<name> hangs, or exits with an error saying the Deployment exceeded its progress deadline.
  • Some pods for the Deployment are Pending, CrashLoopBackOff, or failing probes.
  • New pods are never marked Ready, or the number of Ready replicas never reaches the desired count.
  • Services or ingress still route part of the traffic to old replicas, and any HPA is scaling a mixed old/new fleet.
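
A quick CLI cross-check for these symptoms (names and numbers below are illustrative; replace the angle-bracket placeholders):

kubectl get deployment <name> -n <namespace>
# NAME   READY   UP-TO-DATE   AVAILABLE   AGE
# web    2/5     3            2           42m
# READY stuck below the desired count, and UP-TO-DATE short of it, both point at a stalled rollout.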

Common root causes

  • Broken image (application fails to start, missing dependency).
  • Liveness/readiness probes misconfigured for the new version.
  • Resource requests too high for available nodes, leaving pods unschedulable.
  • PodDisruptionBudget, affinity/anti-affinity, or node selectors preventing enough replicas from running.
  • Networking/DNS issues preventing the app from reaching required backends.
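
The deadline that trips this state is the Deployment's progressDeadlineSeconds field (600 seconds by default). If a legitimately slow-starting app is the root cause, you can inspect and raise it; the 900 below is only an example, and spec changes are best made in source-controlled manifests:

# Read the current deadline (defaults to 600 seconds):
kubectl get deployment <name> -n <namespace> -o jsonpath='{.spec.progressDeadlineSeconds}'

# Raise it for a slow-starting app:
kubectl patch deployment <name> -n <namespace> -p '{"spec":{"progressDeadlineSeconds":900}}'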

How KubeGraf helps

  • Shows rollout status visually: desired vs updated vs available replicas.
  • Highlights pods that are blocking progress (probe failures, CrashLoopBackOff, Pending with scheduling errors).
  • Exposes Events attached to pods and the Deployment in one place.
  • Lets you inspect the spec diff between the previous ReplicaSet and the new one.

Step-by-step using KubeGraf UI

1. Confirm the rollout is stuck

kubectl rollout status deployment/<name> -n <namespace>

Note any message like error: deployment "<name>" exceeded its progress deadline. Then:

  • Open KubeGraf and select the correct cluster and namespace.
  • Verify the Deployment name matches what you checked with kubectl.
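
To confirm the controller has actually given up rather than just being slow, check the Deployment's conditions; the output below is illustrative:

kubectl describe deployment <name> -n <namespace> | grep -A 6 Conditions
# Conditions:
#   Type           Status  Reason
#   ----           ------  ------
#   Available      False   MinimumReplicasUnavailable
#   Progressing    False   ProgressDeadlineExceeded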

2. Open the Deployment view

  • In KubeGraf, go to Deployments and select the affected Deployment.
  • Check the summary: desired, updated, and available/Ready replicas.
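
The same counters are visible per ReplicaSet from the CLI, which makes the blocker obvious. This assumes your pods carry an app label; adjust the selector to match, and note the names and numbers are illustrative:

kubectl get rs -n <namespace> -l app=<label>
# NAME            DESIRED   CURRENT   READY   AGE
# web-7d4b9c8f6   3         3         3       2d     <- previous, still serving
# web-5f9d7b6c4   2         2         0       15m    <- new, never becomes Ready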

3. Identify blocking pods

  • From the Deployment details, open the linked ReplicaSets and Pods.
  • Look for pods in Pending, CrashLoopBackOff, or with failing readiness probes.
  • Use filters to narrow the list to non-Ready pods.
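
A CLI cross-check, under the same app-label assumption as above (newest pods sort last):

kubectl get pods -n <namespace> -l app=<label> --sort-by=.metadata.creationTimestamp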

4. Inspect Events for scheduling or probe issues

  • On a problematic pod, open Events.
  • Look for messages such as (exact wording varies by Kubernetes version):
      0/3 nodes are available: 3 Insufficient cpu.
      Readiness probe failed: HTTP probe failed with statuscode: 500
      FailedScheduling: 0/3 nodes are available: 3 node(s) had untolerated taint ...
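
The same Events are available via kubectl if you need to paste them into an incident channel:

# Events for a single pod appear at the end of the describe output:
kubectl describe pod <pod-name> -n <namespace>

# Or list all recent warnings in the namespace, oldest first:
kubectl get events -n <namespace> --field-selector type=Warning --sort-by=.lastTimestamp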

5. Compare new vs previous ReplicaSet spec

  • In the Deployment view, open the history / ReplicaSets panel.
  • Compare the new ReplicaSet to the previous one:
      • Image tag.
      • Resource requests/limits.
      • Probes (paths, ports, thresholds).
      • Env vars and config references.
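
If you prefer to diff outside the UI, dump both ReplicaSets and compare them; <old-rs> and <new-rs> are the ReplicaSet names shown in the panel (or by kubectl get rs):

kubectl get rs <old-rs> -n <namespace> -o yaml > old-rs.yaml
kubectl get rs <new-rs> -n <namespace> -o yaml > new-rs.yaml
# Focus on spec.template: image, resources, probes, env, and config references.
diff old-rs.yaml new-rs.yaml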

6. Decide: roll back vs fix forward

  • If the change is clearly broken and you need fast recovery:
      kubectl rollout undo deployment/<name> -n <namespace>
      kubectl rollout status deployment/<name> -n <namespace>
  • Watch in KubeGraf as pods for the previous ReplicaSet return to Ready.
  • If you can fix forward (e.g. adjust probe or config), update the spec via code and apply through CI/GitOps.
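
Before undoing, you can list revisions to confirm where you will land (revision numbers are illustrative):

kubectl rollout history deployment/<name> -n <namespace>
# REVISION  CHANGE-CAUSE
# 2         <none>
# 3         <none>

# Undo to a specific revision rather than just the previous one:
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=2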

7. Verify impact at the service level

  • In KubeGraf, move to the Topology or Services view.
  • Confirm all Service endpoints backing the Deployment are Ready, with no backends stuck in NotReady.
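
The CLI equivalent, where <service> is whatever Service fronts this Deployment:

# Addresses listed under ENDPOINTS are Ready backends; NotReady pods are excluded:
kubectl get endpoints <service> -n <namespace>

# The full object also lists notReadyAddresses explicitly:
kubectl get endpoints <service> -n <namespace> -o yaml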

What to check next

  • Are other Deployments rolling out at the same time on the same nodes (resource contention)?
  • Is there a cluster-wide issue (node pressure, CNI problems) reflected in Events?
  • Are there PDBs or policies that restrict how many pods can be unavailable during rollout?
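
Quick checks for the questions above (kubectl top requires metrics-server to be installed):

# PodDisruptionBudgets that may limit how many pods can be down at once:
kubectl get pdb -n <namespace>

# Node pressure and capacity at a glance:
kubectl top nodes
kubectl describe nodes | grep -A 7 "Allocated resources"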

Common mistakes

  • Focusing only on the Deployment object and ignoring pod-level Events.
  • Forgetting to check node-level constraints when pods are Pending.
  • Rolling back image but leaving an incompatible probe or config in place.
  • Manually deleting pods to "unstick" a rollout instead of fixing the spec.

Related issues

Expected outcome

After following this playbook you should:

  • Understand why the rollout is stalled (scheduling, crash, probes, or config).
  • Either roll back safely or apply a corrected spec that converges to the desired Ready replicas.
  • Be able to monitor future rollouts of this Deployment in KubeGraf with clear visibility into progress.
[ TODO: screenshot showing KubeGraf Deployment view with rollout status and blocking pods highlighted. ]