CrashLoopBackOff
Why this matters
CrashLoopBackOff means a container is repeatedly starting and crashing. It usually points to a real bug or a bad
configuration. Every restart wastes resources and delays user traffic; in production this often shows up as 5xx errors,
timeouts, or failing background jobs.
Tip: Always confirm kubectl config current-context and namespace before diving into logs.
Symptoms
- kubectl get pods shows STATUS CrashLoopBackOff for one or more pods.
- Service or ingress in front of the workload is returning 5xx or timeouts.
- Containers start, run for a few seconds, then exit with a non-zero code.
- Logs show the same error pattern every time the pod restarts.
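If you want a quick CLI confirmation of these symptoms first, a listing typically looks like this (illustrative output; the pod name is a placeholder):
kubectl get pods -n <namespace>
NAME                    READY   STATUS             RESTARTS   AGE
api-7d9c5b6f4d-xk2lp    0/1     CrashLoopBackOff   7          12m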
Common root causes
- Application startup failures (missing env vars, invalid config, missing migrations).
- Crash on dependency connection (DB not reachable, message broker auth failure).
- Mismatch between image and config (a new image expects different env vars or flags).
- Misconfigured liveness/readiness probes (failing on a new path or port).
- Secrets or ConfigMaps missing, wrong key names, or wrong mount paths.
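Two quick checks that cover the most common of these causes (a sketch; resource names are placeholders):
# Env vars the Deployment actually passes to the container (including ConfigMap/Secret refs)
kubectl set env deployment/<name> --list -n <namespace>
# ConfigMaps and Secrets that actually exist in the namespace, to compare against what the pod references
kubectl get configmap,secret -n <namespace>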
How KubeGraf helps
- Highlights crash-looping pods in the namespace so you don't hunt through raw kubectl output.
- Shows restart counts, last state (e.g. Error and its exit code), and recent events next to the pod.
- Lets you pivot quickly between pod logs, events, Deployment, ConfigMap, and Secret.
- Incident timeline view helps correlate: image deployed → config changed → probes failing → CrashLoopBackOff.
Step-by-step using KubeGraf UI
1. Confirm the problem in the right cluster/namespace
kubectl config current-context
kubectl get pods -n <namespace>
- Start the KubeGraf Terminal UI by running kubegraf.
- Ensure the context and namespace in KubeGraf match what you just checked.
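To also confirm which namespace the current context defaults to (an optional extra check):
kubectl config view --minify | grep namespace: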
2. Locate CrashLooping pods
- Open the Pods view for the affected namespace.
- Use filters to show only unhealthy pods (status CrashLoopBackOff / Error).
- Note restart count, age, and container name (if multi-container pod).
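If you want the same filtering from the CLI, one way (a sketch, not the only one) is:
# Show only pods whose STATUS column reports a crash or error
kubectl get pods -n <namespace> | grep -E 'CrashLoopBackOff|Error'
# Sort pods by restart count to surface the noisiest ones
kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'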
3. Inspect recent events and reasons
- From the pod details, open Events.
- Look for messages such as Back-off restarting failed container, probe failures, or image pull errors.
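The kubectl equivalents, if you need the raw events (names are placeholders):
# Events and last container state for a single pod
kubectl describe pod <pod-name> -n <namespace>
# All recent events in the namespace, sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp'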
4. Inspect logs around the crash
- From the same pod, open Logs in KubeGraf and scroll to the last lines before exit.
- Capture the exact error message and exit code, for example:
2025-03-22T12:01:03Z ERROR app Failed to start HTTP server: DB_CONNECTION_STRING not set
2025-03-22T12:01:03Z ERROR app Exiting with code 1
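Because the container keeps restarting, the crash output usually lives in the previous instance; from the CLI you can fetch it like this (pod name is a placeholder):
# Logs from the container instance that just crashed
kubectl logs <pod-name> -n <namespace> --previous
# Exit code of the last terminated container (first container in the pod)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'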
5. Check configuration linked to the pod
- From pod details, jump to its Deployment (or StatefulSet/Job).
- Review container image tag, env vars, and probes.
- Follow links to ConfigMaps and Secrets referenced by the pod and compare with what the app expects.
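The same comparison can be done from the CLI (a sketch; resource names are placeholders):
# Deployed spec: image tag, env vars, probes, volume mounts
kubectl get deployment <name> -n <namespace> -o yaml
# Keys (and base64 values) in a referenced ConfigMap or Secret
kubectl get configmap <config-name> -n <namespace> -o jsonpath='{.data}'
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data}'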
6. Use Incident Timeline / change history
- Open the Incident Timeline for this workload/namespace.
- Look for deploys, config updates, or probe changes just before the CrashLoopBackOff started.
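If the workload is a Deployment, its rollout history is a useful cross-check against the timeline:
# Revisions recorded for the Deployment; compare timestamps with the incident start
kubectl rollout history deployment/<name> -n <namespace>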
7. Apply fix and watch recovery
- Typical fixes include reverting a bad config/Secret, fixing missing env vars, or correcting probe path/port.
kubectl rollout undo deployment/<name> -n <namespace>
kubectl edit configmap <config-name> -n <namespace>
- Use KubeGraf to watch new pods transition from CrashLoopBackOff to Running.
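From the CLI, recovery can be confirmed with (names are placeholders):
# Block until the rollout completes or fails
kubectl rollout status deployment/<name> -n <namespace>
# Watch pods transition back to Running
kubectl get pods -n <namespace> -w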
What to check next
- Are other pods in the same Deployment also impacted, or only one replica?
- Does the issue correlate with a specific node (node-local problem)?
- Is the CrashLoop only in one namespace or across multiple environments (dev/staging/prod)?
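Quick ways to answer these from kubectl (a sketch):
# Which node each replica is scheduled on (node-local problems show up here)
kubectl get pods -n <namespace> -o wide
# Whether the same workload is crashing in other namespaces
kubectl get pods -A | grep CrashLoopBackOff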
Common mistakes
- Debugging the wrong cluster/namespace because the kubeconfig context was not checked.
- Only looking at logs and ignoring Events (probe misconfig is often obvious there).
- Fixing a single pod manually instead of changing the Deployment/ConfigMap/Secret.
- Rolling back an image without rolling back the config that was changed at the same time.
Related issues
Expected outcome
After following this playbook you should:
- Identify whether the CrashLoopBackOff is due to configuration, code, or environment.
- Know which change introduced the failure and either roll back or fix forward safely.
- See pods return to Running and external symptoms (5xx, latency) disappear.
[ TODO: screenshot showing KubeGraf with a CrashLoopBackOff pod selected, logs + events visible. ]