CrashLoopBackOff

Why this matters

CrashLoopBackOff means a container is repeatedly starting and crashing, and the kubelet is backing off with increasing delays between restart attempts. It usually points to a real bug or a bad configuration. Every restart wastes resources and delays serving user traffic; in production this often shows up as 5xx errors, timeouts, or failing background jobs.

Tip: Always confirm the active context (kubectl config current-context) and target namespace before diving into logs.

Symptoms

  • kubectl get pods shows STATUS CrashLoopBackOff for one or more pods (see the example below).
  • Service or ingress in front of the workload is returning 5xx or timeouts.
  • Containers start, run for a few seconds, then exit with a non-zero code.
  • Logs show the same error pattern every time the pod restarts.
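
As an illustration, a crash-looping pod looks roughly like this in kubectl get pods output (pod and deployment names are placeholders):

NAME                           READY   STATUS             RESTARTS      AGE
payments-api-7d9f8b6c4-x2lkq   0/1     CrashLoopBackOff   7 (45s ago)   12m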

Common root causes

  • Application startup failures (missing env vars, invalid config, missing migrations).
  • Crash on dependency connection (DB not reachable, message broker auth failure).
  • Mismatch between the image and its config (a new image expects different env vars or flags).
  • Fatal probe configuration (liveness/readiness probe failing on a new path/port).
  • Secrets or ConfigMaps missing, wrong key names, or wrong mount paths (see the sketch after this list).
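
Most of these failure modes map to a few fields in the pod spec. A minimal sketch of where they live (image, Secret name/key, and probe path are placeholders, not values from your cluster):

containers:
  - name: app
    image: registry.example.com/app:v2.3.1    # a new tag may expect different env vars or flags
    env:
      - name: DB_CONNECTION_STRING
        valueFrom:
          secretKeyRef:
            name: app-db-secret               # must exist in the same namespace
            key: connection-string            # a wrong Secret name or key prevents the container from starting
    livenessProbe:
      httpGet:
        path: /healthz                        # must be a path the running image actually serves
        port: 8080
      initialDelaySeconds: 10
      failureThreshold: 3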

How KubeGraf helps

  • Highlights CrashLooping pods in the namespace so you don't hunt through raw kubectl output.
  • Shows restart counts, last state (e.g. Error, exit code), and recent events next to the pod.
  • Lets you pivot quickly between pod logs, events, Deployment, ConfigMap, and Secret.
  • The Incident Timeline view helps correlate: image deployed → config changed → probes failing → CrashLoopBackOff.

Step-by-step using KubeGraf UI

1. Confirm the problem in the right cluster/namespace

kubectl config current-context
kubectl get pods -n <namespace>
  • Start the KubeGraf Terminal UI by running kubegraf.
  • Ensure the context and namespace in KubeGraf match what you just checked.
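
If you prefer not to pass -n on every later command, you can also pin the namespace for the current context (same placeholder namespace as above):

kubectl config set-context --current --namespace=<namespace>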

2. Locate CrashLooping pods

  • Open the Pods view for the affected namespace.
  • Use filters to show only unhealthy pods (status CrashLoopBackOff / Error).
  • Note restart count, age, and container name (if multi-container pod).
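
For a quick CLI cross-check of the same view (a simple text match, not an exhaustive filter):

kubectl get pods -n <namespace> | grep -E 'CrashLoopBackOff|Error'
kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'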

3. Inspect recent events and reasons

  • From the pod details, open Events.
  • Look for messages such as Back-off restarting failed container, probe failures, or image pull errors.
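
The same events are available from the CLI if you need to copy them elsewhere:

kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'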

4. Inspect logs around the crash

  • From the same pod, open Logs in KubeGraf and scroll to the last lines before exit.
  • Capture the exact error message and exit code. For example, the final lines before a crash might look like:
2025-03-22T12:01:03Z ERROR app Failed to start HTTP server: DB_CONNECTION_STRING not set
2025-03-22T12:01:03Z ERROR app Exiting with code 1
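
Because the crashing container keeps getting replaced, the CLI equivalent is to read the previous container's output (use -c for multi-container pods):

kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -c <container-name> -n <namespace> --previous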

5. Check configuration linked to the pod

  • From pod details, jump to its Deployment (or StatefulSet/Job).
  • Review container image tag, env vars, and probes.
  • Follow links to ConfigMaps and Secrets referenced by the pod and compare with what the app expects.
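
To compare against the raw objects from the CLI (resource names are placeholders; Secret values are base64-encoded):

kubectl get deployment <name> -n <namespace> -o yaml
kubectl get configmap <config-name> -n <namespace> -o yaml
kubectl get secret <secret-name> -n <namespace> -o yaml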

6. Use Incident Timeline / change history

  • Open the Incident Timeline for this workload/namespace.
  • Look for deploys, config updates, or probe changes just before the CrashLoopBackOff started.
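
The deploy part of that history is also visible from the CLI (revision number is a placeholder):

kubectl rollout history deployment/<name> -n <namespace>
kubectl rollout history deployment/<name> -n <namespace> --revision=<revision>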

7. Apply fix and watch recovery

  • Typical fixes include reverting a bad config/Secret, fixing missing env vars, or correcting the probe path/port. For example:
kubectl rollout undo deployment/<name> -n <namespace>
kubectl edit configmap <config-name> -n <namespace>
  • Use KubeGraf to watch new pods transition from CrashLoopBackOff to Running.
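
To confirm recovery from the CLI as well:

kubectl rollout status deployment/<name> -n <namespace>
kubectl get pods -n <namespace> -w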

What to check next

  • Are other pods in the same Deployment also impacted, or only one replica?
  • Does the issue correlate with a specific node (node-local problem)?
  • Is the CrashLoop only in one namespace or across multiple environments (dev/staging/prod)?
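
A couple of quick commands help answer these: -o wide shows which node each pod landed on, and a cluster-wide listing shows whether other namespaces are affected.

kubectl get pods -n <namespace> -o wide
kubectl get pods -A | grep CrashLoopBackOff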

Common mistakes

  • Debugging the wrong cluster/namespace because the kubeconfig context was not checked.
  • Only looking at logs and ignoring Events (probe misconfig is often obvious there).
  • Fixing a single pod manually instead of changing the Deployment/ConfigMap/Secret.
  • Rolling back an image without rolling back the config that was changed at the same time.
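
Related to the last two points: after fixing a ConfigMap or Secret, existing pods generally need to be restarted to pick up the change; one way is a rollout restart of the owning Deployment (name is a placeholder):

kubectl rollout restart deployment/<name> -n <namespace>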

Related issues

Expected outcome

After following this playbook you should:

  • Identify whether the CrashLoopBackOff is due to configuration, code, or environment.
  • Know which change introduced the failure and either roll back or fix forward safely.
  • See pods return to Running and external symptoms (5xx, latency) disappear.
[ TODO: screenshot showing KubeGraf with a CrashLoopBackOff pod selected, logs + events visible. ]