Kubernetes incidents often feel noisy because several layers can fail at once: the application, the container image, the pod spec, the node, the network, storage, or the rollout itself. This checklist is designed to slow that down. Instead of jumping between dashboards and guesses, you can work through a repeatable Kubernetes troubleshooting guide that helps isolate symptoms, confirm impact, and narrow root causes before making changes. Keep it handy for pod failures, networking issues, persistent volume problems, and broken deployments.
Overview
This article gives you a reusable Kubernetes checklist for common operational issues. The goal is not to memorize every command or edge case. The goal is to follow the same order of checks each time so your team can debug a Kubernetes cluster with less rework and fewer risky fixes.
A good troubleshooting workflow usually follows this sequence:
- Define the symptom clearly. What is actually broken: scheduling, startup, readiness, service reachability, storage attachment, or a rollout?
- Scope the blast radius. Is it one pod, one deployment, one namespace, one node pool, or the whole cluster?
- Check recent changes first. New image, config, secret, manifest edit, autoscaling event, node update, policy change, or ingress adjustment.
- Start with Kubernetes status signals. Pod phase, events, conditions, restart count, deployment status, node status, PVC state.
- Then inspect logs and runtime behavior. Container logs, previous container logs, probes, entrypoint behavior, and application health.
- Only then apply a fix. Avoid changing multiple variables at once.
In practice, that means beginning with a few baseline questions:
- What changed recently?
- Which namespace and workload are affected?
- Is the issue isolated or widespread?
- What do events say?
- Do the pod status and logs support the same explanation?
If your team does nothing else, standardizing those first checks will noticeably shorten pod troubleshooting. Many incidents run longer than necessary because engineers skip straight to restarting pods or editing manifests without confirming the real failure mode.
Checklist by scenario
Use this section as the operational core of your troubleshooting process. Start with the scenario that best matches the symptom, then work down the checks in order.
1. Pod is Pending and never starts
If a pod stays in Pending, the problem is usually scheduling, resource availability, or a dependency the scheduler cannot satisfy.
- Describe the pod and read the event stream first.
- Check whether the issue is insufficient CPU, memory, ephemeral storage, or a missing node label.
- Confirm node selectors, affinity rules, anti-affinity rules, taints, and tolerations.
- Check whether the namespace has quotas or limit ranges blocking admission.
- Review PersistentVolumeClaim binding if the pod mounts storage.
- Confirm the referenced service account, secret, or config map exists if admission depends on it.
Common signals: unschedulable events, unbound PVCs, taint mismatch, or strict affinity rules that narrow placement too much.
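The taint and toleration check can be made concrete with a small sketch. It assumes you have already exported the node's taints and the pod's tolerations as plain dictionaries (for example from JSON output); the function names `tolerates` and `blocking_taints` are hypothetical helpers, not part of any Kubernetes client, and the matching rules are a simplified reading of the documented behavior.

```python
from typing import Dict, List

# Effects that keep a pod off a node; PreferNoSchedule is only a soft preference.
BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Return True if one toleration matches one taint."""
    # A toleration with no effect matches taints of every effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # Exists with an empty key tolerates every taint.
        return key == "" or key == taint.get("key")
    # Equal (the default operator): key and value must both match.
    return key == taint.get("key") and toleration.get("value", "") == taint.get("value", "")


def blocking_taints(tolerations: List[Dict[str, str]],
                    taints: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the taints that would still block scheduling on this node."""
    return [
        taint for taint in taints
        if taint.get("effect") in BLOCKING_EFFECTS
        and not any(tolerates(tol, taint) for tol in tolerations)
    ]
```

An empty result means taints are not the reason the pod is unschedulable on that node, so the next suspects are resources, affinity, or storage.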
2. Pod is in CrashLoopBackOff or keeps restarting
This is one of the most common Kubernetes issues because it blends application failure with platform behavior. A restart loop does not automatically mean Kubernetes is broken.
- Check current logs and previous logs for the container.
- Confirm whether the process exits immediately, fails health checks, or is killed for resource reasons.
- Review the command, args, environment variables, config mounts, and secret references.
- Check liveness and readiness probes for path, port, timeout, and startup timing.
- Inspect restart count and event history.
- Verify the image tag and image pull policy align with what you intended to run.
- Check whether the app expects a dependency such as a database, queue, or API that is unavailable.
Common signals: bad configuration, missing secret keys, wrong startup command, aggressive liveness probes, or application boot time longer than expected.
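As a rough illustration of separating "exits on its own" from "killed for resource reasons", the sketch below reads the `containerStatuses` section of a pod object that has already been loaded into a dictionary. The function name `summarize_restarts` and the wording of the findings are assumptions made for this example.

```python
from typing import Any, Dict, List


def summarize_restarts(pod: Dict[str, Any]) -> List[str]:
    """Turn container status fields into short, human-readable findings."""
    findings: List[str] = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        name = status.get("name", "?")
        restarts = status.get("restartCount", 0)
        waiting = status.get("state", {}).get("waiting", {})
        last = status.get("lastState", {}).get("terminated", {})

        if waiting.get("reason") == "CrashLoopBackOff":
            findings.append(f"{name}: in CrashLoopBackOff after {restarts} restarts")

        if last.get("reason") == "OOMKilled":
            # The previous run hit the memory limit of the container.
            findings.append(f"{name}: last exit was OOMKilled; compare memory limit with usage")
        elif last.get("exitCode", 0) != 0:
            # The process exited by itself with an error; previous logs usually explain why.
            findings.append(
                f"{name}: last exit code {last['exitCode']} "
                f"({last.get('reason', 'unknown')}); read previous logs"
            )
    return findings
```

The point of the split is to decide where to look next: resource limits for an OOM kill, previous container logs and configuration for a non-zero exit.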
3. ImagePullBackOff or ErrImagePull
When the image cannot be pulled, focus on naming, registry access, and credentials before anything else.
- Verify the image name, registry host, repository path, and tag.
- Check whether the image actually exists in the target registry.
- Confirm image pull secrets are present and attached correctly.
- Review service account configuration if the pull secret is inherited there.
- Check whether network policy, proxy settings, or egress restrictions prevent registry access.
- Make sure the cluster nodes can resolve and reach the registry endpoint.
Common signals: wrong tag, stale credentials, missing pull secret, private registry mismatch, or DNS/connectivity issues to the registry.
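Splitting an image reference into its parts makes naming mistakes easier to spot. The sketch below follows the common Docker-style convention (a first path component that looks like a host is the registry, otherwise `docker.io` is assumed, and a missing tag means `latest`). It is a simplified parser written for this checklist, not the reference grammar used by container runtimes, and `parse_image` is a hypothetical name.

```python
from typing import Dict, Optional


def parse_image(ref: str) -> Dict[str, Optional[str]]:
    """Split an image reference into registry, repository, tag, and digest."""
    digest: Optional[str] = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)

    # The first path component is a registry only if it looks like a host.
    first, sep, rest = ref.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, path = first, rest
    else:
        registry, path = "docker.io", ref

    # A tag is a colon that appears after the last slash.
    tag: Optional[str]
    colon = path.rfind(":")
    if colon > path.rfind("/"):
        path, tag = path[:colon], path[colon + 1:]
    else:
        tag = None if digest else "latest"

    # Single-name images on Docker Hub live under the "library" namespace.
    if registry == "docker.io" and "/" not in path:
        path = "library/" + path

    return {"registry": registry, "repository": path, "tag": tag, "digest": digest}
```

Comparing the parsed registry and repository against where the image was actually pushed often exposes a private registry mismatch or an unintended `latest` tag.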
4. Pod is Running but not Ready
A running pod that never becomes ready usually points to probe configuration, dependency readiness, or application-level initialization that never completes.
- Inspect readiness probe configuration carefully: path, port, scheme, headers, timeout, and initial delay.
- Confirm the container is listening on the expected interface and port.
- Check logs for boot completion, migration tasks, or dependency timeouts.
- Test the health endpoint from inside the pod if possible.
- Make sure the app does not bind only to localhost when the probe expects pod IP reachability.
- Check startup probe settings if the application takes a long time to initialize.
Common signals: mismatched health endpoint, wrong port, slow startup, or readiness checks depending on external services too early.
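For slow-starting applications, it helps to work out how long a probe will wait before it gives up. The sketch below approximates that budget as the initial delay plus the failure threshold multiplied by the period, using the documented probe defaults. It ignores per-attempt timeouts, so treat the result as an estimate; `failure_budget_seconds` and `boot_exceeds_budget` are hypothetical helper names.

```python
from typing import Any, Dict

# Documented defaults for probe fields that are left unset.
PROBE_DEFAULTS = {"initialDelaySeconds": 0, "periodSeconds": 10, "failureThreshold": 3}


def failure_budget_seconds(probe: Dict[str, Any]) -> int:
    """Approximate seconds before the probe is considered failed."""
    cfg = {**PROBE_DEFAULTS, **probe}
    return cfg["initialDelaySeconds"] + cfg["periodSeconds"] * cfg["failureThreshold"]


def boot_exceeds_budget(boot_seconds: int, probe: Dict[str, Any]) -> bool:
    """Return True if the observed boot time is longer than the probe allows."""
    return boot_seconds > failure_budget_seconds(probe)
```

For example, a startup probe with `failureThreshold: 30` and `periodSeconds: 10` gives roughly 300 seconds, while an unconfigured probe gives roughly 30. If measured boot time sits close to the budget, the probe is a likely cause of the failure.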
5. Service is unreachable inside the cluster
If one workload cannot reach another, move from DNS to service definition to endpoint selection.
- Confirm the client pod can resolve the service DNS name.
- Check the service type, port, targetPort, and protocol.
- Verify the service selector matches the intended pods.
- Inspect endpoints or endpoint slices to confirm backing pods are registered.
- Check whether the target pods are actually Ready.
- Review network policies on both source and destination namespaces.
- Test connectivity directly to pod IPs to separate service problems from application problems.
Common signals: selector mismatch, wrong targetPort, empty endpoints, or deny-by-default network policy.
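The selector check is simple enough to sketch directly: a pod backs a service only if every key and value in the service selector appears in the pod's labels. The example assumes the service and pods come from the same namespace and are already loaded as dictionaries; `selector_matches` and `backing_pods` are hypothetical names.

```python
from typing import Any, Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Return True if every selector key/value pair is present in the labels."""
    # A service without a selector does not select any pods automatically.
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def backing_pods(service: Dict[str, Any], pods: List[Dict[str, Any]]) -> List[str]:
    """Return the names of pods whose labels satisfy the service selector."""
    selector = service.get("spec", {}).get("selector", {})
    return [
        pod["metadata"]["name"]
        for pod in pods
        if selector_matches(selector, pod["metadata"].get("labels", {}))
    ]
```

An empty list here corresponds to the "empty endpoints" signal: the service exists, but no pod carries the labels it is looking for.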
For broader HTTP behavior, a dedicated HTTP status code troubleshooting guide can help when Kubernetes networking looks healthy but the application still returns failing responses.
6. Ingress or external traffic is failing
When traffic from outside the cluster does not reach the service, walk the request path in order rather than guessing at the edge.
- Confirm the ingress resource points to the correct service name and service port.
- Check ingress class configuration and whether the correct controller owns the resource.
- Verify DNS points to the right external address.
- Inspect TLS secret references and certificate validity if HTTPS is involved.
- Check controller logs for routing, rule, or backend errors.
- Test the service directly from inside the cluster to confirm the backend is healthy before debugging ingress.
- Review path matching and rewrite rules for subtle route mismatches.
Common signals: wrong ingress class, bad backend port, stale DNS, TLS mismatch, or backend service not actually serving traffic.
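The first check in that list, whether each ingress backend points at a real service and port, can be sketched as below. It assumes `networking.k8s.io/v1` style Ingress objects loaded as dictionaries and a mapping from service name to Service object in the same namespace. It only looks at rule paths and ignores `defaultBackend`; the function name `ingress_backend_problems` is hypothetical.

```python
from typing import Any, Dict, List


def ingress_backend_problems(ingress: Dict[str, Any],
                             services: Dict[str, Dict[str, Any]]) -> List[str]:
    """List ingress paths whose backend service or port cannot be found."""
    problems: List[str] = []
    for rule in ingress.get("spec", {}).get("rules", []):
        for path in rule.get("http", {}).get("paths", []):
            backend = path.get("backend", {}).get("service", {})
            name = backend.get("name")
            port = backend.get("port", {})
            label = path.get("path", "/")

            service = services.get(name)
            if service is None:
                problems.append(f"{label}: service {name!r} not found")
                continue

            ports = service.get("spec", {}).get("ports", [])
            if "number" in port:
                found = any(p.get("port") == port["number"] for p in ports)
            elif "name" in port:
                found = any(p.get("name") == port["name"] for p in ports)
            else:
                found = False
            if not found:
                problems.append(f"{label}: service {name!r} has no port {port}")
    return problems
```

A clean result does not prove the ingress works, since class ownership, DNS, and TLS are checked elsewhere, but it removes one common cause early.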
7. Persistent volume or storage mount problems
Storage issues are often a mix of claim state, access mode expectations, and node-level attachment behavior.
- Check whether the PVC is Bound.
- Confirm storage class, access mode, requested size, and volume mode are valid for the workload.
- Describe the pod and PVC to review mount or attachment events.
- Check whether the workload expects multi-writer access while the volume supports only a single writer.
- Verify the node can attach the volume in the current topology or zone.
- Inspect file permissions, ownership, and security context if the mount exists but the app cannot use it.
Common signals: unbound claims, access mode mismatch, zone mismatch, attachment failure, or permission problems inside the mounted path.
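The multi-writer check can be approximated by comparing the claim's access modes with where its pods are running. The sketch below assumes you have collected the access modes of the claim and a mapping from pod name to node name for the pods that mount it. `access_mode_conflict` is a hypothetical helper, and the rules are a simplified reading of the access mode semantics.

```python
from typing import Dict, List, Optional

# Modes that permit mounting from more than one node.
MULTI_NODE_MODES = {"ReadWriteMany", "ReadOnlyMany"}


def access_mode_conflict(access_modes: List[str],
                         pod_nodes: Dict[str, str]) -> Optional[str]:
    """Return a description of the conflict, or None if placement looks compatible."""
    if MULTI_NODE_MODES & set(access_modes):
        return None
    nodes = set(pod_nodes.values())
    # ReadWriteOncePod restricts the volume to a single pod.
    if "ReadWriteOncePod" in access_modes and len(pod_nodes) > 1:
        return f"ReadWriteOncePod claim is used by {len(pod_nodes)} pods"
    # ReadWriteOnce restricts the volume to a single node, not a single pod.
    if "ReadWriteOnce" in access_modes and len(nodes) > 1:
        return f"ReadWriteOnce claim is used from {len(nodes)} nodes: {sorted(nodes)}"
    return None
```

A conflict reported here usually shows up in the cluster as attachment or mount events on the second pod rather than as an application error.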
8. Deployment rollout is stuck or behaving unexpectedly
Broken rollouts can appear as partial updates, unavailable replicas, or repeated replacement of pods that never become healthy.
- Check deployment status, conditions, and progress deadline events.
- Review the associated ReplicaSets to confirm which version is active and which is failing.
- Inspect readiness failures before scaling or forcing a restart.
- Confirm maxUnavailable and maxSurge settings match your tolerance for disruption.
- Compare the live manifest to the intended change: image, env vars, config refs, probes, resources, labels, and selectors.
- Pause and inspect rather than repeatedly reapplying manifests during an incident.
Common signals: bad image, readiness failures, an attempted change to an immutable field (such as the Deployment selector), selector mismatch, or rollout strategy settings that hide the real issue.
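To see what maxSurge and maxUnavailable mean for a specific replica count, it can help to resolve the percentages by hand. The sketch below assumes the documented rounding (maxSurge rounds up, maxUnavailable rounds down) and the default of 25% for both; `rollout_bounds` is a hypothetical name and the calculation does not cover every controller edge case.

```python
from typing import Dict, Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Convert an absolute number or a percentage string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating point surprises when rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int,
                   max_surge: IntOrPercent = "25%",
                   max_unavailable: IntOrPercent = "25%") -> Dict[str, int]:
    """Return the pod count ceiling and availability floor during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }
```

With 3 replicas and the defaults, the floor equals the replica count, so a rollout whose new pods never become ready cannot remove any old pods and will appear stuck rather than degraded.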
9. Node-level problems affect multiple workloads
If several unrelated pods fail together, widen the scope. The node or cluster control path may be the real source of trouble.
- Check node conditions for memory pressure, disk pressure, readiness, and network availability.
- Look for container runtime issues or kubelet instability.
- Review whether recent node upgrades, autoscaling events, or drain operations correlate with the incident.
- Compare failing workloads by node placement to see whether impact clusters around specific hosts.
- Check DaemonSet health for networking, logging, storage, or security agents that run on every node.
Common signals: pressure conditions, kubelet or runtime instability, CNI problems, or node-specific capacity exhaustion.
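Two of these checks, reading node conditions and grouping failures by node, are easy to sketch over objects already loaded as dictionaries. The expected condition values below reflect the standard node condition types; the helper names `unhealthy_conditions` and `failures_by_node` are assumptions for this example.

```python
from collections import Counter
from typing import Any, Dict, List

# Healthy value for each standard node condition type.
EXPECTED_STATUS = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}


def unhealthy_conditions(node: Dict[str, Any]) -> List[str]:
    """Return node conditions whose status differs from the healthy value."""
    bad: List[str] = []
    for cond in node.get("status", {}).get("conditions", []):
        expected = EXPECTED_STATUS.get(cond.get("type"))
        if expected is not None and cond.get("status") != expected:
            bad.append(f"{cond['type']}={cond.get('status')}")
    return bad


def failures_by_node(failing_pods: List[Dict[str, Any]]) -> Counter:
    """Count failing pods per node to see whether impact clusters on specific hosts."""
    return Counter(pod.get("spec", {}).get("nodeName", "<unscheduled>") for pod in failing_pods)
```

If most failing pods share one or two nodes and those nodes report pressure or an Unknown readiness status, the workloads themselves are probably not the place to start.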
10. Config or secret changes did not take effect
This scenario often causes confusion because the cluster may be healthy while the workload is still using stale values.
- Confirm the resource was updated in the correct namespace.
- Check whether the application loads config only at startup.
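- Check whether the value is consumed through a mounted volume or through environment variables, since update behavior differs: mounted ConfigMap and Secret files are refreshed eventually (though not for subPath mounts), while environment variables stay fixed until the pod restarts.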
- Make sure the deployment template actually references the intended config map or secret name.
- Review rollout history to confirm a new ReplicaSet was triggered when required.
Common signals: wrong namespace, stale pod instance, unchanged pod template, or a mistaken assumption that all config updates are live-reloaded.
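To confirm which config maps and secrets a pod template actually references, you can walk the pod spec and collect the names. The sketch below covers `envFrom`, `env[].valueFrom`, and plain `configMap` and `secret` volumes; it does not look inside projected volumes or image pull secrets. `referenced_config` is a hypothetical helper name.

```python
from typing import Any, Dict, Set


def referenced_config(pod_spec: Dict[str, Any]) -> Dict[str, Set[str]]:
    """Collect ConfigMap and Secret names referenced by a pod spec."""
    refs: Dict[str, Set[str]] = {"configMaps": set(), "secrets": set()}
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])

    for container in containers:
        # Whole-object injection into the environment.
        for source in container.get("envFrom", []):
            if "configMapRef" in source:
                refs["configMaps"].add(source["configMapRef"]["name"])
            if "secretRef" in source:
                refs["secrets"].add(source["secretRef"]["name"])
        # Single-key injection into the environment.
        for env in container.get("env", []):
            value_from = env.get("valueFrom", {})
            if "configMapKeyRef" in value_from:
                refs["configMaps"].add(value_from["configMapKeyRef"]["name"])
            if "secretKeyRef" in value_from:
                refs["secrets"].add(value_from["secretKeyRef"]["name"])

    # File-based injection through volumes.
    for volume in pod_spec.get("volumes", []):
        if "configMap" in volume:
            refs["configMaps"].add(volume["configMap"]["name"])
        if "secret" in volume:
            refs["secrets"].add(volume["secret"]["secretName"])
    return refs
```

If the name you just updated is missing from the result, the workload is reading a different object, and no amount of restarting will pick up the change.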
When reviewing raw manifests or generated values, formatting mistakes can slow diagnosis. Tools and habits covered in JSON vs YAML vs TOML: Which Config Format Is Best for Developer Workflows? and JSON Formatter and Validator Tools: Which One Should Developers Use? can help reduce syntax-related confusion during incident work.
What to double-check
These are the checks that experienced teams repeat because they catch a surprising number of incidents.
Recent change history
- New image tag
- Changed environment variable
- Secret rotation
- Config map update
- Ingress edit
- Network policy rollout
- Node pool upgrade or autoscaler event
If you only have time for one high-value question, ask: what changed just before the issue appeared?
Namespace and context mistakes
Many debugging mistakes are simple context errors. Make sure you are looking at the correct cluster, namespace, and workload name. In multi-cluster environments, this matters even more than people expect.
Labels and selectors
Services, deployments, and policies depend heavily on label matching. A single label drift can break routing, rollout ownership, or policy behavior without producing an obvious application error.
Ports, target ports, and health checks
Port mismatches are easy to miss because the pod can still appear healthy from one angle while failing from another. Verify containerPort, service port, targetPort, probe port, and ingress backend port line up exactly.
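Named ports are a frequent source of these mismatches, because a service `targetPort` may be a number or the name of a container port. The sketch below resolves a service port entry to the numeric port traffic will be sent to, assuming the service port and the pod's containers are loaded as dictionaries. `resolve_target_port` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def resolve_target_port(service_port: Dict[str, Any],
                        containers: List[Dict[str, Any]]) -> Optional[int]:
    """Return the numeric pod port a service port forwards to, or None if unresolved."""
    # When targetPort is omitted it defaults to the service port itself.
    target = service_port.get("targetPort", service_port["port"])
    if isinstance(target, int):
        return target
    # A string targetPort must match the name of a declared container port.
    for container in containers:
        for port in container.get("ports", []):
            if port.get("name") == target:
                return port["containerPort"]
    return None
```

Compare the resolved number with the port the application listens on and with the probe port. A `None` result means the service names a port that no container declares.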
Resource requests and limits
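Requests affect scheduling. Limits affect runtime enforcement. If a pod never schedules, the request may be larger than any node can currently offer. If it restarts under load, the memory limit may be too low (the container is OOM-killed when it exceeds the limit) or the application's memory usage may not be well understood.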
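Comparing a request with what a node has free requires converting Kubernetes quantity strings into plain numbers. The sketch below handles the common cases (millicores for CPU, binary and decimal suffixes up to tera for memory) and is not a complete quantity parser. The names `cpu_millicores`, `memory_bytes`, and `request_fits` are hypothetical.

```python
from decimal import Decimal
from typing import Dict

BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '1.5' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' or '1G' into bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[:-2]) * factor)
    for suffix, factor in DECIMAL_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[:-1]) * factor)
    return int(Decimal(quantity))


def request_fits(requests: Dict[str, str], free: Dict[str, str]) -> bool:
    """Return True if the CPU and memory requests fit into the free capacity."""
    return (
        cpu_millicores(requests["cpu"]) <= cpu_millicores(free["cpu"])
        and memory_bytes(requests["memory"]) <= memory_bytes(free["memory"])
    )
```

If the request fits no node's free capacity, the scheduler has nowhere to place the pod, which matches the "never schedules" case described above.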
Manifest rendering and generated config
If you use Helm, Kustomize, templating, or CI-generated manifests, inspect the rendered output, not just the source template. A valid template can still produce an invalid or unintended manifest after values are merged.
Common mistakes
This section helps prevent the habits that turn a manageable issue into a longer outage.
- Restarting pods before collecting evidence. This can erase useful state, especially for short-lived failures.
- Changing several things at once. If you modify probes, resources, and image tags together, you may not know which change fixed or worsened the issue.
- Assuming the symptom is the root cause. A readiness failure may be caused by DNS. A crash loop may be caused by a missing secret. A storage error may actually be topology-related.
- Ignoring events. Kubernetes events are often the fastest route to the right category of problem.
- Debugging only from the application side. Many incidents cross control plane, node, and network layers.
- Forgetting dependency health. The workload may be correct while its database, queue, or upstream API is not.
- Treating every failure as a Kubernetes failure. Sometimes the platform is behaving correctly and surfacing an application or configuration defect.
It also helps to document repeat fixes in a shared internal runbook. A checklist becomes more valuable when it reflects your cluster conventions: your ingress controller, your CNI, your storage classes, your observability stack, and your deployment process.
When to revisit
This checklist should be treated as a living operational document, not a one-time article. Revisit and update it whenever your workflows or platform assumptions change.
Good moments to review it include:
- Before planning cycles. Refresh your runbooks before high-change periods, migrations, or seasonal traffic changes.
- After major incidents. Add the checks that would have shortened diagnosis.
- When platform tooling changes. New ingress controllers, service mesh adoption, policy engines, or storage backends change what “normal” looks like.
- When your deployment workflow changes. CI/CD updates often introduce new failure modes around rendering, promotion, and rollout control.
- When the team grows. A checklist should be readable by someone who did not build the original cluster.
To make this practical, end each incident with a short update routine:
- Record the visible symptom.
- Record the actual root cause.
- Note the first signal that should have pointed you there faster.
- Add one new check or one clarified step to your internal version of this checklist.
- Link the runbook to related tooling guides your team already uses.
For example, teams often pair Kubernetes debugging with API and payload inspection tools. Depending on the incident, related references such as Curl vs HTTPie vs Postman: Best API Testing Tools for Fast Debugging, URL Encoder and Decoder Guide for APIs, Base64 Encode and Decode Tools, or JWT Decoder Guide: How to Inspect Tokens Safely Without Leaking Secrets can help when the issue sits between Kubernetes routing and application authentication.
The most effective Kubernetes troubleshooting guide is not the longest one. It is the one your team actually uses under pressure. Start with the workflow here, adapt it to your cluster, and keep refining it as your platform evolves.