name: openshift-debugging
description: Diagnose OpenShift cluster issues using layered triage, failure-mode classification, and prioritized diagnostic workflows.
OpenShift Debugging
Triage: Identify the Layer
Every cluster issue lives in one of three layers. Identify the layer first — it determines your diagnostic path.
- Application layer (pods, deployments, statefulsets): Symptoms are pod-level (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled). Start with oc describe pod and oc logs.
- Platform layer (operators, controllers, API server): Symptoms are cluster-wide (degraded ClusterOperators, API timeouts, webhook failures). Start with oc get co and check the operator namespace.
- Infrastructure layer (nodes, networking, storage): Symptoms are node-level (NotReady nodes, PVC stuck in Pending, cross-node connectivity failures). Start with oc get nodes and oc describe node.
If unsure which layer: run oc get co && oc get nodes && oc get pods -A --field-selector status.phase!=Running,status.phase!=Succeeded — whichever returns problems first tells you the layer.
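The three-command check above can be sketched as a small script that reports the first layer showing problems. This is a minimal sketch, not a definitive tool: the function names unhealthy_operators, unhealthy_nodes, and triage_layer are hypothetical, and the awk filters assume the default table columns of oc get co and oc get nodes.

```shell
# Reads `oc get co` table output on stdin
# (assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED ...).
# Prints operators that are not Available=True or that are Degraded=True.
unhealthy_operators() {
  awk 'NR > 1 && ($3 != "True" || $5 == "True") { print $1 }'
}

# Reads `oc get nodes` table output on stdin
# (assumed columns: NAME STATUS ROLES AGE VERSION).
# Prints nodes whose STATUS is not exactly "Ready"; this also flags
# cordoned nodes ("Ready,SchedulingDisabled").
unhealthy_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Hypothetical helper: print the first layer that shows problems.
triage_layer() {
  if [ -n "$(oc get co | unhealthy_operators)" ]; then
    echo "platform"
  elif [ -n "$(oc get nodes | unhealthy_nodes)" ]; then
    echo "infrastructure"
  elif [ -n "$(oc get pods -A --no-headers \
      --field-selector 'status.phase!=Running,status.phase!=Succeeded' 2>/dev/null)" ]; then
    echo "application"
  else
    echo "none"
  fi
}
```

Calling triage_layer on a logged-in session prints platform, infrastructure, application, or none. The order mirrors the one-liner: operators first, then nodes, then pods.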
Failure Modes and Diagnosis Priority
CrashLoopBackOff
- Check oc logs <pod> --previous FIRST. The current container's logs may come from a restart that hasn't crashed yet.
- If logs are empty, the container dies before the app starts. Check the image entrypoint and command, not application code.
- If the container's last state is OOMKilled (shown under Last State in oc describe pod, not in the logs), compare the container memory limit with actual usage. The limit may be too low for the workload.
- Check the liveness probe. An aggressive probe kills healthy-but-slow-starting containers. Look at failureThreshold and initialDelaySeconds.
- Check ConfigMap/Secret mounts. A missing mount can cause an immediate crash with no useful log.
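The priority order above can be sketched as a helper that looks at the previous container instance and suggests the next check. The names crash_next_step and diagnose_crashloop are hypothetical, and the wrapper assumes a single-container pod (it reads only containerStatuses[0]).

```shell
# Heuristic mapping from observations to the next check.
# $1 = last termination reason, $2 = size in bytes of the previous logs
crash_next_step() {
  if [ "$1" = "OOMKilled" ]; then
    echo "compare memory limit with actual usage"
  elif [ "$2" -eq 0 ]; then
    echo "inspect image entrypoint and command"
  else
    echo "read previous logs, then check liveness probe and mounts"
  fi
}

# Hypothetical wrapper. $1 = pod name, $2 = namespace.
diagnose_crashloop() {
  # Reason recorded for the previous (terminated) container instance.
  reason=$(oc get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
  # Byte count of the previous instance's logs; 0 means it died before logging.
  bytes=$(oc logs "$1" -n "$2" --previous 2>/dev/null | wc -c | tr -d ' ')
  crash_next_step "$reason" "$bytes"
}
```

For example, diagnose_crashloop my-pod my-namespace prints one suggestion. The pod and namespace names are placeholders.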
Pending Pods
- Run oc describe pod and read the Events section. If there are NO events at all, the scheduler never attempted placement. This is almost always insufficient resources or a nodeSelector/affinity that matches zero nodes.
- If events mention "FailedScheduling" with "Insufficient cpu/memory", check oc adm top nodes for actual vs allocatable.
- If events mention taints, check tolerations on the pod spec, not just node taints.
- If the pod has a PVC, check if the PVC itself is Pending (oc get pvc). A Pending PVC blocks the pod indefinitely with no obvious error on the pod.
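As a rough illustration of reading the Events section, the sketch below pulls a pod's FailedScheduling messages and sorts each into one of the causes listed above. The function names are hypothetical, and the substring patterns are assumptions about typical scheduler wording, which can vary between versions.

```shell
# Print FailedScheduling event messages for one pod.
# $1 = pod name, $2 = namespace
pending_pod_events() {
  oc get events -n "$2" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling" \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
}

# Heuristic: classify one scheduler message by substring.
# A message listing several causes is reported under the first match only.
classify_scheduling_message() {
  case "$1" in
    *Insufficient*)          echo "resources" ;;
    *taint*)                 echo "taints" ;;
    *PersistentVolumeClaim*) echo "pvc" ;;
    *affinity*|*selector*)   echo "selector" ;;
    *)                       echo "unknown" ;;
  esac
}
```

A typical use is to pipe pending_pod_events my-pod my-namespace into a while-read loop that calls classify_scheduling_message on each line. The pod and namespace names are placeholders.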
ImagePullBackOff
Differentiate three distinct causes:
- Authentication failure (403/401): Image exists but pull secret is wrong or missing. Check if the secret is linked to the pod's service account with --for=pull.
- Image not found (404/manifest unknown): Wrong tag, wrong registry, or image was deleted. Verify the exact image:tag exists.
- Network/registry unreachable (timeout/connection refused): The node can't reach the registry. This is infrastructure-layer, so check node DNS and proxy settings.
Networking Failures
- Service has no endpoints: Service selector doesn't match pod labels. This is the #1 cause of "connection refused" between services.
- NetworkPolicy blocking traffic: The default-deny pattern catches people. If ANY NetworkPolicy selects a pod, only explicitly allowed traffic gets through.
- DNS resolution fails: Check if CoreDNS pods are running in openshift-dns. If they're crashlooping, nothing in the cluster can resolve service names.
- Route returns 503: The backend pods exist but aren't Ready. Check readiness probes, not the Route or Ingress config.
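For the selector mismatch case, one way to check is to turn the Service's selector into a label query and list the pods it actually matches. This is a sketch with hypothetical function names; it assumes the Service uses a plain key/value selector.

```shell
# Print a Service's selector as a label query (k1=v1,k2=v2).
# $1 = service name, $2 = namespace
service_selector() {
  oc get svc "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' |
    sed 's/,$//'
}

# List the pods that the Service selector matches.
# An empty list means the selector and the pod labels disagree.
pods_behind_service() {
  sel=$(service_selector "$1" "$2")
  if [ -z "$sel" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  oc get pods -n "$2" -l "$sel" -o wide
}
```

If pods_behind_service lists pods but the Service still has no endpoints, the pods are probably not Ready, which points at readiness probes rather than labels.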
Gotchas
- oc get events -A --sort-by='.lastTimestamp' is your best first move when you don't know what's wrong. Events expire after 1 hour by default, so check early.
- A pod in "Completed" status is NOT a failure. Jobs produce Completed pods, and init containers end in a Completed state, as part of normal operation. Only flag it if the pod should be long-running.
- oc adm top requires metrics-server. If it returns "metrics not available," the monitoring stack has a problem. That's a platform-layer issue, not whatever you were originally debugging.
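The first two gotchas can be folded into a pair of small helpers: one that narrows the event stream to warnings, and one that lists pods whose STATUS is neither Running nor Completed. The function names are hypothetical, and the awk filter assumes the default columns of oc get pods -A (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
# Show only Warning events across all namespaces, oldest first.
recent_warnings() {
  oc get events -A --field-selector type=Warning --sort-by='.lastTimestamp'
}

# Reads `oc get pods -A` table output on stdin and prints
# namespace/name and STATUS for pods that are neither Running nor Completed.
suspect_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 "\t" $4 }'
}
```

Running oc get pods -A piped into suspect_pods filters on the displayed STATUS column rather than the pod phase, so it also surfaces pods shown as CrashLoopBackOff.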
When to Use Sibling Skills
- Degraded ClusterOperator → use openshift-operator-troubleshooting
- Node NotReady or disk/memory pressure → use openshift-node-operations
- Issue appeared during or after a cluster upgrade → use openshift-cluster-upgrade