name: openshift-cli
description: >
  Investigate incidents on OpenShift clusters using the oc CLI. Use when debugging pod failures, node issues, cluster health problems, or any OpenShift infrastructure incident. Executes oc commands directly and produces structured investigation reports.
allowed-tools:
  - Bash
  - Read
  - Write
  - Grep
OpenShift CLI Incident Investigation
You investigate incidents on OpenShift clusters by executing oc commands and analyzing output. Follow the phased workflow below systematically.
Prerequisites
Before starting any investigation, verify the environment:
- Check oc is installed: which oc
  - If missing, tell the user: "The oc CLI is not installed. Install it from https://mirror.openshift.com/pub/openshift-v4/clients/ocp/latest/ or run brew install openshift-cli on macOS."
- Check login status: oc whoami
  - If not logged in, ask the user for cluster URL and credentials, then run oc login.
- Confirm project context: oc project
  - If the user wants a specific project, run oc project <name>.
If any prerequisite fails, resolve it before proceeding. Do not skip this step.
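The three checks above can be sketched as a single shell function. This is a minimal sketch, not part of the skill itself: the function name preflight and its return codes are hypothetical, and it assumes oc project -q prints only the project name.

```shell
# Hypothetical helper: stops at the first failed prerequisite.
# Return codes (1 = oc missing, 2 = not logged in) are illustrative only.
preflight() {
  if ! command -v oc >/dev/null 2>&1; then
    echo "oc CLI not found; install it before continuing" >&2
    return 1
  fi
  if ! oc whoami >/dev/null 2>&1; then
    echo "not logged in; run oc login first" >&2
    return 2
  fi
  # Print the active project so it can be confirmed with the user.
  oc project -q
}
```

Calling preflight before Phase 1 prints the current project on success, which can then be confirmed or changed with oc project <name>.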
Phase 1: Triage
Get a broad picture of cluster health. Run these commands and analyze the output:
1.1 Identity and Context
Run:
oc whoami
oc project
oc version
Note the cluster version and current project.
1.2 Node Health
Run:
oc get nodes -o wide
oc adm top nodes
Look for:
- Nodes in NotReady or SchedulingDisabled state
- Nodes with high CPU or memory utilization (>85%)
- Uneven resource distribution across nodes
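The 85% utilization check can be applied mechanically to the oc adm top nodes output. The sketch below assumes the usual column layout (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); the function name hot_nodes is hypothetical.

```shell
# Hypothetical helper: reads `oc adm top nodes` output on stdin and prints
# the name, CPU% and MEMORY% of every node above the threshold (default 85).
hot_nodes() {
  awk -v limit="${1:-85}" '
    NR > 1 {                      # skip the header row
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)
      if (cpu + 0 > limit + 0 || mem + 0 > limit + 0) print $1, $3, $5
    }'
}
```

Usage: oc adm top nodes | hot_nodes, or oc adm top nodes | hot_nodes 70 for a stricter threshold. Nodes whose metrics are unavailable are not flagged here, so cross-check them against oc get nodes.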
1.3 Pod Health
Run:
oc get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers
Look for:
- CrashLoopBackOff: container crashing repeatedly
- ImagePullBackOff: image registry or reference issue
- Pending: scheduling problem (resource constraints, node affinity, taints)
- Error: container exited with error
- Terminating (stuck): finalizer or PDB issue
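One caveat with the phase-based field selector: a pod whose container is in CrashLoopBackOff usually still reports phase Running, so it can slip past that filter. A complementary sketch is to filter on the STATUS column of an unfiltered listing instead. The function name abnormal_pods is hypothetical, and the column positions assume the default --all-namespaces output (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `oc get pods --all-namespaces --no-headers`
# on stdin and prints namespace/name plus status for every pod whose
# STATUS column is neither Running nor Completed.
abnormal_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}
```

Usage: oc get pods --all-namespaces --no-headers | abnormal_pods. This does not catch pods that show Running with unready containers; the READY column still needs a manual look.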
1.4 Recent Events
Run:
oc get events --sort-by=.lastTimestamp -n <namespace>
If investigating cluster-wide, also check key namespaces:
oc get events --sort-by=.lastTimestamp -n openshift-kube-apiserver
oc get events --sort-by=.lastTimestamp -n openshift-etcd
Look for: Warning events, repeated events, FailedScheduling, FailedMount, BackOff, Unhealthy, OOMKilled.
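To spot repeated events quickly, the event reasons can be tallied. The sketch below assumes one reason per input line; the function name top_event_reasons is hypothetical.

```shell
# Hypothetical helper: reads one event reason per line on stdin and prints
# "reason count" pairs, most frequent first.
top_event_reasons() {
  sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}
```

Usage: oc get events -n <namespace> -o custom-columns=REASON:.reason --no-headers | top_event_reasons. Adding --field-selector type=Warning to the oc get events call narrows the tally to Warning events.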
1.5 Project Overview
Run:
oc status -v
This shows deployments, services, routes, and warnings in the current project.
Triage Summary
After running Phase 1, summarize:
- Which nodes are healthy vs. unhealthy
- Which pods are in abnormal states and in which namespaces
- What events suggest about the timeline and cause
- What to investigate deeper in Phase 2
Phase 2: Deep Investigation
Based on Phase 1 findings, drill into specific failing resources. Choose the relevant subsections.
2.1 Pod Failures
For each failing pod identified in Phase 1:
Run oc describe pod <pod-name> -n <namespace>
Check the Events section at the bottom, Conditions, and Status fields.
Get logs:
- oc logs <pod-name> -n <namespace>
- oc logs <pod-name> -n <namespace> --previous (logs from crashed container)
- oc logs <pod-name> -n <namespace> -c <container> (specific container in multi-container pod)
- oc logs <pod-name> -n <namespace> --timestamps
If the pod is running but misbehaving:
- oc exec <pod-name> -n <namespace> -- cat /etc/resolv.conf (DNS config)
- oc exec <pod-name> -n <namespace> -- env (environment vars)
- oc exec <pod-name> -n <namespace> -- ls -la /app (filesystem check)
- oc exec <pod-name> -n <namespace> -- curl -s localhost:8080/health (health endpoint)
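The describe and logs steps in 2.1 can be bundled so that the evidence for each failing pod ends up in one place. This is only a sketch: the function name collect_pod_evidence, the default ./evidence directory, and the file names are all assumptions, not part of the workflow above.

```shell
# Hypothetical helper: saves describe output plus current and previous
# logs for one pod. Arguments: pod name, namespace, optional output dir.
collect_pod_evidence() {
  pod=$1
  ns=$2
  out=${3:-./evidence}
  mkdir -p "$out"
  oc describe pod "$pod" -n "$ns" > "$out/$pod.describe.txt" 2>&1
  oc logs "$pod" -n "$ns" --timestamps > "$out/$pod.log" 2>&1
  # --previous fails when no earlier container instance exists; keep going.
  oc logs "$pod" -n "$ns" --previous --timestamps > "$out/$pod.previous.log" 2>&1 || true
}
```

Usage: collect_pod_evidence <pod-name> <namespace>. The saved files can later be quoted in the Evidence section of the report.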
2.2 Deployment Issues
Run:
- oc get dc <name> -o yaml -n <namespace> (DeploymentConfig)
- oc get deployment <name> -o yaml -n <namespace> (Deployment)
- oc describe dc/<name> -n <namespace>
- oc rollout history dc/<name> -n <namespace>
Check: replicas desired vs. available, image references, resource limits, volume mounts, environment variables.
2.3 Node Issues
For nodes identified as unhealthy in Phase 1:
Run oc describe node <node-name>
Look for:
- Conditions section: MemoryPressure, DiskPressure, PIDPressure, Ready
- Allocatable vs Capacity: resource headroom
- Non-terminated Pods: what's running on this node
- Taints: anything preventing scheduling
2.4 Service and Route Issues
Run:
- oc get svc -n <namespace>
- oc describe svc/<name> -n <namespace>
- oc get endpoints <name> -n <namespace> (are pods registered?)
- oc get route -n <namespace>
- oc describe route/<name> -n <namespace>
Check: endpoint list matches expected pods, route host and TLS config, service selector matches pod labels.
2.5 Storage Issues
Run:
- oc get pvc -n <namespace>
- oc describe pvc/<name> -n <namespace>
- oc get pv
Check: PVC status (Bound vs Pending), storage class, access modes, capacity.
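The Bound vs Pending check can be reduced to a one-line filter over the PVC listing. The sketch assumes STATUS is the second column of oc get pvc output; the function name unbound_pvcs is hypothetical.

```shell
# Hypothetical helper: reads `oc get pvc --no-headers` on stdin and prints
# the name and status of every claim that is not Bound.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1, $2 }'
}
```

Usage: oc get pvc -n <namespace> --no-headers | unbound_pvcs. Each claim it prints is a candidate for oc describe pvc/<name>.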
2.6 Remote Shell and Debug Pod
For interactive access to a running container:
oc rsh <pod-name> -n <namespace>(open a remote shell session)
When you need a debug pod (clean environment, not the running container):
- oc debug deployment/<name> -n <namespace>
- oc debug deployment/<name> -n <namespace> --as-root (if root access needed)
- oc debug node/<node-name> (debug at the node level)
Investigation Summary
After Phase 2, document:
- Root cause hypothesis (or top 2-3 candidates)
- Evidence supporting each hypothesis
- What additional context is needed from Phase 3
Phase 3: Context Gathering
Broaden the investigation to cluster-level systems and security context.
3.1 Cluster Operator Health
Run oc get clusteroperators
Look for operators with AVAILABLE=False, DEGRADED=True, or PROGRESSING=True. For any unhealthy operator:
Run oc describe clusteroperator <name>
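The operator health check in 3.1 can be filtered the same way. This sketch assumes every operator reports a version, so that the columns are NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED in that order; the function name unhealthy_operators is hypothetical.

```shell
# Hypothetical helper: reads `oc get clusteroperators --no-headers` on
# stdin and prints operators that are unavailable, progressing or degraded.
unhealthy_operators() {
  awk '$3 != "True" || $4 == "True" || $5 == "True" {
    print $1, "available=" $3, "progressing=" $4, "degraded=" $5
  }'
}
```

Usage: oc get clusteroperators --no-headers | unhealthy_operators. If an operator has an empty VERSION column the fields shift, so confirm any hit with oc describe clusteroperator <name>.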
3.2 Cluster Debugging Data
For deeper operator investigation:
Run oc adm inspect clusteroperator/<name> --dest-dir=./inspect-data
This collects logs, events, and resource definitions for the operator. Review the output files.
3.3 Node-Level Logs
Run:
- oc adm node-logs <node-name> -u kubelet (kubelet logs)
- oc adm node-logs <node-name> -u crio (container runtime logs)
- oc adm node-logs <node-name> --path=journal (full journal)
3.4 Resource Utilization
Run:
- oc adm top pods -n <namespace> --containers (per-container CPU/memory)
- oc adm top pods --all-namespaces --sort-by=memory (cluster-wide memory hogs)
3.5 Security Context Review
If pods fail due to permission issues:
Run:
- oc get scc
- oc describe scc <name>
- oc adm policy scc-subject-review -f <pod-spec.yaml>
- oc auth can-i --list -n <namespace>
- oc auth can-i create pods -n <namespace> --as=system:serviceaccount:<namespace>:<sa-name>
3.6 Network and DNS
Run:
- oc get networkpolicy -n <namespace>
- oc describe networkpolicy <name> -n <namespace>
- oc get dns.operator/default -o yaml
3.7 Must-Gather (Comprehensive)
For escalation-level investigation:
Run oc adm must-gather --dest-dir=./must-gather-data
This launches pods that collect comprehensive debugging data across the cluster. Only run this when other phases haven't been sufficient.
Context Summary
After Phase 3, update your hypothesis:
- Does cluster operator status explain the issue?
- Are node-level logs revealing?
- Is this a security/RBAC issue?
- Is this a resource exhaustion issue?
Phase 4: Report
Produce a structured investigation report. Create the investigation-reports directory if it doesn't exist:
Run mkdir -p ./investigation-reports
Save the report to ./investigation-reports/YYYY-MM-DD-<topic>.md using this template:
Report Template
# Incident Investigation Report
**Date:** YYYY-MM-DD
**Investigator:** [agent]
**Cluster:** [cluster URL from oc whoami --show-server]
**Project/Namespace:** [namespace]
## Summary
[2-3 sentence summary of what happened and the root cause]
## Timeline
| Time | Event | Source |
|------|-------|--------|
| ... | ... | oc get events / oc logs |
## Affected Resources
| Resource | Namespace | Status | Issue |
|----------|-----------|--------|-------|
| pod/xxx | ns | CrashLoopBackOff | OOMKilled |
## Root Cause
[Detailed explanation of the root cause with evidence]
## Evidence
[Key command outputs that support the root cause determination]
## Recommended Next Steps
1. [Immediate action]
2. [Follow-up action]
3. [Preventive measure]
Write the report and inform the user of its location.
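The YYYY-MM-DD-<topic>.md naming convention can be produced consistently with a small helper. This is a sketch under assumptions: the function name report_path and the way the topic is turned into a slug (lowercase, non-alphanumerics collapsed to hyphens) are illustrative choices, not requirements of the template.

```shell
# Hypothetical helper: builds the report path from a free-text topic.
# The optional second argument overrides today's date (YYYY-MM-DD).
report_path() {
  topic=$1
  day=${2:-$(date +%F)}
  slug=$(printf '%s' "$topic" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//')
  printf './investigation-reports/%s-%s.md\n' "$day" "$slug"
}
```

Usage: after mkdir -p ./investigation-reports, write the report to the path printed by report_path "<topic>".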
Guardrails
- Read-only commands (get, describe, logs, adm top, adm inspect, auth can-i): Execute freely without asking.
- Destructive or state-changing commands (delete, drain, cordon, scale, rollback, restart): Always ask the user for confirmation before running.
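The guardrail can be approximated with a check on the words of a proposed command. This is a rough sketch only: the function name needs_confirmation is hypothetical, the verb list is copied from the guardrail above, and matching whole words can misfire when a resource happens to be named after a verb.

```shell
# Hypothetical helper: succeeds (exit 0) when any word of the command is
# one of the state-changing verbs listed in the guardrail.
needs_confirmation() {
  for word in "$@"; do
    case "$word" in
      delete|drain|cordon|scale|rollback|restart) return 0 ;;
    esac
  done
  return 1
}
```

Usage: needs_confirmation oc adm drain <node-name> succeeds, so the user should be asked first; needs_confirmation oc get pods fails, so the command can run directly.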
Adaptive Behavior
- If Phase 1 reveals node issues: Focus Phase 2 on node debugging (2.3), skip pod-specific steps unless pods are also affected.
- If Phase 1 reveals only pod failures in one namespace: Skip cluster-wide checks, focus on that namespace's pods (2.1), deployments (2.2), and services (2.4).
- If Phase 2 suggests security/permission issues: Prioritize Phase 3 security context review (3.5).
- If the user provides a specific pod/deployment name: Skip Phase 1 triage and go directly to Phase 2 for that resource, then expand to Phase 1 if the root cause isn't clear.
- If oc commands fail with permission errors: Run oc auth can-i --list to understand available permissions and adapt commands accordingly.