openshift-cli - SKILL.md Agent Skill

name: openshift-cli
description: >
  Investigate incidents on OpenShift clusters using the oc CLI. Use when debugging pod failures, node issues, cluster health problems, or any OpenShift infrastructure incident. Executes oc commands directly and produces structured investigation reports.
allowed-tools:
  - Bash
  - Read
  - Write
  - Grep

OpenShift CLI Incident Investigation

You investigate incidents on OpenShift clusters by executing oc commands and analyzing output. Follow the phased workflow below systematically.

Prerequisites

Before starting any investigation, verify the environment:

1. Check oc is installed: which oc
   - If missing, tell the user: "The oc CLI is not installed. Install it from https://mirror.openshift.com/pub/openshift-v4/clients/ocp/latest/ or run brew install openshift-cli on macOS."
2. Check login status: oc whoami
   - If not logged in, ask the user for cluster URL and credentials, then run oc login.
3. Confirm project context: oc project
   - If the user wants a specific project, run oc project <name>.

If any prerequisite fails, resolve it before proceeding. Do not skip this step.
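The three checks above can be chained so that the first failure stops the investigation. The following is a minimal sketch; the function name preflight and its return codes are illustrative choices, not part of the oc CLI.

```shell
# Preflight sketch: each prerequisite maps to a distinct non-zero return code.
# The function name and return codes are hypothetical, chosen for illustration.
preflight() {
  if ! command -v oc >/dev/null 2>&1; then
    echo "oc CLI not found on PATH" >&2
    return 1
  fi
  if ! user=$(oc whoami 2>/dev/null); then
    echo "not logged in; run: oc login <cluster-url>" >&2
    return 2
  fi
  # 'oc project -q' prints only the name of the current project.
  if ! project=$(oc project -q 2>/dev/null); then
    echo "no current project; run: oc project <name>" >&2
    return 3
  fi
  echo "user=${user} project=${project}"
}
```

Calling preflight before Phase 1 and stopping on a non-zero return code keeps later commands from failing with less obvious errors.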

Phase 1: Triage

Get a broad picture of cluster health. Run these commands and analyze the output:

1.1 Identity and Context

Run:

oc whoami
oc project
oc version

Note the cluster version and current project.

1.2 Node Health

Run:

oc get nodes -o wide
oc adm top nodes

Look for:

Nodes in NotReady or SchedulingDisabled state
Nodes with high CPU or memory utilization (>85%)
Uneven resource distribution across nodes
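On a large cluster the node checks above are easier to apply with a filter than by eye. The sketch below assumes the default column order of oc get nodes (NAME STATUS ...) and oc adm top nodes (NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%); the function names are hypothetical.

```shell
# Reads 'oc get nodes --no-headers' output on stdin.
# Prints nodes whose STATUS is not exactly "Ready"
# (this catches NotReady and Ready,SchedulingDisabled).
unhealthy_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Reads 'oc adm top nodes --no-headers' output on stdin.
# Prints nodes whose CPU% or MEMORY% exceeds the limit (default 85).
hot_nodes() {
  awk -v limit="${1:-85}" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)
    if (cpu + 0 > limit + 0 || mem + 0 > limit + 0) print $1, "cpu=" $3, "mem=" $5
  }'
}
```

Typical use would be oc get nodes --no-headers | unhealthy_nodes and oc adm top nodes --no-headers | hot_nodes 85.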

1.3 Pod Health

Run:

oc get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers

Look for:

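CrashLoopBackOff: container crashing repeatedly (these pods usually still report phase Running, so the phase filter above can miss them; rerun without the field selector if a crash loop is suspected)
ImagePullBackOff: image registry or reference issue
Pending: scheduling problem (resource constraints, node affinity, taints)
Error: container exited with error
Stuck in Terminating: pending finalizer or unreachable node (a PDB blocks evictions during a drain rather than deletion itself)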
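A pod's phase stays Running while one of its containers is restarting or not ready, so a phase-based field selector does not surface every state in the list above. One way to catch these is to filter on the STATUS and READY columns instead. The sketch assumes the default columns of oc get pods --all-namespaces (NAMESPACE NAME READY STATUS ...); the function name is hypothetical.

```shell
# Reads 'oc get pods --all-namespaces --no-headers' output on stdin.
# Skips Completed pods, then prints every pod whose STATUS is not Running
# or whose READY count (e.g. 1/2) shows containers that are not ready.
abnormal_pods() {
  awk '
    $4 == "Completed" { next }
    {
      split($3, ready, "/")
      if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2, $4, $3
    }'
}
```

Typical use would be oc get pods --all-namespaces --no-headers | abnormal_pods.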

1.4 Recent Events

Run:

oc get events --sort-by=.lastTimestamp -n <namespace>

If investigating cluster-wide, also check key namespaces:

oc get events --sort-by=.lastTimestamp -n openshift-kube-apiserver
oc get events --sort-by=.lastTimestamp -n openshift-etcd

Look for: Warning events, repeated events, FailedScheduling, FailedMount, BackOff, Unhealthy, OOMKilled.
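Repeated events stand out more clearly when counted by reason. The sketch below assumes the reasons arrive one per line, for example from oc get events -n <namespace> --field-selector type=Warning -o custom-columns=REASON:.reason --no-headers; the function name is hypothetical.

```shell
# Reads one event reason per line on stdin.
# Prints "<count> <reason>" with the most frequent reason first.
count_reasons() {
  awk '{ n[$1]++ } END { for (r in n) print n[r], r }' | sort -rn
}
```

A reason that dominates the count (BackOff, FailedMount, and so on) is usually the best starting point for Phase 2.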

1.5 Project Overview

Run:

oc status -v

This shows deployments, services, routes, and warnings in the current project.

Triage Summary

After running Phase 1, summarize:

Which nodes are healthy vs. unhealthy
Which pods are in abnormal states and in which namespaces
What events suggest about the timeline and cause
What to investigate deeper in Phase 2

Phase 2: Deep Investigation

Based on Phase 1 findings, drill into specific failing resources. Choose the relevant subsections.

2.1 Pod Failures

For each failing pod identified in Phase 1:

Run oc describe pod <pod-name> -n <namespace>

Check the Events section at the bottom, Conditions, and Status fields.

Get logs:

oc logs <pod-name> -n <namespace>
oc logs <pod-name> -n <namespace> --previous (logs from crashed container)
oc logs <pod-name> -n <namespace> -c <container> (specific container in multi-container pod)
oc logs <pod-name> -n <namespace> --timestamps

If the pod is running but misbehaving:

oc exec <pod-name> -n <namespace> -- cat /etc/resolv.conf (DNS config)
oc exec <pod-name> -n <namespace> -- env (environment vars)
oc exec <pod-name> -n <namespace> -- ls -la /app (filesystem check; /app is an example path)
oc exec <pod-name> -n <namespace> -- curl -s localhost:8080/health (health endpoint; port and path are examples, and curl must exist in the image)
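Because the report in Phase 4 needs command output as evidence, it can help to save the describe and log output for each failing pod as it is gathered. This is a minimal sketch using only the commands listed above; the function name, the ./evidence default directory, and the file naming are assumptions made for illustration.

```shell
# Saves describe output, current logs, and previous-container logs for one pod.
# Arguments: <namespace> <pod-name> [output-dir]; prints the output directory.
collect_pod_evidence() {
  ns=$1; pod=$2; dir=${3:-./evidence}
  mkdir -p "$dir" || return 1
  oc describe pod "$pod" -n "$ns" > "$dir/$pod.describe.txt" 2>&1
  oc logs "$pod" -n "$ns" --timestamps > "$dir/$pod.logs.txt" 2>&1
  # --previous fails when no earlier container instance exists;
  # the error text is kept in the file rather than aborting.
  oc logs "$pod" -n "$ns" --previous --timestamps > "$dir/$pod.previous.txt" 2>&1 || true
  echo "$dir"
}
```

For example, collect_pod_evidence my-namespace my-pod would write three files under ./evidence.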

2.2 Deployment Issues

Run:

oc get dc <name> -o yaml -n <namespace> (DeploymentConfig)
oc get deployment <name> -o yaml -n <namespace> (Deployment)
oc describe dc/<name> -n <namespace> (use deployment/<name> for a Deployment)
oc rollout history dc/<name> -n <namespace> (use deployment/<name> for a Deployment)

Check: replicas desired vs. available, image references, resource limits, volume mounts, environment variables.
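The desired-versus-available comparison can be read directly from the resource with a JSONPath query instead of scanning the full YAML. The sketch assumes its input comes from oc get deployment <name> -n <namespace> -o jsonpath='{.spec.replicas} {.status.availableReplicas}'; the function name is hypothetical.

```shell
# Reads "<desired> <available>" on stdin; the second field is empty
# when the resource reports no available replicas.
replica_gap() {
  # read returns non-zero when the input has no trailing newline,
  # which is the case for jsonpath output; the values are still assigned.
  read -r desired available || true
  desired=${desired:-0}
  available=${available:-0}
  echo "desired=${desired} available=${available} missing=$((desired - available))"
}
```

A non-zero missing value points back to the pod-level checks in 2.1 for the replicas that did not come up.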

2.3 Node Issues

For nodes identified as unhealthy in Phase 1:

Run oc describe node <node-name>

Look for:

Conditions section: MemoryPressure, DiskPressure, PIDPressure, Ready
Allocatable vs. Allocated resources: scheduling headroom (the gap between Capacity and Allocatable is what the system reserves)
Non-terminated Pods: what's running on this node
Taints: anything preventing scheduling
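The node conditions can also be pulled out on their own and checked mechanically: Ready should be True, and every pressure-style condition should be False. The sketch assumes input lines of the form TYPE=STATUS, for example from oc get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'; the function name is hypothetical.

```shell
# Reads "Type=Status" lines on stdin and prints the conditions that are
# not in their healthy state (Ready=True, everything else =False).
# A status of Unknown is reported as unhealthy in both cases.
bad_conditions() {
  awk -F= '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False") { print }'
}
```

Empty output means every reported condition is in its healthy state.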

2.4 Service and Route Issues

Run:

oc get svc -n <namespace>
oc describe svc/<name> -n <namespace>
oc get endpoints <name> -n <namespace> (are pods registered?)
oc get route -n <namespace>
oc describe route/<name> -n <namespace>

Check: endpoint list matches expected pods, route host and TLS config, service selector matches pod labels.
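A service with no ready endpoints is most often a selector that does not match any pod labels, or pods that are not passing readiness. The following sketch strings the checks together; the function name and messages are hypothetical, and it prints the selector and pod labels side by side rather than attempting to compare them automatically.

```shell
# Arguments: <namespace> <service-name>
# Returns 0 when the service has at least one ready endpoint address,
# otherwise prints the selector and pod labels for comparison and returns 1.
check_endpoints() {
  ns=$1; svc=$2
  ips=$(oc get endpoints "$svc" -n "$ns" -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -n "$ips" ]; then
    echo "ready endpoints: $ips"
    return 0
  fi
  echo "no ready endpoints for service $svc"
  echo "selector: $(oc get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector}')"
  # Show pod labels so the selector can be compared by eye.
  oc get pods -n "$ns" --show-labels
  return 1
}
```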

2.5 Storage Issues

Run:

oc get pvc -n <namespace>
oc describe pvc/<name> -n <namespace>
oc get pv

Check: PVC status (Bound vs Pending), storage class, access modes, capacity.

2.6 Remote Shell and Debug Pod

For interactive access to a running container:

oc rsh <pod-name> -n <namespace> (open a remote shell session)

When you need a debug pod (clean environment, not the running container):

oc debug deployment/<name> -n <namespace>
oc debug deployment/<name> -n <namespace> --as-root (if root access needed)
oc debug node/<node-name> (debug at the node level)

Investigation Summary

After Phase 2, document:

Root cause hypothesis (or top 2-3 candidates)
Evidence supporting each hypothesis
What additional context is needed from Phase 3

Phase 3: Context Gathering

Broaden the investigation to cluster-level systems and security context.

3.1 Cluster Operator Health

Run oc get clusteroperators

Look for operators with AVAILABLE=False, DEGRADED=True, or PROGRESSING=True (expected while an upgrade is rolling out, worth a closer look otherwise). For any unhealthy operator:

Run oc describe clusteroperator <name>
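The operator check above can be reduced to a filter that prints only the operators needing a describe. The sketch assumes the default column order of oc get clusteroperators (NAME VERSION AVAILABLE PROGRESSING DEGRADED ...) with the VERSION column populated; the function name is hypothetical.

```shell
# Reads 'oc get clusteroperators --no-headers' output on stdin.
# Prints operators that are not Available, are Progressing, or are Degraded.
unhealthy_operators() {
  awk '$3 != "True" || $4 != "False" || $5 != "False" {
    print $1, "available=" $3, "progressing=" $4, "degraded=" $5
  }'
}
```

Typical use would be oc get clusteroperators --no-headers | unhealthy_operators, followed by oc describe clusteroperator for each name printed.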

3.2 Cluster Debugging Data

For deeper operator investigation:

Run oc adm inspect clusteroperator/<name> --dest-dir=./inspect-data

This collects logs, events, and resource definitions for the operator. Review the output files.

3.3 Node-Level Logs

Run:

oc adm node-logs <node-name> -u kubelet (kubelet logs)
oc adm node-logs <node-name> -u crio (container runtime logs)
oc adm node-logs <node-name> --path=journal (full journal)

3.4 Resource Utilization

Run:

oc adm top pods -n <namespace> --containers (per-container CPU/memory)
oc adm top pods --all-namespaces --sort-by=memory (cluster-wide memory hogs)

3.5 Security Context Review

If pods fail due to permission issues:

Run:

oc get scc
oc describe scc <name>
oc adm policy scc-subject-review -f <pod-spec.yaml>
oc auth can-i --list -n <namespace>
oc auth can-i create pods -n <namespace> --as=system:serviceaccount:<namespace>:<sa-name>

3.6 Network and DNS

Run:

oc get networkpolicy -n <namespace>
oc describe networkpolicy <name> -n <namespace>
oc get dns.operator/default -o yaml

3.7 Must-Gather (Comprehensive)

For escalation-level investigation:

Run oc adm must-gather --dest-dir=./must-gather-data

This launches pods that collect comprehensive debugging data across the cluster. Only run this when other phases haven't been sufficient.

Context Summary

After Phase 3, update your hypothesis:

Does cluster operator status explain the issue?
Are node-level logs revealing?
Is this a security/RBAC issue?
Is this a resource exhaustion issue?

Phase 4: Report

Produce a structured investigation report. Create the investigation-reports directory if it doesn't exist:

Run mkdir -p ./investigation-reports

Save the report to ./investigation-reports/YYYY-MM-DD-<topic>.md using this template:

Report Template

# Incident Investigation Report

**Date:** YYYY-MM-DD
**Investigator:** [agent]
**Cluster:** [cluster URL from oc whoami --show-server]
**Project/Namespace:** [namespace]

## Summary

[2-3 sentence summary of what happened and the root cause]

## Timeline

| Time | Event | Source |
|------|-------|--------|
| ... | ... | oc get events / oc logs |

## Affected Resources

| Resource | Namespace | Status | Issue |
|----------|-----------|--------|-------|
| pod/xxx | ns | CrashLoopBackOff | OOMKilled |

## Root Cause

[Detailed explanation of the root cause with evidence]

## Evidence

[Key command outputs that support the root cause determination]

## Recommended Next Steps

1. [Immediate action]
2. [Follow-up action]
3. [Preventive measure]

Write the report and inform the user of its location.
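The YYYY-MM-DD-<topic>.md file name can be derived from a free-text topic. This is a minimal sketch; the function name and the slug rules (lowercase, runs of non-alphanumeric characters collapsed to a single hyphen) are assumptions, since the skill does not specify how the topic should be normalized.

```shell
# Argument: free-text topic. Prints the report path for today's date.
report_path() {
  slug=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//')
  echo "./investigation-reports/$(date +%Y-%m-%d)-${slug}.md"
}
```

For example, report_path "OOM in API pods" yields a path ending in -oom-in-api-pods.md under ./investigation-reports.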

Guardrails

Read-only commands (get, describe, logs, adm top, adm inspect, auth can-i): Execute freely without asking.
Destructive or state-changing commands (delete, drain, cordon, scale, rollback, restart): Always ask the user for confirmation before running.
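The two guardrail rules can be expressed as a small classifier that is consulted before any oc command runs. The sketch below is one possible mapping of the verbs listed above onto oc subcommands (drain and cordon under oc adm, restart under oc rollout); the function name is hypothetical, and the inclusion of rollout undo as a rollback is an assumption.

```shell
# Arguments: the oc subcommand and its arguments, without the leading "oc".
# Returns 0 when the command changes state and needs user confirmation,
# 1 when it can run without asking.
needs_confirmation() {
  case "$*" in
    delete*|scale*|rollback*|"adm drain"*|"adm cordon"*|"rollout restart"*|"rollout undo"*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

A wrapper could call needs_confirmation delete pod <pod-name> and prompt the user only when it returns 0. Commands not matched by the list fall through as runnable, so the list should be extended to cover any other state-changing command before relying on it.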

Adaptive Behavior

If Phase 1 reveals node issues: Focus Phase 2 on node debugging (2.3), skip pod-specific steps unless pods are also affected.
If Phase 1 reveals only pod failures in one namespace: Skip cluster-wide checks, focus on that namespace's pods (2.1), deployments (2.2), and services (2.4).
If Phase 2 suggests security/permission issues: Prioritize Phase 3 security context review (3.5).
If the user provides a specific pod/deployment name: Skip Phase 1 triage and go directly to Phase 2 for that resource, then expand to Phase 1 if the root cause isn't clear.
If oc commands fail with permission errors: Run oc auth can-i --list to understand available permissions and adapt commands accordingly.