kubernetes-debug

star 1

Kubernetes debugging methodology and scripts. Use for pod crashes, CrashLoopBackOff, OOMKilled, deployment issues, resource problems, or container failures.

incidentfox By incidentfox schedule Updated 2/23/2026

name: kubernetes-debug description: Kubernetes debugging methodology and scripts. Use for pod crashes, CrashLoopBackOff, OOMKilled, deployment issues, resource problems, or container failures. allowed-tools: Bash(python *)

Kubernetes Debugging

Core Principle: Events Before Logs

ALWAYS check pod events BEFORE logs. Events explain 80% of issues faster:

  • OOMKilled → Memory limit exceeded
  • ImagePullBackOff → Image not found or auth issue
  • FailedScheduling → No nodes with enough resources
  • CrashLoopBackOff → Container crashing repeatedly

Available Scripts

All scripts are in .claude/skills/infrastructure-kubernetes/scripts/

list_pods.py - List pods with status

python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n default
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n default --label app=myapp

get_events.py - Get pod events (USE FIRST!)

python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace>

# Example:
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py my-pod-7f8b9c6d5-x2k4m -n default

get_logs.py - Get pod logs

python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py my-pod-7f8b9c6d5-x2k4m -n default --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py my-pod-7f8b9c6d5-x2k4m -n default --container mycontainer

describe_pod.py - Detailed pod info

python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace>

get_resources.py - Resource usage vs limits

python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>

Debugging Workflows

Pod Not Starting (Pending/CrashLoopBackOff)

  1. list_pods.py - Check pod status
  2. get_events.py - Look for scheduling/pull/crash events
  3. describe_pod.py - Check conditions and container states
  4. get_logs.py - Only if events don't explain

Pod Restarting (OOMKilled/Crashes)

  1. get_events.py - Check for OOMKilled or error events
  2. get_resources.py - Compare usage vs limits
  3. get_logs.py - Check for errors before crash
  4. describe_pod.py - Check restart count and state

Deployment Not Progressing

  1. list_pods.py - Find stuck pods
  2. get_events.py - Check events on stuck pods
  3. describe_pod.py - Check conditions for clues
  4. get_resources.py - Check if resource constraints are blocking

Common Issues & Solutions

Event Reason Meaning Action
OOMKilled Container exceeded memory limit Increase limits or fix memory leak
ImagePullBackOff Can't pull image Check image name, registry auth
CrashLoopBackOff Container keeps crashing Check logs for startup errors
FailedScheduling No node can run pod Check node resources, taints
Unhealthy Liveness probe failed Check probe config, app health

Output Format

When reporting findings, use this structure:

## Kubernetes Analysis

**Pod**: <name>
**Namespace**: <namespace>
**Status**: <phase> (Restarts: N)

### Events
- [timestamp] <reason>: <message>

### Issues Found
1. [Issue description with evidence]

### Root Cause Hypothesis
[Based on events and logs]

### Recommended Action
[Specific remediation step]
Install via CLI
npx skills add https://github.com/incidentfox/self-learning-ai-agent --skill kubernetes-debug
Repository Details
star Stars 1
call_split Forks 2
navigation Branch main
article Path SKILL.md
More from Creator