name: kubernetes-debug
description: Kubernetes debugging methodology and scripts. Use for pod crashes, CrashLoopBackOff, OOMKilled, deployment issues, resource problems, or container failures.
allowed-tools: Bash(python *)
Kubernetes Debugging
Core Principle: Events Before Logs
ALWAYS check pod events BEFORE logs. Events explain about 80% of issues, and faster than logs do:
- OOMKilled → Memory limit exceeded
- ImagePullBackOff → Image not found or auth issue
- FailedScheduling → No node can run the pod (insufficient resources, taints)
- CrashLoopBackOff → Container crashing repeatedly
Available Scripts
All scripts are in .claude/skills/infrastructure-kubernetes/scripts/
list_pods.py - List pods with status
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n default
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n default --label app=myapp
get_events.py - Get pod events (USE FIRST!)
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace>
# Example:
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py my-pod-7f8b9c6d5-x2k4m -n default
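The internals of get_events.py are not shown here. As a rough sketch only, assuming the script wraps `kubectl`, the command for fetching one pod's events could be built as follows. The function name `build_events_command` is hypothetical and not part of the skill.

```python
def build_events_command(pod_name: str, namespace: str) -> list[str]:
    """Build a kubectl command that lists the events for a single pod."""
    return [
        "kubectl", "get", "events",
        "-n", namespace,
        # Keep only events whose involved object is this pod.
        "--field-selector",
        f"involvedObject.kind=Pod,involvedObject.name={pod_name}",
        # Sort by last-seen time so the newest event is printed last.
        "--sort-by=.lastTimestamp",
    ]


if __name__ == "__main__":
    print(" ".join(build_events_command("my-pod-7f8b9c6d5-x2k4m", "default")))
```

Returning the command as a list (rather than one shell string) lets it be passed to `subprocess.run` without shell quoting issues.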
get_logs.py - Get pod logs
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py my-pod-7f8b9c6d5-x2k4m -n default --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py my-pod-7f8b9c6d5-x2k4m -n default --container mycontainer
describe_pod.py - Detailed pod info
python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace>
get_resources.py - Resource usage vs limits
python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>
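Comparing usage against limits means converting Kubernetes memory quantities such as `512Mi` to a common unit first. The sketch below shows one way to do that; it is an illustration, not the code inside get_resources.py. The helper names are hypothetical, and only the common suffixes are handled.

```python
# Binary (Ki, Mi, ...) and decimal (k, M, ...) suffixes for memory quantities.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already a byte count


def memory_usage_ratio(usage: str, limit: str) -> float:
    """Return usage as a fraction of the limit (1.0 means at the limit)."""
    return parse_memory(usage) / parse_memory(limit)


if __name__ == "__main__":
    ratio = memory_usage_ratio("480Mi", "512Mi")
    print(f"memory at {ratio:.0%} of limit")
```

A ratio close to 1.0 alongside an OOMKilled reason suggests the limit is too tight or the application is leaking memory.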
Debugging Workflows
Pod Not Starting (Pending/CrashLoopBackOff)
1. list_pods.py - Check pod status
2. get_events.py - Look for scheduling/pull/crash events
3. describe_pod.py - Check conditions and container states
4. get_logs.py - Only if events don't explain the failure
Pod Restarting (OOMKilled/Crashes)
1. get_events.py - Check for OOMKilled or error events
2. get_resources.py - Compare usage vs limits
3. get_logs.py - Check for errors before crash
4. describe_pod.py - Check restart count and state
Deployment Not Progressing
1. list_pods.py - Find stuck pods
2. get_events.py - Check events on stuck pods
3. describe_pod.py - Check conditions for clues
4. get_resources.py - Check if resource constraints are blocking
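The three workflows can also be written down as data, which makes the ordering easy to check. This is a minimal sketch: the dictionary keys, the `workflow_for` function, and the fallback to the deployment workflow for any other status are all assumptions made for illustration.

```python
# Script order for each workflow, as listed in this section.
WORKFLOWS = {
    "not_starting": ["list_pods.py", "get_events.py", "describe_pod.py", "get_logs.py"],
    "restarting": ["get_events.py", "get_resources.py", "get_logs.py", "describe_pod.py"],
    "deployment_stuck": ["list_pods.py", "get_events.py", "describe_pod.py", "get_resources.py"],
}


def workflow_for(status: str) -> list[str]:
    """Pick the script order for a pod status string."""
    if status in ("Pending", "CrashLoopBackOff"):
        return WORKFLOWS["not_starting"]
    if status == "OOMKilled":
        return WORKFLOWS["restarting"]
    # Assumption: anything else is treated as a stuck deployment.
    return WORKFLOWS["deployment_stuck"]


if __name__ == "__main__":
    for step, script in enumerate(workflow_for("Pending"), start=1):
        print(step, script)
```

In every workflow that uses both scripts, get_events.py comes before get_logs.py, matching the events-before-logs principle.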
Common Issues & Solutions
| Reason (event or container status) | Meaning | Action |
|---|---|---|
| OOMKilled | Container exceeded memory limit | Increase limits or fix memory leak |
| ImagePullBackOff | Can't pull image | Check image name, registry auth |
| CrashLoopBackOff | Container keeps crashing | Check logs for startup errors |
| FailedScheduling | No node can run the pod | Check node resources, taints |
| Unhealthy | Liveness, readiness, or startup probe failed | Check probe config, app health |
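The table lends itself to a simple lookup. The sketch below returns the suggested action for the first recognized reason and `None` otherwise, which is the signal to move on to logs. The names `ACTIONS` and `suggest_action` are hypothetical.

```python
from typing import Optional

# Reasons and actions taken from the table in this section.
ACTIONS = {
    "OOMKilled": "Increase limits or fix memory leak",
    "ImagePullBackOff": "Check image name, registry auth",
    "CrashLoopBackOff": "Check logs for startup errors",
    "FailedScheduling": "Check node resources, taints",
    "Unhealthy": "Check probe config, app health",
}


def suggest_action(reasons: list[str]) -> Optional[str]:
    """Return the action for the first known reason, or None if none match."""
    for reason in reasons:
        if reason in ACTIONS:
            return ACTIONS[reason]
    return None  # nothing recognized: fall back to reading logs


if __name__ == "__main__":
    print(suggest_action(["Scheduled", "Pulled", "OOMKilled"]))
```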
Output Format
When reporting findings, use this structure:
## Kubernetes Analysis
**Pod**: <name>
**Namespace**: <namespace>
**Status**: <phase> (Restarts: N)
### Events
- [timestamp] <reason>: <message>
### Issues Found
1. [Issue description with evidence]
### Root Cause Hypothesis
[Based on events and logs]
### Recommended Action
[Specific remediation step]
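If findings are collected programmatically, the report can be rendered from them. The following is a sketch under that assumption; `render_report` and its parameters are hypothetical, and the values in the usage example are placeholders, not real cluster output.

```python
def render_report(pod, namespace, phase, restarts, events, issues, hypothesis, action):
    """Format findings using the report structure described above.

    events is a list of (timestamp, reason, message) tuples;
    issues is a list of strings.
    """
    lines = [
        "## Kubernetes Analysis",
        f"**Pod**: {pod}",
        f"**Namespace**: {namespace}",
        f"**Status**: {phase} (Restarts: {restarts})",
        "### Events",
    ]
    lines += [f"- [{ts}] {reason}: {message}" for ts, reason, message in events]
    lines.append("### Issues Found")
    lines += [f"{n}. {issue}" for n, issue in enumerate(issues, start=1)]
    lines += ["### Root Cause Hypothesis", hypothesis]
    lines += ["### Recommended Action", action]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_report(
        pod="my-pod-7f8b9c6d5-x2k4m",
        namespace="default",
        phase="Running",
        restarts=3,
        events=[("<timestamp>", "OOMKilled", "<message>")],
        issues=["<issue description with evidence>"],
        hypothesis="<based on events and logs>",
        action="<specific remediation step>",
    ))
```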