k8s-namespace-troubleshooting - SKILL.md Agent Skill

name: k8s-namespace-troubleshooting
description: Use when a user reports a problem in a specific namespace, when investigating unhealthy workloads or pod failures, when checking namespace events and resource status, or when triaging application issues inside Kubernetes

Kubernetes Namespace Troubleshooting

Perform comprehensive, namespace-scoped investigation of Kubernetes workloads. Collect events, pod status, node health, resource consumption, configuration issues, and application logs, then summarise findings, diagnose root causes, and assess impact.

Keywords

kubernetes, namespace, troubleshoot, troubleshooting, debug, diagnose, investigate, problem, issue, error, unhealthy, failing, broken, pod, deployment, statefulset, daemonset, job, cronjob, service, ingress, events, logs, restart, crashloop, pending, oomkill, eviction, namespace health, application, workload

When to Use This Skill

A user says "I have a problem in namespace X" or "something is wrong in namespace X"
Pods are failing, restarting, pending, or evicted in a namespace
Services or ingresses are not responding
Applications are returning errors or degraded performance
Namespace events show warnings or errors
A general health check of everything running in a namespace is needed

Related Skills

k8s-platform-operations - Cluster-wide health checks and incident response
k8s-security-hardening - Security context and policy issues
k8s-platform-tenancy - Resource quotas and limit ranges
k8s-continual-improvement - SLOs and capacity planning
k8s-security-redteam - Security-related namespace issues
flux-troubleshooting - GitOps reconciliation failures
Shared: Pod Security Context
Shared: Network Policies

Quick Reference

Task	Command
Common workload resources in namespace (`get all` omits ConfigMaps, Secrets, Ingresses, and PVCs)	`kubectl get all -n ${NS}`
Warning events	`kubectl get events -n ${NS} --field-selector type=Warning --sort-by='.lastTimestamp'`
Unhealthy pods	`kubectl get pods -n ${NS} | grep -Ev 'Running|Completed'`
Pod resource usage	`kubectl top pods -n ${NS} --sort-by=memory`
Recent pod logs	`kubectl logs -n ${NS} deploy/${NAME} --tail=100`
Describe failing pod	`kubectl describe pod ${POD} -n ${NS}`

Investigation Workflow

This skill follows a five-phase workflow. Each phase builds on the previous one.

Phase 1: COLLECT     → Gather all namespace data (resources, events, status)
    ↓
Phase 2: SUMMARISE   → Produce a namespace health summary
    ↓
Phase 3: DIAGNOSE    → Identify root causes from collected evidence
    ↓
Phase 4: IMPACT      → Assess severity and blast radius
    ↓
Phase 5: RECOMMEND   → Present recommended actions for the user

Phase 1: Collect

Gather comprehensive data from the namespace. Run every command against the user-provided namespace, referred to as `${NS}` throughout. Collect the output of every section below before moving to Phase 2.
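Every command in this phase assumes `${NS}` is already set in the shell, which the skill does not show. The following is a minimal preflight sketch, assuming bash; the function names `valid_ns_name` and `preflight` are illustrative and not part of the skill.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers; the names are illustrative.

# Namespace names are RFC 1123 DNS labels: 1-63 characters, lowercase
# alphanumerics or '-', starting and ending with an alphanumeric.
valid_ns_name() {
  local LC_ALL=C          # keep [a-z] from matching uppercase in some locales
  local name="$1"
  [[ ${#name} -le 63 && "$name" =~ ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ ]]
}

# Fail early if the name is malformed or the namespace cannot be read,
# then export NS for the collection commands.
preflight() {
  local ns="$1"
  valid_ns_name "$ns" || { echo "invalid namespace name: ${ns}" >&2; return 2; }
  kubectl get namespace "$ns" >/dev/null || {
    echo "namespace not found or not readable: ${ns}" >&2
    return 1
  }
  export NS="$ns"
}
```

Calling `preflight payments-prod` (a made-up namespace name) before section 1.1 catches typos before a long collection run produces empty output.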

1.1 Namespace Status and Configuration

# Confirm namespace exists and check labels/annotations
kubectl get namespace ${NS} -o yaml

# Check resource quotas (may explain scheduling failures)
kubectl get resourcequota -n ${NS}
kubectl describe resourcequota -n ${NS}

# Check limit ranges (may explain OOMKills or throttling)
kubectl get limitrange -n ${NS}
kubectl describe limitrange -n ${NS}

1.2 Events (Critical — Always Check First)

# All events sorted by time (most recent last)
kubectl get events -n ${NS} --sort-by='.lastTimestamp'

# Warning events only
kubectl get events -n ${NS} --field-selector type=Warning --sort-by='.lastTimestamp'

# Event counts by reason to spot recurring issues (events with no count field are treated as 1)
kubectl get events -n ${NS} -o json | \
  jq -r '[.items[] | {reason: .reason, count: (.count // 1)}] | group_by(.reason) | map({reason: .[0].reason, total: (map(.count) | add)}) | sort_by(-.total) | .[] | "\(.total)\t\(.reason)"'

1.3 Pods — Status, Restarts, and Conditions

# Full pod listing with status
kubectl get pods -n ${NS} -o wide

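# Unhealthy pods (STATUS not Running or Completed; the header row is kept,
# and Running pods that are not Ready, e.g. 0/1, are not caught by this filter)
kubectl get pods -n ${NS} | grep -Ev 'Running|Completed'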

# High restart counts
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.containerStatuses[*]}{.restartCount}{" "}{end}{"\n"}{end}' | awk '{sum=0; for(i=2;i<=NF;i++) sum+=$i; if(sum>0) print $1"\t"sum" restarts"}' | sort -t$'\t' -k2 -rn

# OOMKilled containers
kubectl get pods -n ${NS} -o json | \
  jq -r '.items[] | .metadata.name as $pod | .status.containerStatuses[]? | select(.lastState.terminated.reason == "OOMKilled") | "\($pod)\t\(.name)\tOOMKilled"'

# Pending pods — why are they not scheduled?
kubectl get pods -n ${NS} --field-selector=status.phase=Pending -o name | \
  xargs -I{} kubectl describe {} -n ${NS} | grep -A5 "Events:"

# Container image details (spot wrong tags, missing images)
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{" "}{end}{"\n"}{end}'

1.4 Workload Controllers

# Deployments — check desired vs ready replicas
kubectl get deployments -n ${NS} -o wide

# Deployments not at full availability (compared with the desired count in spec,
# so Deployments whose pods were never created are also caught)
kubectl get deployments -n ${NS} -o json | \
  jq -r '.items[] | select((.status.availableReplicas // 0) != .spec.replicas) | "\(.metadata.name)\tavailable=\(.status.availableReplicas // 0)/\(.spec.replicas)"'

# StatefulSets
kubectl get statefulsets -n ${NS} -o wide

# DaemonSets — check desired vs ready
kubectl get daemonsets -n ${NS} -o wide

# Jobs — check for failures
kubectl get jobs -n ${NS}
kubectl get jobs -n ${NS} -o json | \
  jq -r '.items[] | select(.status.failed > 0) | "\(.metadata.name)\tfailed=\(.status.failed)"'

# CronJobs — check schedule and last run
kubectl get cronjobs -n ${NS}

1.5 Services, Endpoints, and Networking

# Services
kubectl get services -n ${NS} -o wide

# Endpoints — empty endpoints mean no backing pods
kubectl get endpoints -n ${NS}
kubectl get endpoints -n ${NS} -o json | \
  jq -r '.items[] | select((.subsets // []) | length == 0) | "\(.metadata.name)\tNO ENDPOINTS"'

# Ingresses
kubectl get ingress -n ${NS}

# Network policies (may be blocking traffic)
kubectl get networkpolicies -n ${NS}

1.6 Configuration — ConfigMaps, Secrets, PVCs

# ConfigMaps
kubectl get configmaps -n ${NS}

# Secrets (names only — never print secret data)
kubectl get secrets -n ${NS}

# PersistentVolumeClaims — check bound status and capacity
kubectl get pvc -n ${NS}
kubectl get pvc -n ${NS} -o json | \
  jq -r '.items[] | select(.status.phase != "Bound") | "\(.metadata.name)\t\(.status.phase)"'

1.7 Resource Consumption

# Pod CPU and memory (requires metrics-server)
kubectl top pods -n ${NS} --sort-by=memory
kubectl top pods -n ${NS} --sort-by=cpu

# Requests vs limits vs actual (identify over/under provisioning)
kubectl get pods -n ${NS} -o json | \
  jq -r '.items[] | .metadata.name as $pod | .spec.containers[] | "\($pod)\t\(.name)\treq_cpu=\(.resources.requests.cpu // "none")\tlim_cpu=\(.resources.limits.cpu // "none")\treq_mem=\(.resources.requests.memory // "none")\tlim_mem=\(.resources.limits.memory // "none")"'

1.8 Node Health (for Nodes Running Namespace Pods)

# Find which nodes host pods in this namespace
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u

# Check conditions on those nodes (-A8 covers the two header rows plus every condition row, including Ready)
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u | \
  xargs -I{} sh -c 'echo "--- {} ---" && kubectl describe node {} | grep -A8 "Conditions:"'

# Node resource pressure
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u | \
  xargs -I{} kubectl top node {}

1.9 Application Logs

# Current logs, previous logs, and error grep — single pass over all pods
for pod in $(kubectl get pods -n ${NS} -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== ${pod} ==="
  kubectl logs -n ${NS} ${pod} --all-containers --tail=100 2>&1
  echo "--- ${pod} (previous) ---"
  kubectl logs -n ${NS} ${pod} --all-containers --previous --tail=50 2>&1 || echo "No previous logs"
  echo "--- ${pod} (errors) ---"
  kubectl logs -n ${NS} ${pod} --all-containers --tail=200 2>/dev/null | \
    grep -iE 'error|exception|fatal|panic|timeout|refused|denied|failed|oom|kill' | \
    head -20
done

1.10 HPA and VPA (Autoscaling)

# Horizontal Pod Autoscalers
kubectl get hpa -n ${NS}
kubectl describe hpa -n ${NS}

# Vertical Pod Autoscalers (if installed)
kubectl get vpa -n ${NS} 2>/dev/null

1.11 Flux / GitOps Resources (If Present)

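# Check if any Flux resources target this namespace
# (stderr is silenced on kubectl so a missing CRD does not print an error;
#  the Ready condition is selected by type rather than by position)
kubectl get kustomizations.kustomize.toolkit.fluxcd.io -A -o json 2>/dev/null | \
  jq -r --arg ns "${NS}" '.items[] | select(.spec.targetNamespace == $ns or .metadata.namespace == $ns) | (.status.conditions // [] | map(select(.type == "Ready")) | first) as $c | "\(.metadata.namespace)/\(.metadata.name)\tReady=\($c.status // "Unknown")\tMessage=\($c.message // "")"'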

kubectl get helmreleases.helm.toolkit.fluxcd.io -n ${NS} 2>/dev/null

Phase 2: Summarise

After collecting data, produce a structured health summary. This summary must be presented to the user before deeper diagnosis.

Namespace Health Summary Template

## Namespace Health Summary: ${NS}

### Overview
- **Total pods**: X (Y running, Z unhealthy)
- **Deployments**: X/Y at desired replicas
- **StatefulSets**: X/Y at desired replicas
- **Jobs**: X succeeded, Y failed
- **Services**: X total, Y with empty endpoints
- **PVCs**: X bound, Y unbound
- **Warning events**: X in last hour

### Unhealthy Resources
| Resource | Type | Status | Issue |
|----------|------|--------|-------|
| pod-name | Pod | CrashLoopBackOff | Exit code 1, 47 restarts |
| deploy-name | Deployment | 1/3 ready | 2 pods Pending |
| pvc-name | PVC | Pending | No matching PV |

### Resource Consumption
| Pod | CPU (actual/limit) | Memory (actual/limit) | Pressure |
|-----|--------------------|-----------------------|----------|
| pod-a | 450m/500m | 490Mi/512Mi | HIGH |
| pod-b | 10m/500m | 50Mi/512Mi | LOW |

### Recent Warning Events (Top 5)
| Count | Reason | Object | Message |
|-------|--------|--------|---------|
| 23 | BackOff | pod/app-x | Back-off restarting failed container |
| 12 | FailedScheduling | pod/app-y | Insufficient memory |

### Node Health (Hosting Namespace Pods)
| Node | CPU% | Mem% | Conditions |
|------|------|------|------------|
| node-1 | 82% | 91% | MemoryPressure=True |
| node-2 | 45% | 60% | Ready=True |
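The Pressure column in the template is not defined by the skill. One way to fill it consistently is to express usage as a percentage of the limit. The sketch below assumes CPU values in millicores or whole cores and memory values in `Ki`, `Mi`, or `Gi`; the 80% and 50% thresholds and all function names are assumptions chosen for illustration.

```bash
# Hypothetical helpers for the Pressure column; thresholds are assumptions.

# Convert a CPU quantity to millicores ("450m" -> 450, "2" -> 2000).
# Fractional cores such as "0.5" are not handled.
to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  printf '%s\n' "$(( $1 * 1000 ))" ;;
  esac
}

# Convert a memory quantity to MiB ("490Mi" -> 490, "1Gi" -> 1024).
to_mib() {
  case "$1" in
    *Gi) printf '%s\n' "$(( ${1%Gi} * 1024 ))" ;;
    *Mi) printf '%s\n' "${1%Mi}" ;;
    *Ki) printf '%s\n' "$(( ${1%Ki} / 1024 ))" ;;
    *)   return 1 ;;
  esac
}

# Classify usage against a limit, both given as plain integers in the same unit.
pressure() {
  local used="$1" limit="$2" pct
  (( limit > 0 )) || { echo "NONE"; return; }   # no limit set
  pct=$(( used * 100 / limit ))
  if   (( pct >= 80 )); then echo "HIGH"
  elif (( pct >= 50 )); then echo "MEDIUM"
  else echo "LOW"
  fi
}
```

With these thresholds, `pressure "$(to_millicores 450m)" "$(to_millicores 500m)"` prints `HIGH` and `pressure "$(to_millicores 10m)" "$(to_millicores 500m)"` prints `LOW`, matching the pod-a and pod-b rows in the template.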

Phase 3: Diagnose

Analyse the collected evidence to identify root causes. Work through this decision tree for each unhealthy resource.

Pod Diagnostic Decision Tree

Pod not Running?
├─ Pending
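│   ├─ "Insufficient cpu/memory" → No node has enough allocatable capacity for the pod's requests (quota exhaustion blocks pod creation instead; look for FailedCreate on the ReplicaSet)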
│   ├─ "no nodes match selector" → Node affinity/selector mismatch
│   ├─ "persistentvolumeclaim not found" → PVC missing or unbound
│   ├─ "0/N nodes are available" → Check taints, tolerations, affinity
│   └─ No events at all → Scheduler may be down
├─ CrashLoopBackOff
│   ├─ Exit code 1 → Application error (check logs)
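│   ├─ Exit code 126 → Command found but not executable (file permissions or wrong entrypoint)
│   ├─ Exit code 127 → Command not found (wrong image or entrypoint)
│   ├─ Exit code 137 → SIGKILL: usually OOMKilled (confirm the terminated reason, then raise memory limits); otherwise killed after a failed liveness probe or an expired grace period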
│   ├─ Exit code 139 → Segfault (application bug or incompatible image)
│   └─ Exit code 143 → SIGTERM (graceful shutdown issue)
├─ ImagePullBackOff
│   ├─ "not found" → Wrong image name or tag
│   ├─ "unauthorized" → Missing or wrong imagePullSecret
│   └─ "timeout" → Registry unreachable (network/DNS)
├─ Init:Error / Init:CrashLoopBackOff
│   └─ Init container failing → Check init container logs separately
├─ CreateContainerConfigError
│   ├─ "secret not found" → Referenced Secret missing
│   └─ "configmap not found" → Referenced ConfigMap missing
├─ Evicted
│   └─ Node under pressure → Check node conditions, DiskPressure/MemoryPressure
└─ Terminating (stuck)
    ├─ Finalizer blocking deletion → Check finalizers
    └─ Process not responding to SIGTERM → Check graceful shutdown
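The exit-code branch of the tree can be sketched as a small lookup. This assumes bash; `explain_exit_code` is a hypothetical helper name, and the fallback relies on the common convention that a code above 128 means the process was terminated by signal number (code - 128).

```bash
# Hypothetical helper: map a container exit code to a first hypothesis.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "clean exit: container finished; check why it is expected to keep running" ;;
    1)   echo "application error: check container logs" ;;
    126) echo "command not executable: check file permissions and entrypoint" ;;
    127) echo "command not found: check image and entrypoint" ;;
    137) echo "SIGKILL: OOMKilled or killed after grace period; check terminated reason" ;;
    139) echo "SIGSEGV: segfault in the application or an incompatible image" ;;
    143) echo "SIGTERM: container was asked to stop; check graceful shutdown" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if (( code > 128 && code < 256 )); then
        echo "terminated by signal $(( code - 128 ))"
      else
        echo "application-specific exit code: check container logs"
      fi ;;
  esac
}
```

The exit code of the last terminated container can be read with `kubectl get pod ${POD} -n ${NS} -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'` and passed to the helper. The result is a starting hypothesis, not a confirmed root cause.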

Service Connectivity Diagnostic

Service not reachable?
├─ Endpoints empty?
│   ├─ Yes → Label selector mismatch between Service and Pods, or no matching pod is Ready
│   └─ No → Endpoints exist, check:
│       ├─ Port mismatch (service port vs container port)
│       ├─ Network policy blocking traffic
│       ├─ Pod readiness probe failing
│       └─ DNS resolution issue
└─ Ingress not routing?
    ├─ Ingress class missing or wrong
    ├─ TLS certificate issue
    ├─ Backend service name/port mismatch
    └─ Ingress controller not running
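For the "Endpoints empty" branch, the quickest confirmation is to list the pods that the Service's selector actually matches. The sketch below assumes bash and uses kubectl's `go-template` output to build the label query; `trim_selector` and `check_service_selector` are hypothetical helper names.

```bash
# Remove the trailing comma left by the go-template loop below.
trim_selector() { printf '%s\n' "${1%,}"; }

# Print the pods a Service's selector matches.
# Usage: check_service_selector <namespace> <service>
check_service_selector() {
  local ns="$1" svc="$2" raw sel
  raw=$(kubectl get service "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}') || return 1
  sel=$(trim_selector "$raw")
  if [[ -z "$sel" ]]; then
    # Services without a selector get their endpoints from elsewhere.
    echo "service ${svc} has no selector; endpoints are managed separately"
    return 0
  fi
  echo "selector: ${sel}"
  # No rows here means a selector mismatch; rows with READY 0/N point to readiness.
  kubectl get pods -n "$ns" -l "$sel" -o wide
}
```

If the pod list is empty, compare the printed selector with the labels shown by `kubectl get pods -n ${NS} --show-labels`.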

Common Root Cause Patterns

Symptom	Likely Root Cause	Evidence to Confirm
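Deployment below desired replicas, pods never created	ResourceQuota exhausted	`kubectl describe resourcequota -n ${NS}` shows usage at the hard limit; ReplicaSet events show FailedCreate with "exceeded quota"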
Pods OOMKilled	Memory limits too low	Container `lastState.terminated.reason == OOMKilled`
Pods CrashLooping	Application error	Exit code 1 + error in logs
Pods ImagePullBackOff	Wrong image/tag or missing secret	Event message contains "not found" or "unauthorized"
Service 503s	No ready endpoints	`kubectl get endpoints` shows empty subsets
PVC Pending	No matching PV or StorageClass	Event "no persistent volumes available"
Intermittent failures	Node under pressure	Node conditions show pressure flags
Slow response	CPU throttling	CPU usage near limits, high throttle counts
Connection refused	Container port mismatch	Service targetPort != container port
Pods scheduled but not starting	ConfigMap/Secret missing	CreateContainerConfigError in events

Phase 4: Impact Assessment

For each identified problem, assess its severity and blast radius.

Severity Classification

Severity	Criteria	Response
Critical	All replicas down, data loss risk, entire namespace non-functional	Immediate action required
High	Degraded availability, partial outage, >50% replicas unhealthy	Action within 1 hour
Medium	Single replica down (redundancy covering), elevated restarts	Action within 4 hours
Low	Warning events, non-critical resource pressure, cosmetic issues	Action within 24 hours
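The replica-based criteria in the table can be sketched as a function. This assumes bash and covers only replica counts; data-loss risk, warning events, and user impact still need judgement. `classify_severity` is a hypothetical name, and the `OK` result for a fully healthy workload is an addition that the table does not define.

```bash
# Hypothetical helper: classify a workload from ready vs desired replicas.
# Usage: classify_severity <ready> <desired>
classify_severity() {
  local ready="$1" desired="$2" unhealthy
  (( desired > 0 )) || { echo "OK"; return; }     # scaled to zero on purpose
  unhealthy=$(( desired - ready ))
  if   (( ready <= 0 )); then echo "Critical"      # all replicas down
  elif (( unhealthy * 2 > desired )); then echo "High"    # more than 50% unhealthy
  elif (( unhealthy > 0 )); then echo "Medium"     # some replicas down, redundancy covering
  else echo "OK"
  fi
}
```

For example, `classify_severity 1 3` prints `High` because two of three replicas are unhealthy, while `classify_severity 1 2` prints `Medium` because exactly half is not more than 50%.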

Impact Assessment Format

For each problem, present: Problem | Severity | Affected Resources | User Impact | Blast Radius

Phase 5: Recommend

For each diagnosed problem, present recommended actions for the user to execute. Do not perform remediation directly.

Recommended Actions

Present each recommendation with the problem, the suggested command or change, and who should action it:

Problem	Recommended Action	Owner
Deployment stuck or degraded	`kubectl rollout restart deployment/${NAME} -n ${NS}`	User
Pod stuck in Terminating	`kubectl delete pod ${POD} -n ${NS} --grace-period=0 --force`	User
Evicted pods cluttering namespace	`kubectl get pods -n ${NS} --field-selector=status.phase=Failed | grep Evicted | awk '{print $1}' | xargs kubectl delete pod -n ${NS}`	User
Failed Jobs cluttering namespace	`kubectl delete jobs -n ${NS} --field-selector=status.successful=0` (also matches Jobs that are still running; review `kubectl get jobs -n ${NS}` first)	User
Rollout stuck on bad ReplicaSet	`kubectl rollout undo deployment/${NAME} -n ${NS}`	User
OOMKilled	Increase memory limits in deployment manifest	User (manifest change)
ResourceQuota exhausted	Request quota increase or reduce workload	Namespace owner / platform team
Image not found	Correct image name/tag in deployment manifest	User (manifest change)
ImagePullSecret missing	`kubectl create secret docker-registry` (credentials needed)	User
PVC Pending (no PV)	Provision storage or check StorageClass availability	Platform team
Node pressure	Scale cluster or redistribute workloads	Platform team
Network policy blocking	Review and update NetworkPolicy rules	User / security team
CPU throttling	Increase CPU limits or optimise application	User (manifest or code change)
Readiness probe failing	Fix probe configuration or application health endpoint	User (manifest or code change)
Ingress misconfigured	Correct ingress rules, TLS config, or backend service ref	User (manifest change)
ConfigMap/Secret missing	Create the missing ConfigMap or Secret resource	User (may need values from team)
CronJob schedule wrong	Update cron expression in CronJob manifest	User (manifest change)
Stuck finalizer	Remove finalizer from resource (understand why it exists first)	User

Recommendation Report Format

Structure the final report with two sections:

Recommended Actions — What the user should do (Priority, Action, Command/Change, Owner)
Preventive Recommendations — PDBs, resource limits, probes, HPA, right-sizing
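To keep the Recommended Actions section ordered by priority, the rows can be sorted by severity rank before the report is written. This sketch assumes bash, tab-separated rows with the priority in the first column, and the four severity names from Phase 4; `sort_by_priority` is a hypothetical helper name.

```bash
# Hypothetical helper: order tab-separated report rows by severity.
# Input columns: Priority<TAB>Action<TAB>Command/Change<TAB>Owner
sort_by_priority() {
  awk -F'\t' '
    BEGIN { rank["Critical"]=1; rank["High"]=2; rank["Medium"]=3; rank["Low"]=4 }
    # Prefix each row with its rank; unknown priorities sort last.
    { r = ($1 in rank) ? rank[$1] : 9; print r "\t" $0 }
  ' | sort -s -t $'\t' -k1,1n | cut -f2-
}
```

Rows with the same priority keep their original order because the sort is stable (`-s`).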

Common Error Patterns — Quick Diagnosis

A quick-lookup table for when the user describes a specific symptom:

User Says	Check First	Command
"Pods keep restarting"	Restart counts and OOMKill	`kubectl get pods -n ${NS}` + describe top restarter
"App is slow"	CPU throttling and resource usage	`kubectl top pods -n ${NS}` + check limits
"Can't connect to service"	Endpoints and network policies	`kubectl get endpoints -n ${NS}`
"Pods won't start"	Events for scheduling failures	`kubectl get events -n ${NS} --field-selector type=Warning`
"Deployment stuck"	Rollout status	`kubectl rollout status deploy/${NAME} -n ${NS}`
"Out of space"	PVC usage and node disk pressure	`kubectl get pvc -n ${NS}` + node conditions
"Permission denied"	RBAC and security context	`kubectl auth can-i --list -n ${NS}`
"Image pull error"	Image name and pull secrets	`kubectl describe pod ${POD} -n ${NS}`
"Namespace feels broken"	Full Phase 1 collection	Run all collection commands

MCP Tools Available

When the appropriate MCP servers are connected, prefer these over raw kubectl where available:

mcp__flux-operator-mcp__get_kubernetes_resources - Query any Kubernetes resource
mcp__flux-operator-mcp__get_kubernetes_logs - Retrieve pod logs
mcp__flux-operator-mcp__get_kubernetes_metrics - Get resource consumption metrics

When MCP tools are available, follow the same Phase 1 collection order using MCP equivalents (get_kubernetes_resources for each resource kind, get_kubernetes_metrics for usage, get_kubernetes_logs for targeted pod logs).

Common Mistakes

Mistake	Why It Fails	Instead
Jumping to diagnosis after checking only pods	Misses node pressure, quota limits, or missing ConfigMaps as root cause	Complete all Phase 1 sections before diagnosing
Force-deleting pods without understanding the failure	Pod respawns with the same error; masks the evidence	Collect logs and describe the pod first, then recommend action to the user
Ignoring node-level causes	Evictions and scheduling failures appear in namespace events, but their causes (node pressure, taints, exhausted capacity) are only visible on the nodes	Always run section 1.8 (Node Health)
Decoding and printing Secrets	Leaks credentials into terminal history and agent context	List secret names only — never base64-decode
Treating all problems as equal severity	Wastes time on low-impact issues while critical ones persist	Classify every finding through Phase 4 before acting
Restarting deployments as a first resort	Hides the real problem; restarts don't fix OOMKills, bad images, or missing config	Diagnose root cause first, then recommend restart to user if appropriate

Behavioural Guidelines

When using this skill, follow these principles:

Always collect before concluding — Run Phase 1 fully. Do not guess root causes from partial data.
Present the summary first — Give the user the Phase 2 health summary before diving into diagnosis. Let them see the landscape.
Be explicit about severity — Every identified problem must have a severity rating from Phase 4.
Recommend, don't remediate — Present recommended actions with commands for the user to execute. Do not perform remediation directly.
Never print secret values — List secret names only. Never decode or display secret data.
Respect the namespace boundary — This is a namespace-scoped investigation. Only look at nodes or cluster resources when they directly affect this namespace's workloads.
Handle empty namespaces — If the namespace has no resources, say so. It may be newly created, or everything may have been deleted.
Handle large namespaces — If there are >50 pods, focus on unhealthy resources first. Summarise healthy resources in aggregate.
Correlate across resources — A pod failure may be caused by a node issue, a missing secret, or a quota limit. Cross-reference findings.
Recommend prevention — After diagnosing issues, suggest probes, PDBs, resource limits, and monitoring to prevent recurrence.