name: k8s-namespace-troubleshooting
description: Use when a user reports a problem in a specific namespace, when investigating unhealthy workloads or pod failures, when checking namespace events and resource status, or when triaging application issues inside Kubernetes
Kubernetes Namespace Troubleshooting
Perform comprehensive, namespace-scoped investigation of Kubernetes workloads. Collect events, pod status, node health, resource consumption, configuration issues, and application logs, then summarise findings, diagnose root causes, and assess impact.
Keywords
kubernetes, namespace, troubleshoot, troubleshooting, debug, diagnose, investigate, problem, issue, error, unhealthy, failing, broken, pod, deployment, statefulset, daemonset, job, cronjob, service, ingress, events, logs, restart, crashloop, pending, oomkill, eviction, namespace health, application, workload
When to Use This Skill
- A user says "I have a problem in namespace X" or "something is wrong in namespace X"
- Pods are failing, restarting, pending, or evicted in a namespace
- Services or ingresses are not responding
- Applications are returning errors or degraded performance
- Namespace events show warnings or errors
- A general health check of everything running in a namespace is needed
Related Skills
- k8s-platform-operations - Cluster-wide health checks and incident response
- k8s-security-hardening - Security context and policy issues
- k8s-platform-tenancy - Resource quotas and limit ranges
- k8s-continual-improvement - SLOs and capacity planning
- k8s-security-redteam - Security-related namespace issues
- flux-troubleshooting - GitOps reconciliation failures
- Shared: Pod Security Context
- Shared: Network Policies
Quick Reference
| Task | Command |
|---|---|
| All resources in namespace | kubectl get all -n ${NS} |
| Warning events | kubectl get events -n ${NS} --field-selector type=Warning --sort-by='.lastTimestamp' |
| Unhealthy pods | kubectl get pods -n ${NS} \| grep -Ev 'Running\|Completed' |
| Pod resource usage | kubectl top pods -n ${NS} --sort-by=memory |
| Recent pod logs | kubectl logs -n ${NS} deploy/${NAME} --tail=100 |
| Describe failing pod | kubectl describe pod ${POD} -n ${NS} |
Investigation Workflow
This skill follows a five-phase workflow. Each phase builds on the previous one.
Phase 1: COLLECT → Gather all namespace data (resources, events, status)
↓
Phase 2: SUMMARISE → Produce a namespace health summary
↓
Phase 3: DIAGNOSE → Identify root causes from collected evidence
↓
Phase 4: IMPACT → Assess severity and blast radius
↓
Phase 5: RECOMMEND → Present recommended actions for the user
Phase 1: Collect
Gather comprehensive data from the namespace. Run all commands targeting the user-provided namespace. Collect the output of every section below before moving to Phase 2.
1.1 Namespace Status and Configuration
# Confirm namespace exists and check labels/annotations
kubectl get namespace ${NS} -o yaml
# Check resource quotas (an exhausted quota blocks pod creation)
kubectl get resourcequota -n ${NS}
kubectl describe resourcequota -n ${NS}
# Check limit ranges (may explain OOMKills or throttling)
kubectl get limitrange -n ${NS}
kubectl describe limitrange -n ${NS}
1.2 Events (Critical — Always Check First)
# All events sorted by time (most recent last)
kubectl get events -n ${NS} --sort-by='.lastTimestamp'
# Warning events only
kubectl get events -n ${NS} --field-selector type=Warning --sort-by='.lastTimestamp'
# Event counts to spot recurring issues
kubectl get events -n ${NS} -o json | \
jq -r '[.items[] | {reason: .reason, name: .involvedObject.name, count: .count}] | group_by(.reason) | map({reason: .[0].reason, total: (map(.count) | add)}) | sort_by(-.total) | .[] | "\(.total)\t\(.reason)"'
1.3 Pods — Status, Restarts, and Conditions
# Full pod listing with status
kubectl get pods -n ${NS} -o wide
# Unhealthy pods (not Running or Completed)
kubectl get pods -n ${NS} | grep -Ev 'Running|Completed'
# High restart counts
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.containerStatuses[*]}{.restartCount}{" "}{end}{"\n"}{end}' | awk '{sum=0; for(i=2;i<=NF;i++) sum+=$i; if(sum>0) print $1"\t"sum" restarts"}' | sort -t$'\t' -k2 -rn
# OOMKilled containers
kubectl get pods -n ${NS} -o json | \
jq -r '.items[] | .metadata.name as $pod | .status.containerStatuses[]? | select(.lastState.terminated.reason == "OOMKilled") | "\($pod)\t\(.name)\tOOMKilled"'
# Pending pods — why are they not scheduled?
kubectl get pods -n ${NS} --field-selector=status.phase=Pending -o name | \
xargs -I{} kubectl describe {} -n ${NS} | grep -A5 "Events:"
# Container image details (spot wrong tags, missing images)
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{" "}{end}{"\n"}{end}'
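The grep filter above only inspects the STATUS text, so a pod that is Running but not fully ready (for example READY 0/1) passes as healthy. A minimal sketch of a stricter filter follows. The function name `unhealthy_pods` is hypothetical, and the sketch assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS).

```shell
# Reads `kubectl get pods --no-headers` lines on stdin and prints the names of
# pods that are not Completed and are either not Running or not fully ready.
# Assumption: columns are NAME READY STATUS ... in that order.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column, e.g. "1/2"
    if ($3 == "Completed") next      # finished Job pods are expected to be 0/1
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}
```

Possible usage: `kubectl get pods -n "${NS}" --no-headers | unhealthy_pods`. The resulting names can then drive targeted `describe` and `logs` calls instead of looping over every pod.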
1.4 Workload Controllers
# Deployments — check desired vs ready replicas
kubectl get deployments -n ${NS} -o wide
# Deployments not at full availability
kubectl get deployments -n ${NS} -o json | \
jq -r '.items[] | select((.status.availableReplicas // 0) != .spec.replicas) | "\(.metadata.name)\tavailable=\(.status.availableReplicas // 0)/\(.spec.replicas)"'
# StatefulSets
kubectl get statefulsets -n ${NS} -o wide
# DaemonSets — check desired vs ready
kubectl get daemonsets -n ${NS} -o wide
# Jobs — check for failures
kubectl get jobs -n ${NS}
kubectl get jobs -n ${NS} -o json | \
jq -r '.items[] | select(.status.failed > 0) | "\(.metadata.name)\tfailed=\(.status.failed)"'
# CronJobs — check schedule and last run
kubectl get cronjobs -n ${NS}
1.5 Services, Endpoints, and Networking
# Services
kubectl get services -n ${NS} -o wide
# Endpoints — empty endpoints mean no backing pods
kubectl get endpoints -n ${NS}
kubectl get endpoints -n ${NS} -o json | \
jq -r '.items[] | select((.subsets // []) | length == 0) | "\(.metadata.name)\tNO ENDPOINTS"'
# Ingresses
kubectl get ingress -n ${NS}
# Network policies (may be blocking traffic)
kubectl get networkpolicies -n ${NS}
1.6 Configuration — ConfigMaps, Secrets, PVCs
# ConfigMaps
kubectl get configmaps -n ${NS}
# Secrets (names only — never print secret data)
kubectl get secrets -n ${NS}
# PersistentVolumeClaims — check bound status and capacity
kubectl get pvc -n ${NS}
kubectl get pvc -n ${NS} -o json | \
jq -r '.items[] | select(.status.phase != "Bound") | "\(.metadata.name)\t\(.status.phase)"'
1.7 Resource Consumption
# Pod CPU and memory (requires metrics-server)
kubectl top pods -n ${NS} --sort-by=memory
kubectl top pods -n ${NS} --sort-by=cpu
# Requests vs limits vs actual (identify over/under provisioning)
kubectl get pods -n ${NS} -o json | \
jq -r '.items[] | .metadata.name as $pod | .spec.containers[] | "\($pod)\t\(.name)\treq_cpu=\(.resources.requests.cpu // "none")\tlim_cpu=\(.resources.limits.cpu // "none")\treq_mem=\(.resources.requests.memory // "none")\tlim_mem=\(.resources.limits.memory // "none")"'
1.8 Node Health (for Nodes Running Namespace Pods)
# Find which nodes host pods in this namespace
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u
# Check conditions on those nodes
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u | \
xargs -I{} sh -c 'echo "--- {} ---" && kubectl describe node {} | grep -A8 "Conditions:"'
# Node resource pressure
kubectl get pods -n ${NS} -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u | \
xargs -I{} kubectl top node {}
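Grepping `describe` output depends on how many condition rows a node reports. As an alternative sketch, conditions can be read with jsonpath and reduced to the abnormal ones. The function names `node_conditions` and `abnormal_conditions` are hypothetical. The filter assumes that on a healthy node `Ready` is `True` and every other condition is `False`, which holds for the standard kubelet conditions but may not hold for custom ones.

```shell
# Prints one "Type=Status" pair per line for the given node.
# $1 = node name
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Reads "Type=Status" lines on stdin and prints only the abnormal ones.
# Assumption: Ready should be True; all other conditions should be False.
abnormal_conditions() {
  awk -F= 'NF == 2 && (($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False")) { print }'
}
```

Possible usage: `node_conditions node-1 | abnormal_conditions`, where `node-1` is one of the node names produced by the listing above. Empty output means no pressure flags are set.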
1.9 Application Logs
# Current logs, previous logs, and error grep — single pass over all pods
for pod in $(kubectl get pods -n ${NS} -o jsonpath='{.items[*].metadata.name}'); do
echo "=== ${pod} ==="
kubectl logs -n ${NS} ${pod} --all-containers --tail=100 2>&1
echo "--- ${pod} (previous) ---"
kubectl logs -n ${NS} ${pod} --all-containers --previous --tail=50 2>&1 || echo "No previous logs"
echo "--- ${pod} (errors) ---"
kubectl logs -n ${NS} ${pod} --all-containers --tail=200 2>/dev/null | \
grep -iE 'error|exception|fatal|panic|timeout|refused|denied|failed|oom|kill' | \
head -20
done
1.10 HPA and VPA (Autoscaling)
# Horizontal Pod Autoscalers
kubectl get hpa -n ${NS}
kubectl describe hpa -n ${NS}
# Vertical Pod Autoscalers (if installed)
kubectl get vpa -n ${NS} 2>/dev/null
1.11 Flux / GitOps Resources (If Present)
# Check if any Flux resources target this namespace
kubectl get kustomizations.kustomize.toolkit.fluxcd.io -A -o json 2>/dev/null | \
jq -r --arg ns "${NS}" '.items[] | select(.spec.targetNamespace == $ns or .metadata.namespace == $ns) | (.status.conditions // [] | map(select(.type == "Ready")) | first) as $c | "\(.metadata.namespace)/\(.metadata.name)\tReady=\($c.status // "Unknown")\tMessage=\($c.message // "")"'
kubectl get helmreleases.helm.toolkit.fluxcd.io -n ${NS} 2>/dev/null
Phase 2: Summarise
After collecting data, produce a structured health summary. This summary must be presented to the user before deeper diagnosis.
Namespace Health Summary Template
## Namespace Health Summary: ${NS}
### Overview
- **Total pods**: X (Y running, Z unhealthy)
- **Deployments**: X/Y at desired replicas
- **StatefulSets**: X/Y at desired replicas
- **Jobs**: X succeeded, Y failed
- **Services**: X total, Y with empty endpoints
- **PVCs**: X bound, Y unbound
- **Warning events**: X in last hour
### Unhealthy Resources
| Resource | Type | Status | Issue |
|----------|------|--------|-------|
| pod-name | Pod | CrashLoopBackOff | Exit code 1, 47 restarts |
| deploy-name | Deployment | 1/3 ready | 2 pods Pending |
| pvc-name | PVC | Pending | No matching PV |
### Resource Consumption
| Pod | CPU (actual/limit) | Memory (actual/limit) | Pressure |
|-----|--------------------|-----------------------|----------|
| pod-a | 450m/500m | 490Mi/512Mi | HIGH |
| pod-b | 10m/500m | 50Mi/512Mi | LOW |
### Recent Warning Events (Top 5)
| Count | Reason | Object | Message |
|-------|--------|--------|---------|
| 23 | BackOff | pod/app-x | Back-off restarting failed container |
| 12 | FailedScheduling | pod/app-y | Insufficient memory |
### Node Health (Hosting Namespace Pods)
| Node | CPU% | Mem% | Conditions |
|------|------|------|------------|
| node-1 | 82% | 91% | MemoryPressure=True |
| node-2 | 45% | 60% | Ready=True |
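The Pressure column in the template is a ratio of actual usage to the configured limit. One way to derive it is sketched below. The function names `to_base` and `pressure_label` are hypothetical, and the 80% and 50% thresholds are assumptions chosen to match the sample rows, not values defined by this skill. Only the `m`, `Ki`, `Mi`, and `Gi` suffixes are handled.

```shell
# Converts a Kubernetes quantity (450m, 2, 490Mi, 1Gi) to a plain number:
# cores for CPU, bytes for memory. Exits non-zero on unsupported suffixes.
to_base() {
  printf '%s\n' "$1" | awk '
    /^[0-9.]+m$/  { sub(/m$/, "");  printf "%.6f\n", $0 / 1000;               next }
    /^[0-9.]+Ki$/ { sub(/Ki$/, ""); printf "%.0f\n", $0 * 1024;               next }
    /^[0-9.]+Mi$/ { sub(/Mi$/, ""); printf "%.0f\n", $0 * 1024 * 1024;        next }
    /^[0-9.]+Gi$/ { sub(/Gi$/, ""); printf "%.0f\n", $0 * 1024 * 1024 * 1024; next }
    /^[0-9.]+$/   { print $0; next }
    { exit 1 }'
}

# Prints HIGH, MEDIUM, or LOW for a usage/limit pair.
# $1 = actual usage, $2 = limit (same resource type)
# Assumption: >=80% is HIGH, >=50% is MEDIUM, anything lower is LOW.
pressure_label() {
  used=$(to_base "$1") || return 1
  limit=$(to_base "$2") || return 1
  awk -v u="$used" -v l="$limit" 'BEGIN {
    if (l <= 0) { print "UNKNOWN"; exit }
    pct = 100 * u / l
    if (pct >= 80) print "HIGH"; else if (pct >= 50) print "MEDIUM"; else print "LOW"
  }'
}
```

For the sample rows, `pressure_label 450m 500m` and `pressure_label 490Mi 512Mi` both report HIGH, while `pressure_label 10m 500m` reports LOW. Containers without a limit have no meaningful ratio and are better reported as "no limit".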
Phase 3: Diagnose
Analyse the collected evidence to identify root causes. Work through this decision tree for each unhealthy resource.
Pod Diagnostic Decision Tree
Pod not Running?
├─ Pending
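│ ├─ "Insufficient cpu/memory" → No node has enough free capacity for the pod's requests (an exhausted ResourceQuota instead blocks pod creation; look for FailedCreate events)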
│ ├─ "no nodes match selector" → Node affinity/selector mismatch
│ ├─ "persistentvolumeclaim not found" → PVC missing or unbound
│ ├─ "0/N nodes are available" → Check taints, tolerations, affinity
│ └─ No events at all → Scheduler may be down
├─ CrashLoopBackOff
│ ├─ Exit code 1 → Application error (check logs)
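│ ├─ Exit code 126/127 → Command not executable (126) or not found (127): wrong image, entrypoint, or file permissions
│ ├─ Exit code 137 → SIGKILL, usually OOMKilled (confirm lastState.terminated.reason, then raise memory limits)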
│ ├─ Exit code 139 → Segfault (application bug or incompatible image)
│ └─ Exit code 143 → SIGTERM (graceful shutdown issue)
├─ ImagePullBackOff
│ ├─ "not found" → Wrong image name or tag
│ ├─ "unauthorized" → Missing or wrong imagePullSecret
│ └─ "timeout" → Registry unreachable (network/DNS)
├─ Init:Error / Init:CrashLoopBackOff
│ └─ Init container failing → Check init container logs separately
├─ CreateContainerConfigError
│ ├─ "secret not found" → Referenced Secret missing
│ └─ "configmap not found" → Referenced ConfigMap missing
├─ Evicted
│ └─ Node under pressure → Check node conditions, DiskPressure/MemoryPressure
└─ Terminating (stuck)
├─ Finalizer blocking deletion → Check finalizers
└─ Process not responding to SIGTERM → Check graceful shutdown
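The exit codes in the tree above follow the usual shell convention: 126 and 127 come from the runtime failing to start the command, and values above 128 mean the process was ended by signal number (code minus 128). A small lookup helper, sketched under that assumption, is shown below. The function name `explain_exit_code` is hypothetical.

```shell
# Prints a short interpretation of a container exit code.
# $1 = exit code taken from lastState.terminated.exitCode
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    1)   echo "application error (check logs)" ;;
    126) echo "command found but not executable (permissions or wrong entrypoint)" ;;
    127) echo "command not found (wrong image or entrypoint)" ;;
    137) echo "SIGKILL (signal 9): often OOMKilled, confirm with lastState.terminated.reason" ;;
    139) echo "SIGSEGV (signal 11): segfault" ;;
    143) echo "SIGTERM (signal 15): graceful shutdown issue" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $(($1 - 128))"
      else
        echo "application-defined exit code"
      fi ;;
  esac
}
```

The codes themselves can be read with `kubectl get pod ${POD} -n ${NS} -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'`. Exit code 137 alone does not prove an OOM kill, since a failed liveness probe followed by an expired grace period also ends in SIGKILL.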
Service Connectivity Diagnostic
Service not reachable?
├─ Endpoints empty?
│ ├─ Yes → Label selector mismatch between Service and Pods, or no matching Pod is Ready
│ └─ No → Endpoints exist, check:
│ ├─ Port mismatch (service port vs container port)
│ ├─ Network policy blocking traffic
│ ├─ Pod readiness probe failing
│ └─ DNS resolution issue
└─ Ingress not routing?
   ├─ Ingress class missing or wrong
   ├─ TLS certificate issue
   ├─ Backend service name/port mismatch
   └─ Ingress controller not running
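To check the selector branch of the tree above, the Service's selector can be fed straight back into a pod query. The sketch below assumes that SELECTOR is the last column of `kubectl get service -o wide` output and that a Service without a selector prints `<none>` there. The function names `service_selector` and `pods_behind_service` are hypothetical.

```shell
# Prints the label selector of a Service in key=value,key=value form.
# $1 = namespace, $2 = service name
# Assumption: SELECTOR is the last column of `-o wide` output.
service_selector() {
  kubectl get service "$2" -n "$1" -o wide --no-headers | awk '{ print $NF }'
}

# Lists the pods that the Service selector actually matches.
# $1 = namespace, $2 = service name
pods_behind_service() {
  sel=$(service_selector "$1" "$2")
  if [ -z "$sel" ] || [ "$sel" = "<none>" ]; then
    echo "service $2 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$1" -l "$sel" -o wide
}
```

If `pods_behind_service "${NS}" my-svc` (with `my-svc` as a placeholder name) returns no pods, the selector and the pod labels disagree. If it returns pods that are not Ready, the readiness probe is the next thing to inspect.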
Common Root Cause Patterns
| Symptom | Likely Root Cause | Evidence to Confirm |
|---|---|---|
| Pods never created (ReplicaSet reports FailedCreate) | ResourceQuota exhausted | Events contain "exceeded quota"; kubectl describe resourcequota -n ${NS} shows at limit |
| Pods OOMKilled | Memory limits too low | Container lastState.terminated.reason == OOMKilled |
| Pods CrashLooping | Application error | Exit code 1 + error in logs |
| Pods ImagePullBackOff | Wrong image/tag or missing secret | Event message contains "not found" or "unauthorized" |
| Service 503s | No ready endpoints | kubectl get endpoints shows empty subsets |
| PVC Pending | No matching PV or StorageClass | Event "no persistent volumes available" |
| Intermittent failures | Node under pressure | Node conditions show pressure flags |
| Slow response | CPU throttling | CPU usage near limits, high throttle counts |
| Connection refused | Container port mismatch | Service targetPort != container port |
| Pods scheduled but not starting | ConfigMap/Secret missing | CreateContainerConfigError in events |
Phase 4: Impact Assessment
For each identified problem, assess its severity and blast radius.
Severity Classification
| Severity | Criteria | Response |
|---|---|---|
| Critical | All replicas down, data loss risk, entire namespace non-functional | Immediate action required |
| High | Degraded availability, partial outage, >50% replicas unhealthy | Action within 1 hour |
| Medium | Single replica down (redundancy covering), elevated restarts | Action within 4 hours |
| Low | Warning events, non-critical resource pressure, cosmetic issues | Action within 24 hours |
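For replica-based workloads, the table above can be approximated from ready and desired replica counts. The sketch below is one possible reading: zero ready replicas is Critical, more than half unhealthy is High, and any smaller shortfall is Medium. The function name `classify_severity` is hypothetical, and treating every shortfall of 50% or less as Medium is an assumption, since the table only names the single-replica case. Low covers warnings and pressure rather than replica counts, so it is not derived here.

```shell
# Prints a severity label from replica counts.
# $1 = ready replicas, $2 = desired replicas
# Assumption: a workload scaled to zero, or fully ready, is reported as OK.
classify_severity() {
  ready=$1
  desired=$2
  if [ "$desired" -le 0 ] || [ "$ready" -ge "$desired" ]; then
    echo "OK"
  elif [ "$ready" -eq 0 ]; then
    echo "Critical"
  elif [ $(( (desired - ready) * 2 )) -gt "$desired" ]; then
    echo "High"       # more than 50% of replicas unhealthy
  else
    echo "Medium"     # some replicas down, redundancy still covering
  fi
}
```

For example, `classify_severity 1 3` reports High and `classify_severity 2 3` reports Medium. The inputs can come from `.status.readyReplicas` (absent when zero, so default it to 0) and `.spec.replicas` of a Deployment or StatefulSet. Data loss risk and user impact still need human judgment on top of this count.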
Impact Assessment Format
For each problem, present: Problem | Severity | Affected Resources | User Impact | Blast Radius
Phase 5: Recommend
For each diagnosed problem, present recommended actions for the user to execute. Do not perform remediation directly.
Recommended Actions
Present each recommendation with the problem, the suggested command or change, and who should action it:
| Problem | Recommended Action | Owner |
|---|---|---|
| Deployment stuck or degraded | kubectl rollout restart deployment/${NAME} -n ${NS} | User |
| Pod stuck in Terminating | kubectl delete pod ${POD} -n ${NS} --grace-period=0 --force | User |
| Evicted pods cluttering namespace | kubectl get pods -n ${NS} --field-selector=status.phase=Failed \| grep Evicted \| awk '{print $1}' \| xargs kubectl delete pod -n ${NS} | User |
| Failed Jobs cluttering namespace | kubectl delete jobs -n ${NS} --field-selector=status.successful=0 (also matches Jobs that are still running; review kubectl get jobs -n ${NS} first) | User |
| Rollout stuck on bad ReplicaSet | kubectl rollout undo deployment/${NAME} -n ${NS} | User |
| OOMKilled | Increase memory limits in deployment manifest | User (manifest change) |
| ResourceQuota exhausted | Request quota increase or reduce workload | Namespace owner / platform team |
| Image not found | Correct image name/tag in deployment manifest | User (manifest change) |
| ImagePullSecret missing | kubectl create secret docker-registry (credentials needed) | User |
| PVC Pending (no PV) | Provision storage or check StorageClass availability | Platform team |
| Node pressure | Scale cluster or redistribute workloads | Platform team |
| Network policy blocking | Review and update NetworkPolicy rules | User / security team |
| CPU throttling | Increase CPU limits or optimise application | User (manifest or code change) |
| Readiness probe failing | Fix probe configuration or application health endpoint | User (manifest or code change) |
| Ingress misconfigured | Correct ingress rules, TLS config, or backend service ref | User (manifest change) |
| ConfigMap/Secret missing | Create the missing ConfigMap or Secret resource | User (may need values from team) |
| CronJob schedule wrong | Update cron expression in CronJob manifest | User (manifest change) |
| Stuck finalizer | Remove finalizer from resource (understand why it exists first) | User |
Recommendation Report Format
Structure the final report with two sections:
- Recommended Actions — What the user should do (Priority, Action, Command/Change, Owner)
- Preventive Recommendations — PDBs, resource limits, probes, HPA, right-sizing
Common Error Patterns — Quick Diagnosis
A quick-lookup table for when the user describes a specific symptom:
| User Says | Check First | Command |
|---|---|---|
| "Pods keep restarting" | Restart counts and OOMKill | kubectl get pods -n ${NS} + describe top restarter |
| "App is slow" | CPU throttling and resource usage | kubectl top pods -n ${NS} + check limits |
| "Can't connect to service" | Endpoints and network policies | kubectl get endpoints -n ${NS} |
| "Pods won't start" | Events for scheduling failures | kubectl get events -n ${NS} --field-selector type=Warning |
| "Deployment stuck" | Rollout status | kubectl rollout status deploy/${NAME} -n ${NS} |
| "Out of space" | PVC usage and node disk pressure | kubectl get pvc -n ${NS} + node conditions |
| "Permission denied" | RBAC and security context | kubectl auth can-i --list -n ${NS} |
| "Image pull error" | Image name and pull secrets | kubectl describe pod ${POD} -n ${NS} |
| "Namespace feels broken" | Full Phase 1 collection | Run all collection commands |
MCP Tools Available
When the appropriate MCP servers are connected, prefer these over raw kubectl where available:
- mcp__flux-operator-mcp__get_kubernetes_resources - Query any Kubernetes resource
- mcp__flux-operator-mcp__get_kubernetes_logs - Retrieve pod logs
- mcp__flux-operator-mcp__get_kubernetes_metrics - Get resource consumption metrics
When MCP tools are available, follow the same Phase 1 collection order using MCP equivalents (get_kubernetes_resources for each resource kind, get_kubernetes_metrics for usage, get_kubernetes_logs for targeted pod logs).
Common Mistakes
| Mistake | Why It Fails | Instead |
|---|---|---|
| Jumping to diagnosis after checking only pods | Misses node pressure, quota limits, or missing ConfigMaps as root cause | Complete all Phase 1 sections before diagnosing |
| Force-deleting pods without understanding the failure | Pod respawns with the same error; masks the evidence | Collect logs and describe the pod first, then recommend action to the user |
| Ignoring node-level causes | Pod evictions and scheduling failures are invisible at namespace level | Always run section 1.8 (Node Health) |
| Decoding and printing Secrets | Leaks credentials into terminal history and agent context | List secret names only — never base64-decode |
| Treating all problems as equal severity | Wastes time on low-impact issues while critical ones persist | Classify every finding through Phase 4 before acting |
| Restarting deployments as a first resort | Hides the real problem; restarts don't fix OOMKills, bad images, or missing config | Diagnose root cause first, then recommend restart to user if appropriate |
Behavioural Guidelines
When using this skill, follow these principles:
- Always collect before concluding — Run Phase 1 fully. Do not guess root causes from partial data.
- Present the summary first — Give the user the Phase 2 health summary before diving into diagnosis. Let them see the landscape.
- Be explicit about severity — Every identified problem must have a severity rating from Phase 4.
- Recommend, don't remediate — Present recommended actions with commands for the user to execute. Do not perform remediation directly.
- Never print secret values — List secret names only. Never decode or display secret data.
- Respect the namespace boundary — This is a namespace-scoped investigation. Only look at nodes or cluster resources when they directly affect this namespace's workloads.
- Handle empty namespaces — If the namespace has no resources, say so. It may be newly created, or everything may have been deleted.
- Handle large namespaces — If there are >50 pods, focus on unhealthy resources first. Summarise healthy resources in aggregate.
- Correlate across resources — A pod failure may be caused by a node issue, a missing secret, or a quota limit. Cross-reference findings.
- Recommend prevention — After diagnosing issues, suggest probes, PDBs, resource limits, and monitoring to prevent recurrence.