k3s-cluster-health - SKILL.md Agent Skill

name: k3s-cluster-health description: "Run a comprehensive health check of the entire K3s cluster including nodes, system pods, resource usage, and overall availability. Use this for daily health audits, before maintenance, after incidents, or when investigating cluster-wide issues. Keywords: cluster, health, nodes, k3s, kubernetes, k8s, audit, overview, status, capacity, resources."

K3s Cluster Health

This skill performs a comprehensive health check of the entire K3s cluster, assessing node health, system components, workload status, and resource capacity.

When to Use

Daily/weekly cluster health audit
Before performing maintenance
After a cluster incident
Investigating cluster-wide slowness
Capacity planning
When multiple services seem affected
New cluster verification

What It Does

The skill will:

Check all node health and status
Verify control plane components
Assess cluster resource capacity
Check all namespaces for unhealthy workloads
Review recent cluster events
Identify resource pressure (CPU, memory, disk)
Check system pods (kube-system, monitoring)
Summarize overall cluster health score
Provide recommendations for issues found

Usage

Invoke: /k3s-cluster-health

No arguments required — checks the entire cluster.

Implementation

Phase 1: Node Health

Check all nodes in the cluster.

1. k8s_get_nodes()
   - List all nodes
   - Check Ready condition for each
   - Check for cordoned/unschedulable nodes
   - Note node roles (control-plane, worker)

2. For each node, check conditions:
   - Ready: Must be True
   - MemoryPressure: Should be False
   - DiskPressure: Should be False
   - PIDPressure: Should be False
   - NetworkUnavailable: Should be False

3. k8s_get_node_metrics()
   - CPU usage per node
   - Memory usage per node
   - Flag nodes over 80% utilized

Phase 2: Control Plane Health

Verify K3s system components.

4. k8s_get_pods(namespace="kube-system")
   - Check all kube-system pods are Running
   - Key pods to verify:
     - coredns
     - local-path-provisioner
     - metrics-server
     - traefik (if using default ingress)
   - Note any CrashLoopBackOff or Pending pods

5. k8s_get_cluster_info()
   - Verify API server is reachable
   - Note cluster version

Phase 3: Workload Health

Check all user workloads across namespaces.

6. k8s_get_namespaces()
   - List all namespaces
   - Note any in non-Active state

7. k8s_get_all_unhealthy()
   - Get all unhealthy pods cluster-wide
   - Get all unhealthy deployments
   - Group by namespace

8. For each namespace with issues:
   - Note which pods/deployments are affected
   - Categorize by issue type (Pending, CrashLoop, etc.)

Phase 4: Resource Analysis

Assess cluster capacity.

9. Calculate cluster-wide metrics:
   - Total CPU capacity vs requests vs usage
   - Total memory capacity vs requests vs usage
   - Number of pods vs pod limit

10. k8s_get_pods(all_namespaces=True)
    - Count total pods
    - Count by phase (Running, Pending, Failed)
    - Note scheduling pressure

11. For each node:
    - Pods scheduled on node
    - Resources consumed vs allocatable

Phase 5: Recent Events

Check for cluster-wide issues.

12. k8s_get_events(all_namespaces=True)
    - Get Warning events from last hour
    - Group by type (FailedScheduling, OOM, etc.)
    - Identify patterns

13. Look for common cluster issues:
    - Multiple FailedScheduling: Resource exhaustion
    - Multiple OOMKilled: Memory pressure
    - Multiple ImagePull errors: Registry issues
    - Multiple NodeNotReady: Network/node issues

Phase 6: Service Mesh / Networking

Check cluster networking.

14. k8s_get_services(all_namespaces=True)
    - List all services
    - Note any without endpoints

15. Check ingress (if applicable):
    - k8s_get_pods(namespace="kube-system", labels="app.kubernetes.io/name=traefik")
    - Verify ingress controller is healthy

Phase 7: Storage

Check persistent storage.

16. k8s_describe(resource_type="pvc", all_namespaces=True) or list PVCs
    - Check for Pending PVCs
    - Note any bound PVC issues

17. Check local-path-provisioner if used:
    - Verify it's running
    - Check for PV provisioning events

Health Score Calculation

Calculate an overall health score (0-100):

Component	Weight	Healthy Criteria
Nodes	30%	All Ready, no pressure conditions
Control Plane	25%	All kube-system pods Running
Workloads	25%	<5% pods unhealthy
Resources	15%	<80% cluster utilization
Events	5%	<10 Warning events/hour

Common Issues and Solutions

Node NotReady

Check kubelet logs on the node
Verify network connectivity
Check if node needs reboot

kube-system Pods Failing

CoreDNS crash: Check DNS config, resources
metrics-server: May need more memory
traefik: Check certificate/config issues

High Resource Usage

80% CPU: Add nodes or reduce workloads
80% Memory: Add nodes, check for leaks
High pod count: Check for runaway replicasets

Many Pending Pods

Resource exhaustion: Scale cluster
Node affinity: Relax constraints
PVC issues: Check storage provisioner

Expected Output

The health check should produce:

Node Summary: X/Y nodes healthy, resource usage
Control Plane: All system pods status
Workload Summary: Healthy/unhealthy by namespace
Resource Capacity: CPU, memory, pod capacity
Recent Issues: Grouped warning events
Health Score: 0-100 with breakdown
Recommendations: Prioritized action items

Example summary:

## K3s Cluster Health Report

**Overall Score: 85/100**

### Nodes (5/5 healthy)
- k3s-01 (control-plane): Ready, CPU 45%, Mem 62%
- k3s-02 (worker): Ready, CPU 38%, Mem 55%
- k3s-03 (worker): Ready, CPU 52%, Mem 71%
- k3s-04 (worker): Ready, CPU 29%, Mem 48%
- k3s-05 (worker): Ready, CPU 61%, Mem 78%

### Control Plane: Healthy
- coredns: 2/2 Running
- metrics-server: 1/1 Running
- traefik: 1/1 Running

### Workloads: 2 issues
- apps/cfoperator-abc123: CrashLoopBackOff (1 restart)
- monitoring/prometheus-0: High memory (92%)

### Recommendations
1. Investigate cfoperator pod crash
2. Consider increasing prometheus memory limit
3. k3s-05 approaching memory pressure (78%)