k8shealth - SKILL.md Agent Skill

name: k8s:health
description: Check comprehensive platform health including deployments, pods, services, certificates, and resources across the Kagenti platform

Platform Health Check Skill

This skill helps you perform comprehensive platform health checks and identify issues quickly.

Context-Safe Execution (MANDATORY)

All kubectl/oc commands MUST redirect output to files. Commands below are shown in bare form for readability. When executing, always redirect:

export LOG_DIR="/tmp/kagenti/k8s/${CLUSTER:-local}"
mkdir -p "$LOG_DIR"

# Example: health check script
.github/scripts/verify_deployment.sh > "$LOG_DIR/health-check.log" 2>&1 && echo "OK: healthy" || echo "FAIL (see $LOG_DIR/health-check.log)"

# Example: kubectl commands
kubectl get pods -A > "$LOG_DIR/all-pods.log" 2>&1 && echo "OK" || echo "FAIL"
kubectl get deployments -A > "$LOG_DIR/deployments.log" 2>&1 && echo "OK" || echo "FAIL"

# Analyze results in subagent — NEVER read large output in main context
# Use Task(subagent_type='Explore') to read log files and return summaries
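
The redirect-and-summarize pattern above can be wrapped in a small helper so that every command follows the same rule. This is a minimal sketch, not part of the repository: the function name `run_logged` and the `<name>.log` file naming are assumptions made for illustration.

```shell
# Hypothetical helper: run a command, send all of its output to a log file,
# and print only a one-line verdict to the main context.
LOG_DIR="${LOG_DIR:-/tmp/kagenti/k8s/${CLUSTER:-local}}"

run_logged() {
  local name="$1"
  shift
  mkdir -p "$LOG_DIR"
  if "$@" > "$LOG_DIR/$name.log" 2>&1; then
    echo "OK: $name"
  else
    echo "FAIL: $name (see $LOG_DIR/$name.log)"
    return 1
  fi
}

# Usage (assumes kubectl is configured for the target cluster):
#   run_logged all-pods kubectl get pods -A
#   run_logged deployments kubectl get deployments -A
```

Only the verdict line reaches the caller; the full output stays in the log file for a subagent to read.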

When to Use

After deployments or cluster restarts
Before making changes (baseline health)
During incident investigation
Regular health monitoring
After running tests
User asks "check platform" or "is everything working"

Quick Health Check

Automated Health Check Script

# Run the comprehensive health check (from CI)
chmod +x .github/scripts/verify_deployment.sh
.github/scripts/verify_deployment.sh

# What it checks:
# ✓ Resource usage (RAM, disk, CPU, Docker containers)
# ✓ Deployment status (weather-tool, weather-service, keycloak, operator)
# ✓ Pod health summary (running, pending, failed, crashloop)
# ✓ Failed pod details with events and error logs
# ✓ Iterates until healthy or timeout (default: 20 iterations × 15s = 5 minutes)

# Configure timeout
MAX_ITERATIONS=30 POLL_INTERVAL=20 .github/scripts/verify_deployment.sh
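
The total wait time is the product of the two settings above. The sketch below reproduces that arithmetic; `total_timeout` is a hypothetical helper name, and the defaults of 20 and 15 are the ones quoted in the comments above.

```shell
# Hypothetical helper: total timeout = iterations x poll interval (seconds).
# Minutes are rounded down by integer division.
total_timeout() {
  local iterations="${1:-20}"
  local interval="${2:-15}"
  local total=$(( iterations * interval ))
  echo "${total}s ($(( total / 60 ))m)"
}

# Usage:
#   total_timeout          # defaults from the script
#   total_timeout 30 20    # the override shown above
```

With the override shown above (30 iterations at 20 seconds each), the script waits up to 10 minutes instead of the default 5.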

Expected Output:

===================================================================
  Kagenti Deployment Health Monitor
===================================================================

Configuration:
  Max Iterations: 20
  Poll Interval: 15s
  Total Timeout: 300s (5m)

━━━ Resource Usage ━━━
  Memory: 8.23/15.50 GB (53.1% used)
  Disk: 45G/234G (20% used)
  Load Avg (1/5/15m): 2.1 1.8 1.5
  Docker Containers: 12 running

━━━ Deployment Status ━━━
  ✓ weather-tool: 1/1 ready
  ✓ weather-service: 1/1 ready
  ✓ keycloak: 1/1 ready
  ✓ platform-operator: 1 ready

━━━ Pod Health Summary ━━━
  Total Pods: 45
  Running: 43
  Pending: 2

====================================================================
✓ Deployment is HEALTHY
====================================================================

Run E2E Tests

cd kagenti

# Install test dependencies (first time)
uv pip install -r tests/requirements.txt

# Run all deployment health tests
uv run pytest tests/e2e/test_deployment_health.py -v

# Run only critical tests
uv run pytest tests/e2e/test_deployment_health.py -v --only-critical

# Exclude specific apps
uv run pytest tests/e2e/test_deployment_health.py -v --exclude-app=keycloak

Tests check:

✓ No failed pods
✓ No crashlooping pods (>3 restarts)
✓ weather-tool deployment ready
✓ weather-service deployment ready
✓ Keycloak deployment ready
✓ Platform Operator ready
✓ Services have endpoints

Manual Health Checks

Quick Status Commands

# All pods across all namespaces
kubectl get pods -A

# All pods sorted by status
kubectl get pods -A --sort-by=.status.phase

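# Pods not in Running or Succeeded phase
# (CrashLoopBackOff pods still report phase Running, so they are not listed here)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded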

# Pods with high restart count (with -A, RESTARTS is column 5; column 4 is STATUS)
kubectl get pods -A --no-headers | awk '$5 + 0 > 3'

# All deployments
kubectl get deployments -A

# All services
kubectl get svc -A

# All namespaces
kubectl get ns
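
The pod listings above can be condensed into short summaries before anything is read into context. The following is a sketch under one assumption: the input is the output of `kubectl get pods -A --no-headers`, whose columns are NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE. The function names `high_restarts` and `status_summary` are hypothetical.

```shell
# Both helpers read `kubectl get pods -A --no-headers` output on stdin.

# Print pods whose restart count exceeds a threshold (default 3).
high_restarts() {
  local threshold="${1:-3}"
  awk -v t="$threshold" '$5 + 0 > t { print $1 "/" $2 " restarts=" $5 }'
}

# Count pods per value of the STATUS column.
status_summary() {
  awk '{ count[$4]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage (assumes the listing was saved first, per the redirect rule):
#   kubectl get pods -A --no-headers > "$LOG_DIR/all-pods.log" 2>&1
#   high_restarts 3 < "$LOG_DIR/all-pods.log"
#   status_summary < "$LOG_DIR/all-pods.log"
```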

Platform Components Status

# Core platform namespaces
kubectl get pods -n kagenti-system       # Platform Operator
kubectl get pods -n keycloak              # Keycloak
kubectl get pods -n istio-system          # Istio
kubectl get pods -n spire-server          # SPIRE
kubectl get pods -n tekton-pipelines      # Tekton
kubectl get pods -n cert-manager          # Cert-Manager

# Agent namespaces
kubectl get pods -n team1                 # Team1 agents/tools
kubectl get pods -n team2                 # Team2 agents/tools

# Optional observability (if addons installed)
kubectl get pods -n observability         # Prometheus, Kiali, Phoenix

Check Specific Components

Weather Tool & Service (Demo Agents)

# Deployments
kubectl get deployment -n team1 weather-tool
kubectl get deployment -n team1 weather-service

# Pods
kubectl get pods -n team1 -l app=weather-tool
kubectl get pods -n team1 -l app=weather-service

# Services & Endpoints
kubectl get svc -n team1 weather-tool
kubectl get endpoints -n team1 weather-tool
kubectl get svc -n team1 weather-service
kubectl get endpoints -n team1 weather-service

# Check logs
kubectl logs -n team1 deployment/weather-tool --tail=50
kubectl logs -n team1 deployment/weather-service --tail=50

Keycloak (Authentication)

# Check deployment/statefulset
kubectl get deployment -n keycloak keycloak 2>/dev/null || kubectl get statefulset -n keycloak keycloak

# Check pods
kubectl get pods -n keycloak -l app=keycloak

# Check logs
kubectl logs -n keycloak deployment/keycloak --tail=50 2>/dev/null || \
kubectl logs -n keycloak statefulset/keycloak --tail=50

# Test Keycloak endpoint
kubectl exec -n keycloak deployment/keycloak -c keycloak -- \
  curl -sf http://localhost:8080/health/ready || echo "Keycloak not ready"

# Access Keycloak UI (`open` is the macOS command; use xdg-open on Linux)
open http://keycloak.localtest.me:8080

Platform Operator

# Check operator deployment
kubectl get deployment -n kagenti-system -l control-plane=controller-manager

# Check operator pods
kubectl get pods -n kagenti-system -l control-plane=controller-manager

# Check operator logs
kubectl logs -n kagenti-system deployment/<operator-name> --tail=100

# List Component custom resources
kubectl get components -A

Istio Service Mesh

# Istio control plane
kubectl get pods -n istio-system

# Check sidecar injection (should show 2/2 for injected pods)
kubectl get pods -A -o wide | grep "2/2"

# Istio gateway
kubectl get gateway -A

# Virtual services
kubectl get virtualservice -A

# Destination rules
kubectl get destinationrule -A
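
As a complement to grepping for "2/2", the inverse check lists pods in a namespace that have fewer than two containers and therefore cannot carry a sidecar. This is only a heuristic sketch: it assumes sidecar-mode injection, the input format of `kubectl get pods -n <namespace> --no-headers` (NAME, READY, STATUS, RESTARTS, AGE), and a hypothetical function name `missing_sidecar`. A multi-container pod without a sidecar would pass unnoticed.

```shell
# Reads `kubectl get pods -n <namespace> --no-headers` output on stdin.
# Prints pods whose READY column reports fewer than two containers in total.
missing_sidecar() {
  awk '{ split($2, ready, "/"); if (ready[2] + 0 < 2) print $1 }'
}

# Usage:
#   kubectl get pods -n team1 --no-headers | missing_sidecar
```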

SPIRE (Workload Identity)

# SPIRE Server
kubectl get pods -n spire-server

# SPIRE Agents (should be running on nodes)
kubectl get pods -n spire-mgmt

# Check SPIRE Server logs
kubectl logs -n spire-server deployment/spire-server --tail=50

Tekton Pipelines (Build System)

# Tekton components
kubectl get pods -n tekton-pipelines

# Pipeline runs
kubectl get pipelineruns -A

# Task runs
kubectl get taskruns -A

# Recent pipeline runs status
kubectl get pipelineruns -A --sort-by=.metadata.creationTimestamp | tail -10

Resource Usage

# Node resources (if metrics-server installed)
kubectl top nodes

# Pod resources
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20

# Namespace resource usage
kubectl top pods -n team1
kubectl top pods -n keycloak
kubectl top pods -n kagenti-system

# Docker container stats
docker stats --no-stream

Events (Recent Issues)

# All recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Events in specific namespace
kubectl get events -n team1 --sort-by='.lastTimestamp'

# Warning events only
kubectl get events -A --field-selector type=Warning

# Events for specific pod
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

Component-Specific Health Checks

Keycloak Authentication

# Check Keycloak readiness
kubectl exec -n keycloak deployment/keycloak -c keycloak -- \
  curl -sf http://localhost:8080/health/ready && echo "✓ Keycloak Ready" || echo "✗ Keycloak Not Ready"

# Get admin credentials
KEYCLOAK_USER=$(kubectl get secret -n keycloak keycloak-initial-admin -o jsonpath='{.data.username}' | base64 -d)
KEYCLOAK_PASS=$(kubectl get secret -n keycloak keycloak-initial-admin -o jsonpath='{.data.password}' | base64 -d)
echo "Username: $KEYCLOAK_USER"
echo "Password: $KEYCLOAK_PASS"

# Test Keycloak OIDC endpoint
curl -s "http://keycloak.localtest.me:8080/realms/master/.well-known/openid-configuration" | python3 -m json.tool

Kagenti UI

# Check UI deployment
kubectl get deployment -n kagenti-system kagenti-ui

# Check UI pods
kubectl get pods -n kagenti-system -l app=kagenti-ui

# Check UI logs
kubectl logs -n kagenti-system deployment/kagenti-ui --tail=50

# Access UI
open http://kagenti-ui.localtest.me:8080

Observability Stack (if addons installed)

# Prometheus
kubectl get pods -n observability -l app=prometheus
kubectl exec -n observability deployment/prometheus -- \
  curl -sf http://localhost:9090/-/ready && echo "✓ Prometheus Ready" || echo "✗ Not Ready"

# Port-forward to access
kubectl port-forward -n observability svc/prometheus 9090:9090 &
open http://localhost:9090

# Kiali
kubectl get pods -n observability -l app=kiali
kubectl port-forward -n observability svc/kiali 20001:20001 &
open http://localhost:20001

# Phoenix (LLM tracing)
kubectl get pods -n observability -l app=phoenix
open http://phoenix.localtest.me:8080

Health Check Checklists

Post-Deployment Health Check

All critical deployments ready (weather-tool, weather-service, keycloak, operator)
No pods in CrashLoopBackOff/ImagePullBackOff/Error
All services have endpoints
Resource usage within limits (< 80% memory, < 70% CPU)
No warning/error events in last 5 minutes
E2E tests passing
Platform services accessible
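
The resource item in this checklist can be checked mechanically against node metrics. The sketch below applies the 70% CPU and 80% memory limits from the checklist to the output of `kubectl top nodes --no-headers`, assumed to have the columns NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%. The function name `over_threshold` is hypothetical, and metrics-server must be installed for `kubectl top` to work.

```shell
# Reads `kubectl top nodes --no-headers` output on stdin.
# Prints nodes at or above the CPU limit (default 70) or memory limit (default 80).
over_threshold() {
  local cpu_max="${1:-70}"
  local mem_max="${2:-80}"
  awk -v c="$cpu_max" -v m="$mem_max" '
    { cpu = $3 + 0; mem = $5 + 0 }   # a value such as "12%" converts to 12
    cpu >= c || mem >= m { print $1 " cpu=" cpu "% mem=" mem "%" }
  '
}

# Usage:
#   kubectl top nodes --no-headers | over_threshold
#   kubectl top nodes --no-headers | over_threshold 60 75
```

Empty output means every node is within the limits.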

Pre-Change Health Check

Capture current pod list: kubectl get pods -A > baseline-pods.txt
All critical components healthy
No existing issues in logs
Resource headroom available
Recent Git commits validated
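
A saved baseline is most useful when it can be compared against a later capture. The following sketch compares two `kubectl get pods -A` listings by namespace/name and STATUS only, ignoring AGE and RESTARTS, which change between captures anyway. The function names `pod_states` and `pods_changed` are hypothetical. Pods recreated by a rollout get new names and will show up as differences.

```shell
# Reduce a saved `kubectl get pods -A` listing to "namespace/name STATUS" lines.
pod_states() {
  awk '$1 != "NAMESPACE" { print $1 "/" $2, $4 }' "$1" | sort
}

# Diff two saved listings; exit status 0 means no pod changed state.
pods_changed() {
  local before after status=0
  before="$(mktemp)"
  after="$(mktemp)"
  pod_states "$1" > "$before"
  pod_states "$2" > "$after"
  diff "$before" "$after" || status=$?
  rm -f "$before" "$after"
  return "$status"
}

# Usage:
#   kubectl get pods -A > baseline-pods.txt      # before the change
#   kubectl get pods -A > current-pods.txt       # after the change
#   pods_changed baseline-pods.txt current-pods.txt
```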

Incident Investigation Health Check

Identify degraded components
Check recent events: kubectl get events -A --sort-by='.lastTimestamp' | tail -30
Collect logs from affected pods
Check for resource exhaustion
Review recent changes

Common Health Issues

Issue: Pods stuck in Pending

# Check pod description for reason
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Insufficient CPU/memory
# - No nodes available
# - Unbound PersistentVolumeClaim
# - Image pull errors

# Check node resources
kubectl top nodes
kubectl describe node <node-name>

Issue: Pods in CrashLoopBackOff

# Check previous logs (before crash)
kubectl logs <pod-name> -n <namespace> --previous

# Check current logs
kubectl logs <pod-name> -n <namespace>

# Check events
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

# Describe pod for error details
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Application error on startup
# - Missing configuration/secrets
# - Dependency not available
# - Liveness or startup probe failing (a failing readiness probe alone does not restart the container)

Issue: Deployment not ready

# Check deployment status
kubectl get deployment -n <namespace> <deployment-name>
kubectl describe deployment -n <namespace> <deployment-name>

# Check replica set
kubectl get rs -n <namespace>
kubectl describe rs -n <namespace> <replicaset-name>

# Check pods
kubectl get pods -n <namespace> -l app=<label>

# Force rollout restart
kubectl rollout restart deployment/<deployment-name> -n <namespace>

# Check rollout status
kubectl rollout status deployment/<deployment-name> -n <namespace>

Issue: Service has no endpoints

# Check service
kubectl get svc -n <namespace> <service-name>
kubectl describe svc -n <namespace> <service-name>

# Check endpoints
kubectl get endpoints -n <namespace> <service-name>

# Common causes:
# - No pods with matching labels
# - Pods not ready (failing health checks)
# - Selector mismatch

# Verify pod labels match service selector
kubectl get pods -n <namespace> --show-labels
kubectl get svc -n <namespace> <service-name> -o yaml | grep -A5 selector
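
The label comparison above can be done without reading YAML by eye. This is a sketch with a hypothetical function name `selector_matches`; it assumes both arguments are comma-separated `key=value` lists, which is the form shown in the SELECTOR column of `kubectl get svc -o wide` and the LABELS column of `kubectl get pods --show-labels`.

```shell
# Returns 0 when every key=value pair in the selector appears in the pod labels.
# An empty selector is treated as a mismatch.
selector_matches() {
  local selector="$1" labels="$2" pair
  local IFS=','
  [ -n "$selector" ] || return 1
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;      # this pair is present, keep checking
      *) return 1 ;;       # one missing pair is enough to fail
    esac
  done
  return 0
}

# Usage (values copied from the SELECTOR and LABELS columns):
#   selector_matches "app=weather-tool" "app=weather-tool,pod-template-hash=abc" \
#     && echo "labels match" || echo "selector mismatch"
```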

Issue: High resource usage

# Find top consumers
kubectl top pods -A --sort-by=memory | head -10
kubectl top pods -A --sort-by=cpu | head -10

# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Limits:"

# Check for OOM kills
kubectl get events -A | grep -i "OOMKilled"

# Increase resources (edit deployment)
kubectl edit deployment -n <namespace> <deployment-name>
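
Grepping events may miss OOM kills, since the reason is recorded in each container's last termination state (shown as "Last State" in `kubectl describe pod`). A sketch of a more direct check follows. The function name `oom_killed_pods` is hypothetical, and the JSONPath template in the usage comment is an untested assumption about field layout that should be verified against the cluster's kubectl version.

```shell
# Reads lines of "<namespace>/<pod> [<reason> ...]" on stdin, one pod per line,
# and prints the pods where any container was last terminated with OOMKilled.
oom_killed_pods() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "OOMKilled") { print $1; break } }'
}

# Usage (one way to produce the expected input):
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{" "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
#     | oom_killed_pods
```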

Issue: ImagePullBackOff

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Image doesn't exist
# - Wrong image tag
# - No access to registry
# - Network issues

# For Kind cluster, check if image is loaded
docker exec agent-platform-control-plane crictl images | grep <image-name>

# Load image into Kind
kind load docker-image <image-name> --name agent-platform

Automated Monitoring

Watch Commands

# Watch all pods
watch -n 5 'kubectl get pods -A'

# Watch failing pods only
watch -n 5 'kubectl get pods -A | grep -vE "Running|Completed"'

# Watch deployments
watch -n 5 'kubectl get deployments -A'

# Watch specific namespace
watch -n 5 'kubectl get pods -n team1'

# Watch events
watch -n 10 'kubectl get events -A --sort-by=.lastTimestamp | tail -20'

Continuous Health Monitoring

# Run health check in loop
while true; do
  echo "=== Health Check $(date) ==="
  .github/scripts/verify_deployment.sh
  echo "Waiting 5 minutes..."
  sleep 300
done
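
The loop above runs until interrupted. For unattended runs, a bounded variant that stops at the first healthy result or after a fixed number of attempts may be easier to manage. This is a minimal sketch; `poll_until` is a hypothetical helper, not something the repository provides.

```shell
# Retry a check until it succeeds or the attempts run out.
# Arguments: <max attempts> <seconds between attempts> <command> [args...]
poll_until() {
  local max="$1" interval="$2" i=1
  shift 2
  while [ "$i" -le "$max" ]; do
    if "$@"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$max" ]; then
      sleep "$interval"
    fi
    i=$(( i + 1 ))
  done
  echo "still unhealthy after $max attempt(s)"
  return 1
}

# Usage: up to 12 checks, 5 minutes apart.
#   poll_until 12 300 .github/scripts/verify_deployment.sh
```

The non-zero exit status on failure lets a calling script or scheduler react, which the endless loop cannot do.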

Integration with Other Skills

If the health check finds issues:

Use k8s:logs skill to examine error logs
Use k8s:pods skill for pod debugging
Use kagenti:deploy skill if full redeploy needed

Pro Tips

Always baseline first: Run health check BEFORE making changes
Use automated script: .github/scripts/verify_deployment.sh for comprehensive check
Run E2E tests: Tests validate end-to-end functionality
Check critical components first: weather-tool, keycloak, operator
Look for patterns: Multiple pods failing at once often points to a cluster-wide issue
Check events: Recent events often reveal root cause
Verify after fixes: Always re-run health check after remediation
Use --previous logs: For crashlooping pods, check logs before crash

Related Skills

kagenti:deploy: Deploy or redeploy the platform
k8s:logs: Query and analyze logs
k8s:pods: Debug specific pod issues