k8slogs - SKILL.md Agent Skill

name: k8s:logs description: Query and analyze logs from Kagenti platform components, search for errors, and investigate issues

Check Logs Skill

This skill helps you query and analyze logs from the Kagenti platform components.

Context-Safe Execution (MANDATORY)

All kubectl log commands MUST redirect output to files. Log output is the single biggest source of context pollution. Commands below are shown in bare form for readability. When executing, always redirect:

export LOG_DIR=/tmp/kagenti/k8s/${CLUSTER:-local}
mkdir -p $LOG_DIR

# Pattern for all log commands:
kubectl logs -n <ns> <target> --tail=100 > $LOG_DIR/<component>.log 2>&1 && echo "OK" || echo "FAIL"

# NEVER dump logs into main context — always analyze in subagent:
# Task(subagent_type='Explore') to read $LOG_DIR/<component>.log and:
#   - Find errors/warnings
#   - Check for specific patterns
#   - Return a concise summary

When to Use

User asks "show me logs for X"
Investigating errors or failures
After deployments to check for issues
Debugging pod crashes or restarts
Analyzing application behavior
Finding root cause of incidents

Quick Log Access

View Logs for Common Components

# Weather Tool
kubectl logs -n team1 deployment/weather-tool --tail=100
kubectl logs -n team1 deployment/weather-tool --tail=100 -f  # Follow

# Weather Service
kubectl logs -n team1 deployment/weather-service --tail=100

# Keycloak
kubectl logs -n keycloak deployment/keycloak --tail=100 2>/dev/null || \
kubectl logs -n keycloak statefulset/keycloak --tail=100

# Platform Operator
kubectl logs -n kagenti-system -l control-plane=controller-manager --tail=100

# Istio Control Plane
kubectl logs -n istio-system deployment/istiod --tail=100

# SPIRE Server
kubectl logs -n spire-server deployment/spire-server --tail=100

# Tekton Controller
kubectl logs -n tekton-pipelines deployment/tekton-pipelines-controller --tail=100

# Kagenti UI
kubectl logs -n kagenti-platform deployment/kagenti-ui --tail=100

Check Logs from Previous Container (After Crash)

# If pod is crashlooping, check logs before the crash
kubectl logs -n <namespace> <pod-name> --previous

# Example for weather-tool
kubectl logs -n team1 deployment/weather-tool --previous

Follow Logs in Real-Time

# Follow logs
kubectl logs -n <namespace> <pod-name> -f --tail=20

# Follow multiple containers
kubectl logs -n <namespace> <pod-name> --all-containers=true -f --tail=20

Search for Errors

Find Errors Across All Namespaces

# Search for errors in all pods
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== Checking namespace: $ns ==="
  kubectl logs -n $ns --all-containers=true --tail=100 2>&1 | \
    grep -iE "error|fatal|exception|panic" | head -10
done

Find Errors in Specific Namespace

# Team1 namespace
kubectl logs -n team1 --all-containers=true --tail=200 | \
  grep -iE "error|fatal|exception"

# Keycloak namespace
kubectl logs -n keycloak --all-containers=true --tail=200 | \
  grep -iE "error|fatal|exception"

# Platform operator
kubectl logs -n kagenti-system -l control-plane=controller-manager --tail=200 | \
  grep -iE "error|fatal|exception"

Search for Specific Error Patterns

# Connection errors
kubectl logs -n <namespace> <pod-name> | \
  grep -iE "connection (refused|timeout|reset|failed)"

# Authentication failures
kubectl logs -n keycloak --all-containers=true | \
  grep -iE "auth.*fail|unauthorized|forbidden|invalid.*token"

# OOM (Out of Memory) errors
kubectl logs -n <namespace> <pod-name> --previous | \
  grep -iE "OOM|out of memory|oom.*kill"

# Image pull errors (check events instead of logs)
kubectl get events -n <namespace> | grep -i "pull"

# Network errors
kubectl logs -n <namespace> <pod-name> | \
  grep -iE "network|dns|timeout|unreachable"

Component-Specific Log Checks

Weather Tool & Service (Demo Agents)

# Check weather-tool logs
kubectl logs -n team1 deployment/weather-tool --tail=100

# Check for errors
kubectl logs -n team1 deployment/weather-tool | grep -iE "error|fail"

# Check weather-service logs (agent)
kubectl logs -n team1 deployment/weather-service --tail=100

# Check for LLM/API errors
kubectl logs -n team1 deployment/weather-service | grep -iE "api|llm|openai|error"

# Check sidecar (Istio proxy) logs
POD=$(kubectl get pod -n team1 -l app=weather-tool -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n team1 $POD -c istio-proxy --tail=50

Keycloak (Authentication)

# Keycloak application logs
kubectl logs -n keycloak deployment/keycloak -c keycloak --tail=100 2>/dev/null || \
kubectl logs -n keycloak statefulset/keycloak --tail=100

# Search for authentication errors
kubectl logs -n keycloak --all-containers=true | \
  grep -iE "login.*fail|authentication.*error|invalid.*credential"

# Search for startup errors
kubectl logs -n keycloak --all-containers=true --tail=500 | \
  grep -iE "error|exception|fail" | head -20

# PostgreSQL logs (Keycloak backend)
kubectl logs -n keycloak deployment/postgresql --tail=100 2>/dev/null

Platform Operator

# Operator logs
kubectl logs -n kagenti-system -l control-plane=controller-manager --tail=100

# Search for reconciliation errors
kubectl logs -n kagenti-system -l control-plane=controller-manager | \
  grep -iE "error|fail|reconcile"

# Check Component CRD processing
kubectl logs -n kagenti-system -l control-plane=controller-manager | \
  grep -i "component"

Istio Service Mesh

# Istiod (control plane) logs
kubectl logs -n istio-system deployment/istiod --tail=100

# Search for certificate/mTLS errors
kubectl logs -n istio-system deployment/istiod | \
  grep -iE "tls|certificate|mtls|error"

# Check specific pod's sidecar logs
kubectl logs -n <namespace> <pod-name> -c istio-proxy --tail=50

# Search for connection errors in sidecar
kubectl logs -n <namespace> <pod-name> -c istio-proxy | \
  grep -iE "error|upstream|connection"

Tekton Pipelines (Build System)

# Tekton controller logs
kubectl logs -n tekton-pipelines deployment/tekton-pipelines-controller --tail=100

# Tekton webhook logs
kubectl logs -n tekton-pipelines deployment/tekton-pipelines-webhook --tail=100

# Check pipeline run logs
PIPELINERUN=$(kubectl get pipelinerun -A -o jsonpath='{.items[-1:].metadata.name}')
NAMESPACE=$(kubectl get pipelinerun -A -o jsonpath='{.items[-1:].metadata.namespace}')
kubectl logs -n $NAMESPACE pipelinerun/$PIPELINERUN

# Check for build errors
kubectl get pipelineruns -A -o json | \
  jq -r '.items[] | select(.status.conditions[]?.type=="Succeeded" and .status.conditions[]?.status=="False") | .metadata.name'

SPIRE (Workload Identity)

# SPIRE Server logs
kubectl logs -n spire-server deployment/spire-server --tail=100

# SPIRE Agent logs (runs on nodes)
kubectl logs -n spire-mgmt daemonset/spire-agent --tail=50

# Search for identity/attestation errors
kubectl logs -n spire-server deployment/spire-server | \
  grep -iE "error|attest|identity|fail"

Log Analysis Patterns

Detect Crash Loops

# Find pods with high restart counts
kubectl get pods -A | awk '$4 > 3 {print $0}'

# For each crashlooping pod, check previous logs
kubectl get pods -A --field-selector=status.phase=Running | \
  awk '$4 > 3 {print $1, $2}' | \
  while read ns pod; do
    echo "=== Logs for $pod in $ns (before crash) ==="
    kubectl logs -n $ns $pod --previous --tail=30 2>&1
    echo
  done

Find HTTP Errors

# Search for HTTP error codes in logs
kubectl logs -n <namespace> <pod-name> | \
  grep -oE "HTTP/[0-9.]+ [4-5][0-9]{2}" | sort | uniq -c

# Example: Weather service HTTP errors
kubectl logs -n team1 deployment/weather-service | \
  grep -oE "[4-5][0-9]{2}" | sort | uniq -c

Timeline Analysis

# Get logs with timestamps
kubectl logs -n <namespace> <pod-name> --timestamps=true --tail=200

# Filter logs by time window (last 10 minutes)
kubectl logs -n <namespace> <pod-name> --since=10m

# Logs since specific time
kubectl logs -n <namespace> <pod-name> --since-time='2025-11-21T12:00:00Z'

Aggregate Logs from Multiple Pods

# Get logs from all pods with same label
kubectl logs -n <namespace> -l app=<label> --all-containers=true --tail=50

# Example: All weather-tool pods
kubectl logs -n team1 -l app=weather-tool --all-containers=true --tail=100

# Save to file for analysis
kubectl logs -n team1 -l app=weather-tool --all-containers=true --tail=500 > weather-tool-logs.txt

Advanced Log Queries

Compare Logs Before and After Event

# Save logs before change
kubectl logs -n <namespace> <pod-name> --tail=500 > before.log

# Make change, wait, then capture after
kubectl logs -n <namespace> <pod-name> --tail=500 > after.log

# Compare
diff before.log after.log

Extract Structured Logs (JSON)

# If logs are in JSON format
kubectl logs -n <namespace> <pod-name> --tail=100 | \
  while read line; do echo "$line" | python3 -m json.tool 2>/dev/null || echo "$line"; done

# Extract specific JSON fields
kubectl logs -n <namespace> <pod-name> --tail=100 | \
  grep -E '^\{' | jq -r '.level, .message'

Log Volume Analysis

# Count log lines per pod
for pod in $(kubectl get pods -n <namespace> -o jsonpath='{.items[*].metadata.name}'); do
  count=$(kubectl logs -n <namespace> $pod --tail=1000 2>/dev/null | wc -l)
  echo "$pod: $count lines"
done

Troubleshooting with Logs

Issue: Pod not starting

# Check init container logs
kubectl logs -n <namespace> <pod-name> -c <init-container-name>

# Check events + logs together
kubectl describe pod -n <namespace> <pod-name>
kubectl logs -n <namespace> <pod-name> --all-containers=true

Issue: Intermittent failures

# Follow logs and watch for errors
kubectl logs -n <namespace> <pod-name> -f | grep -iE "error|fail"

# Capture logs over time
while true; do
  kubectl logs -n <namespace> <pod-name> --since=1m --timestamps >> captured-logs.txt
  sleep 60
done

Issue: Performance degradation

# Look for slow query logs
kubectl logs -n <namespace> <pod-name> | grep -iE "slow|timeout|latency"

# Look for resource warnings
kubectl logs -n <namespace> <pod-name> | grep -iE "memory|cpu|resource"

Log Retention and Export

Export Logs for Analysis

# Export all logs from namespace
kubectl logs -n <namespace> --all-containers=true --tail=10000 > namespace-logs.txt

# Export logs from multiple components
mkdir -p logs
kubectl logs -n team1 deployment/weather-tool > logs/weather-tool.log
kubectl logs -n team1 deployment/weather-service > logs/weather-service.log
kubectl logs -n keycloak --all-containers=true > logs/keycloak.log
kubectl logs -n kagenti-system -l control-plane=controller-manager > logs/operator.log

# Create tarball
tar -czf logs-$(date +%Y%m%d-%H%M%S).tar.gz logs/

Share Logs for Support

# Create debug bundle
./scripts/collect-debug-logs.sh  # If script exists

# Or manually collect key logs
mkdir -p debug-bundle
kubectl get pods -A > debug-bundle/pods.txt
kubectl get events -A --sort-by='.lastTimestamp' > debug-bundle/events.txt
kubectl logs -n team1 --all-containers=true --tail=500 > debug-bundle/team1-logs.txt
kubectl logs -n keycloak --all-containers=true --tail=500 > debug-bundle/keycloak-logs.txt
tar -czf debug-bundle-$(date +%Y%m%d-%H%M%S).tar.gz debug-bundle/

Pro Tips

Always check previous logs: For crashlooping pods, --previous shows what caused the crash
Use grep filters: Don't dump all logs, filter for errors/warnings first
Check timestamps: Use --timestamps=true to correlate logs with events
Check all containers: Use --all-containers=true for pods with sidecars
Tail appropriately: Start with --tail=100, increase if needed
Follow for real-time: Use -f to watch logs as they happen
Save to files: For complex analysis, save logs to files
Combine with events: Logs + events give complete picture
Check sidecar logs: Istio proxy logs often reveal network issues
Use since flags: --since=10m for recent logs only

Related Skills

k8s:health: Check overall platform health first
k8s:pods: Debug specific pod issues
kagenti:deploy: Redeploy if logs show persistent issues