name: k8s:live-debugging
description: Iterative debugging workflow for fixing issues on a running cluster
Live Cluster Debugging Workflow
Iterative debugging workflow for fixing issues on a running HyperShift cluster.
Context-Safe Execution (MANDATORY)
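All kubectl/oc commands MUST redirect output to files. Live debugging pollutes context more than other workflows because its check-fix-recheck loop repeats the same commands many times. The commands in the sections below are shown without the redirect; apply the same pattern when running them.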
export LOG_DIR="/tmp/kagenti/k8s/${CLUSTER:-local}"
mkdir -p "$LOG_DIR"
# Every kubectl command → redirect to file
kubectl <command> > "$LOG_DIR/<name>.log" 2>&1 && echo "OK" || echo "FAIL"
# Analyze in subagent: Task(subagent_type='Explore') to read log files
# Use subagents for BOTH failure analysis AND verifying expected behavior
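The redirect pattern above can be wrapped in a small function so each call stays short. This is a minimal sketch, not part of the original skill: run_logged is a hypothetical name, and it assumes LOG_DIR is set as shown above (it falls back to the same default path if not).

```shell
# Hypothetical helper (an illustration, not an existing project script):
# run any command, send stdout and stderr to <log dir>/<name>.log,
# and print only OK or FAIL plus the log path.
run_logged() {
  local name="$1" dir="${LOG_DIR:-/tmp/kagenti/k8s/${CLUSTER:-local}}"
  shift
  mkdir -p "$dir"
  if "$@" > "$dir/$name.log" 2>&1; then
    echo "OK $dir/$name.log"
  else
    echo "FAIL $dir/$name.log"
    return 1
  fi
}
```

For example, run_logged pods kubectl get pods -n kagenti-system writes the full output to $LOG_DIR/pods.log and prints a single status line.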
Table of Contents
- Overview
- Prerequisites
- Workflow
- Common Debugging Scenarios
- Environment Variable Quick Reference
- Useful One-Liners
- After Debugging
Overview
When tests fail on a deployed cluster, use this workflow to:
- Diagnose the root cause
- Make targeted fixes
- Verify the fix without full redeployment
Prerequisites
# Set the kubeconfig for your cluster
export KUBECONFIG=~/clusters/hcp/kagenti-hypershift-custom-<suffix>/auth/kubeconfig
# Verify connection
kubectl get nodes
Workflow
1. Check Test Results
# View test results XML
cat test-results/e2e-results.xml
# Or re-run failing test with verbose output
pytest kagenti/tests/e2e/common/test_mlflow_traces.py -v -s
2. Check Pod Status
# Get all pods in relevant namespace
kubectl get pods -n kagenti-system
# Check specific component
kubectl get pods -n kagenti-system -l app=otel-collector
# Describe problematic pod
kubectl describe pod -n kagenti-system <pod-name>
3. Check Logs
# Get recent logs
kubectl logs -n kagenti-system deployment/otel-collector --tail=100
# Stream logs in real-time
kubectl logs -n kagenti-system deployment/otel-collector -f
# Filter for errors
kubectl logs -n kagenti-system deployment/otel-collector --tail=200 | grep -iE "(error|fail|403|401)"
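When the logs have already been saved to a file, a per-keyword count is often enough to decide whether the file is worth reading in full. The sketch below assumes a saved log file; summarize_errors is a hypothetical name, and the keyword list simply mirrors the grep filter above.

```shell
# Hypothetical helper: count matching lines per keyword in a saved log file,
# so only a short summary is printed instead of the full log.
# Matching is case-insensitive and substring-based, like the grep above,
# so "401" also matches lines where those digits appear in a timestamp or ID.
summarize_errors() {
  local file="$1" pattern
  for pattern in error fail 401 403; do
    printf '%s=%s\n' "$pattern" "$(grep -ci -- "$pattern" "$file")"
  done
}
```

A typical call would be summarize_errors "$LOG_DIR/otel-collector.log" after redirecting kubectl logs into that file.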
4. Check Configuration
# View ConfigMap contents
kubectl get configmap otel-collector-config -n kagenti-system -o yaml
# Check Secret contents (decoded)
kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_ID}' | base64 -d
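# Save the user-supplied Helm values
# (-o yaml omits the "USER-SUPPLIED VALUES:" header so the file can be reused with -f)
helm get values kagenti-deps -n kagenti-system -o yaml > /tmp/kagenti-deps-values.yaml
cat /tmp/kagenti-deps-values.yaml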
5. Check Authorization
# View AuthorizationPolicy
kubectl get authorizationpolicy -n kagenti-system -o yaml
# Check waypoint proxy
kubectl get gateway -n kagenti-system
# Check service labels
kubectl get svc mlflow -n kagenti-system -o yaml | grep -A5 labels
6. Make Chart Changes
# Edit the chart template
vim charts/kagenti-deps/templates/otel-collector.yaml
# Apply the change, reusing the values file saved in step 4
helm upgrade kagenti-deps charts/kagenti-deps -n kagenti-system \
  -f /tmp/kagenti-deps-values.yaml
7. Restart Affected Pods
# Rollout restart to pick up ConfigMap changes
kubectl rollout restart deployment/otel-collector -n kagenti-system
# Wait for rollout to complete
kubectl rollout status deployment/otel-collector -n kagenti-system --timeout=60s
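The restart and the wait can be chained so that a failed restart is not followed by a misleading status check. The following is a sketch under assumptions: restart_and_wait is a hypothetical name, the namespace defaults to kagenti-system, and output goes to LOG_DIR (or /tmp if unset).

```shell
# Hypothetical helper: restart a deployment, wait for the rollout to finish,
# and keep the full kubectl output in a log file.
# Usage: restart_and_wait <deployment> [namespace] [timeout]
restart_and_wait() {
  local deploy="$1" ns="${2:-kagenti-system}" timeout="${3:-60s}"
  local log="${LOG_DIR:-/tmp}/rollout-$deploy.log"
  # The status check runs only if the restart itself succeeded.
  if kubectl rollout restart "deployment/$deploy" -n "$ns" > "$log" 2>&1 &&
     kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout" >> "$log" 2>&1; then
    echo "OK $log"
  else
    echo "FAIL $log"
    return 1
  fi
}
```

For example, restart_and_wait otel-collector performs both commands from this step and prints one status line.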
8. Generate Test Data
# Get route to weather service
ROUTE_HOST=$(kubectl get route weather-service -n team1 -o jsonpath='{.spec.host}')
# Send test request
curl -sk -X POST "https://$ROUTE_HOST/" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"message/send","params":{"message":{"messageId":"test-123","parts":[{"kind":"text","text":"What is the weather?"}],"role":"user"}}}'
9. Verify Fix
# Check logs after test request
kubectl logs -n kagenti-system deployment/otel-collector --tail=50
# Run the specific failing test
pytest kagenti/tests/e2e/common/test_mlflow_traces.py::test_mlflow_has_traces -v
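If the effect of a fix takes a moment to become visible after the test request (an assumption; the source does not say how quickly results appear), the verification can be polled instead of re-run by hand. This is a minimal sketch; retry_until is a hypothetical name and the attempt count and delay are illustrative.

```shell
# Hypothetical helper: re-run a check until it succeeds or attempts run out.
# Usage: retry_until <attempts> <delay-seconds> <command...>
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "OK after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Sleep only between attempts, not after the last one.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "FAIL after $attempts attempt(s)"
  return 1
}
```

The checked command should redirect its own output, for example: retry_until 6 10 sh -c 'pytest kagenti/tests/e2e/common/test_mlflow_traces.py::test_mlflow_has_traces -q > /tmp/verify.log 2>&1'.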
Common Debugging Scenarios
OAuth/Authentication Issues
# Check if OAuth extension started
kubectl logs -n kagenti-system deployment/otel-collector | grep oauth2client
# Test token acquisition
KEYCLOAK_HOST=$(kubectl get route keycloak -n keycloak -o jsonpath='{.spec.host}')
CLIENT_ID=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_ID}' | base64 -d)
CLIENT_SECRET=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_SECRET}' | base64 -d)
curl -sk -X POST "https://$KEYCLOAK_HOST/realms/master/protocol/openid-connect/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=$CLIENT_ID" \
  -d "client_secret=$CLIENT_SECRET"
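If the token request succeeds, looking at the token's claims can help confirm that the expected client was used. The sketch below assumes the access_token in the response is a JWT (three dot-separated base64url segments); jwt_payload is a hypothetical name, and the function only decodes the payload without verifying the signature.

```shell
# Hypothetical helper: print the (unverified) payload of a JWT.
# It only base64url-decodes the middle segment; it does not check the signature.
jwt_payload() {
  local payload="${1#*.}"      # drop the header segment
  payload="${payload%%.*}"     # drop the signature segment
  # base64url -> base64: map the URL-safe characters back
  payload=$(printf '%s' "$payload" | tr '_-' '/+')
  # restore the padding that base64url encoding strips
  case $(( ${#payload} % 4 )) in
    2) payload="$payload==" ;;
    3) payload="$payload=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}
```

With jq available (it is already used elsewhere in this workflow), the token can be extracted from the curl response with jq -r '.access_token' and passed to jwt_payload. Avoid writing decoded tokens or secrets into shared log files.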
Mesh/Istio Issues
# Check istiod logs for authorization warnings
kubectl logs -n istio-system deployment/istiod --tail=100 | grep -i authorization
# Check if pods are in ambient mode (look for the ambient.istio.io/redirection: enabled annotation)
kubectl get pod -n kagenti-system -l app=otel-collector -o jsonpath='{.items[0].metadata.annotations}'
# Verify trust domain
kubectl get configmap istio -n istio-system -o jsonpath='{.data.mesh}' | grep trustDomain
Trace Export Issues
# Add debug exporter to pipeline (in otel-collector.yaml)
# exporters: [ debug, otlphttp/mlflow ]
# Check debug output for traces
kubectl logs -n kagenti-system deployment/otel-collector | grep "Span #"
# Check for export errors
kubectl logs -n kagenti-system deployment/otel-collector | grep -iE "drop|error|fail"
Environment Variable Quick Reference
# Weather service check
kubectl get pod -n team1 -l app=weather-service -o jsonpath='{.items[0].spec.containers[0].env}' | jq
# OTEL collector environment
kubectl get pod -n kagenti-system -l app=otel-collector -o jsonpath='{.items[0].spec.containers[0].env}' | jq
Useful One-Liners
# Get all routes
kubectl get routes -A
# Check all deployments ready
kubectl get deployments -n kagenti-system
# Watch pod status
watch kubectl get pods -n kagenti-system
# Quick port-forward for testing
kubectl port-forward -n kagenti-system svc/mlflow 5000:5000
After Debugging
Once the fix is verified:
- Run full test suite:
pytest kagenti/tests/e2e/ -v
- Commit changes:
git add -A && git commit -m "fix: <description>"
- Document findings: Update relevant skills or CLAUDE.md
Related Skills
- tdd:hypershift
- testing:kubectl-debugging