name: rca:hypershift
description: Root cause analysis with live cluster - full access to pods, logs, secrets, configs for deep investigation
RCA-HyperShift: Root Cause Analysis with Live Cluster
flowchart TD
START(["/rca:hypershift"]) --> P1["Phase 1: Observe"]:::rca
P1 --> P2["Phase 2: Inspect"]:::rca
P2 --> P3["Phase 3: Reproduce"]:::rca
P3 --> P4["Phase 4: Trace"]:::rca
P4 --> ROOT{"Root cause found?"}
ROOT -->|Yes| P5["Phase 5: Document"]:::rca
ROOT -->|No| P2
P5 --> TDD["tdd:hypershift"]:::tdd
classDef rca fill:#FF5722,stroke:#333,color:white
classDef tdd fill:#4CAF50,stroke:#333,color:white
Follow this diagram as the workflow.
Systematic root cause analysis with full cluster access for deep investigation.
Context-Safe Execution (MANDATORY)
RCA with cluster access generates massive context pollution from kubectl describe, logs, configmap dumps, and secret inspection. ALL output MUST go to files.
# Session-scoped log directory
export LOG_DIR="/tmp/kagenti/rca/${WORKTREE:-$CLUSTER}"
mkdir -p "$LOG_DIR"
Rules:
- ALL kubectl/oc commands redirect to $LOG_DIR/<name>.log
- ALL analysis happens in subagents: Task(subagent_type='Explore')
- Main context only sees: OK/FAIL status and subagent summaries
- Use subagents for verification too, e.g. "check if traces appear in $LOG_DIR/otel.log"
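The rules above can be sketched as a small wrapper. This is a minimal illustration, not part of the skill itself: the function name `run_to_log` and the `default` fallback directory are assumptions made for the example.

```shell
# Hypothetical helper: run any command, send stdout and stderr to
# $LOG_DIR/<name>.log, and print only a one-line OK/FAIL status.
run_to_log() {
  local name="$1"; shift
  # Fall back to the session-scoped directory; "default" is an assumed last resort.
  local log_dir="${LOG_DIR:-/tmp/kagenti/rca/${WORKTREE:-${CLUSTER:-default}}}"
  mkdir -p "$log_dir"
  if "$@" >"$log_dir/$name.log" 2>&1; then
    echo "OK $name -> $log_dir/$name.log"
  else
    echo "FAIL $name -> $log_dir/$name.log"
    return 1
  fi
}

# Usage:
#   run_to_log pods-all kubectl get pods -A
#   run_to_log otel kubectl logs -n kagenti-system deployment/otel-collector --tail=100
```

Only the status line reaches the main context; a subagent can then read the log file named in that line.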
rca:hypershift vs rca:ci
| Aspect | rca:hypershift | rca:ci |
|---|---|---|
| Access | Full cluster (pods, logs, secrets, configs) | CI logs only |
| Debugging | Real-time with k8s:* skills | Static log analysis |
| State | Can inspect current state | Historical artifacts only |
| When | Have cluster, need deep investigation | No cluster available |
When to Use
- rca:ci was inconclusive
- Need to inspect live pod state, secrets, or configs
- Want to reproduce failure with debugging enabled
- Complex multi-component issues
Prerequisites
Auto-approved: All read operations on hosted clusters are auto-approved. Run each command separately for auto-approve to work.
Set cluster context:
export CLUSTER=<suffix> MANAGED_BY_TAG=${MANAGED_BY_TAG:-kagenti-hypershift-custom}
export KUBECONFIG=~/clusters/hcp/$MANAGED_BY_TAG-$CLUSTER/auth/kubeconfig
Verify connection (small output, OK inline):
kubectl get nodes
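Before the first kubectl call, it can help to confirm that the kubeconfig file actually exists, since a wrong suffix otherwise surfaces later as a confusing connection error. A minimal sketch, assuming the path layout shown above; the function name `hcp_kubeconfig` is hypothetical.

```shell
# Hypothetical helper: print the kubeconfig path for a hosted cluster,
# or fail early with a message on stderr if the file is missing.
hcp_kubeconfig() {
  local cluster="$1"
  local tag="${MANAGED_BY_TAG:-kagenti-hypershift-custom}"
  local kubeconfig="$HOME/clusters/hcp/$tag-$cluster/auth/kubeconfig"
  if [ -f "$kubeconfig" ]; then
    echo "$kubeconfig"
  else
    echo "missing kubeconfig: $kubeconfig" >&2
    return 1
  fi
}

# Usage (assign first so a failure is not masked by export's exit status):
#   KUBECONFIG="$(hcp_kubeconfig "$CLUSTER")" && export KUBECONFIG
```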
RCA Workflow
┌─────────────────────────────────────────────────────────────────┐
│ 1. OBSERVE │
│ ├─ Check pod status │
│ ├─ Get recent events │
│ └─ Review current logs │
├─────────────────────────────────────────────────────────────────┤
│ 2. INSPECT │
│ ├─ Examine failing component │
│ ├─ Check configs and secrets │
│ └─ Verify connectivity │
├─────────────────────────────────────────────────────────────────┤
│ 3. REPRODUCE │
│ ├─ Run failing test with verbose output │
│ ├─ Watch logs in real-time │
│ └─ Capture exact failure │
├─────────────────────────────────────────────────────────────────┤
│ 4. TRACE │
│ ├─ Follow request through components │
│ ├─ Identify where failure occurs │
│ └─ Determine root cause │
├─────────────────────────────────────────────────────────────────┤
│ 5. DOCUMENT │
│ ├─ Root cause with evidence │
│ ├─ Reproduction steps │
│ └─ Fix and verification plan │
└─────────────────────────────────────────────────────────────────┘
Phase 1: Observe Current State
Pod Status
Check all pods:
kubectl get pods -A
Find problem pods:
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
Check kagenti-system namespace:
kubectl get pods -n kagenti-system
Check team1 namespace:
kubectl get pods -n team1
Recent Events
Cluster-wide events:
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
Namespace events:
kubectl get events -n kagenti-system --sort-by='.lastTimestamp'
Current Logs
OTEL Collector logs:
kubectl logs -n kagenti-system deployment/otel-collector --tail=100
MLflow logs:
kubectl logs -n kagenti-system deployment/mlflow --tail=100
Agent logs:
kubectl logs -n team1 deployment/weather-service --tail=100
Filter for errors:
kubectl logs -n kagenti-system deployment/otel-collector --tail=500 | grep -iE "error|fail|warn"
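When the logs have already been saved under $LOG_DIR, a count of matching lines is often enough for the main context, with the matches themselves left for a subagent to read. A minimal sketch using the same pattern as the filter above; the function name `log_error_count` is an assumption.

```shell
# Hypothetical helper: count error-like lines in a saved log file
# instead of printing them.
log_error_count() {
  # $1: path to a log file, e.g. "$LOG_DIR/otel.log"
  # grep -c prints the number of matching lines; it exits non-zero
  # when the count is 0, which callers can use as a "clean" signal.
  grep -ciE "error|fail|warn" "$1"
}

# Usage:
#   log_error_count "$LOG_DIR/otel.log"
```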
Phase 2: Inspect Components
Check Pod Details
Describe failing pod:
kubectl describe pod <pod-name> -n <namespace>
Check container status:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'
Examine Configuration
List ConfigMaps:
kubectl get configmap -n kagenti-system
View specific ConfigMap:
kubectl get configmap otel-collector-config -n kagenti-system -o yaml
List Secrets (check existence, not values):
kubectl get secrets -n kagenti-system
Check secret keys:
kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data}' | jq 'keys'
Decode specific secret value:
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.<key>}' | base64 -d
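Often the question is only whether a secret key is populated, not what it contains. The sketch below reports the decoded byte length so the value itself never reaches the terminal or the context; the function name `decoded_len` is hypothetical.

```shell
# Hypothetical helper: read a base64-encoded value on stdin and print
# only the length of the decoded data in bytes.
decoded_len() {
  # tr strips the padding spaces some wc implementations emit.
  base64 -d | wc -c | tr -d ' '
}

# Usage:
#   kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.<key>}' | decoded_len
```

A result of 0 suggests the key is empty or missing.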
Verify Connectivity
Service endpoints:
kubectl get endpoints -n kagenti-system
Test internal connectivity:
kubectl run -it --rm debug --image=curlimages/curl -- \
curl -v http://mlflow.kagenti-system.svc.cluster.local:5000/health
Check routes (OpenShift):
kubectl get routes -A
Phase 3: Reproduce with Debugging
Run Failing Test
Set environment variables:
export CLUSTER=<suffix> WORKTREE=<worktree> MANAGED_BY_TAG=${MANAGED_BY_TAG:-kagenti-hypershift-custom}
Run specific test with verbose output:
KUBECONFIG=~/clusters/hcp/$MANAGED_BY_TAG-$CLUSTER/auth/kubeconfig \
.worktrees/$WORKTREE/.github/scripts/local-setup/hypershift-full-test.sh $CLUSTER \
--include-test --pytest-filter "<test_name>" --pytest-args "-v -s"
Watch Logs in Real-Time
Watch component logs (in separate terminal):
kubectl logs -f -n kagenti-system deployment/otel-collector
Or use stern for multiple pods:
stern -n kagenti-system .
Phase 4: Trace the Failure
Request Flow Analysis
For a typical agent request:
Client → Gateway → Agent → Tool → Agent → Gateway → Client
                     ↓
              OTEL Collector → MLflow
Check each hop:
- Did the request reach the gateway?
- Did the agent receive it?
- Did the tool respond?
- Were traces exported?
- Did MLflow receive them?
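The hop-by-hop check above can be sketched as a search for one identifier across saved logs, in request order. This assumes each hop's log was already written to a file and that the request carries some string (a request ID or trace ID) that every healthy hop logs; the function name and the log file names in the usage line are hypothetical.

```shell
# Hypothetical helper: given an identifier and log files listed in
# request-flow order, print the first log that does NOT mention it.
first_silent_hop() {
  local pattern="$1"; shift
  local log
  for log in "$@"; do
    # -F treats the identifier as a fixed string, not a regex.
    if ! grep -qF -- "$pattern" "$log" 2>/dev/null; then
      echo "$log"
      return 0
    fi
  done
  echo "none"
}

# Usage (file names are assumptions):
#   first_silent_hop "$TRACE_ID" "$LOG_DIR/gateway.log" "$LOG_DIR/agent.log" "$LOG_DIR/otel.log" "$LOG_DIR/mlflow.log"
```

The first silent hop is where to focus Phase 2 inspection on the next loop; `none` means every listed hop saw the request.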
Component-Specific Checks
OTEL Collector:
kubectl logs -n kagenti-system deployment/otel-collector | grep -i "span\|trace\|export"
MLflow:
kubectl logs -n kagenti-system deployment/mlflow | grep -i "trace\|experiment\|error"
Agent:
kubectl logs -n team1 deployment/weather-service | grep -i "request\|response\|error"
Auth Flow Analysis
Get Keycloak host:
KEYCLOAK_HOST=$(kubectl get route keycloak -n keycloak -o jsonpath='{.spec.host}')
Get OAuth client credentials:
CLIENT_ID=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_ID}' | base64 -d)
CLIENT_SECRET=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_SECRET}' | base64 -d)
Test OAuth token exchange:
curl -sk -X POST "https://$KEYCLOAK_HOST/realms/master/protocol/openid-connect/token" \
-d "grant_type=client_credentials" \
-d "client_id=$CLIENT_ID" \
-d "client_secret=$CLIENT_SECRET"
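Run as written, the token request prints the access token to stdout. A minimal sketch of a checker that reports only whether a token was issued, assuming the endpoint returns a JSON body containing an `access_token` field on success; the function name `token_status` is hypothetical.

```shell
# Hypothetical helper: read the token endpoint's response on stdin and
# report OK/FAIL without ever echoing the token itself.
token_status() {
  if grep -q '"access_token"'; then
    echo "OK token issued"
  else
    echo "FAIL no access_token in response"
    return 1
  fi
}

# Usage: pipe the curl token request shown above into the checker:
#   curl -sk -X POST "https://$KEYCLOAK_HOST/realms/master/protocol/openid-connect/token" \
#     -d "grant_type=client_credentials" \
#     -d "client_id=$CLIENT_ID" \
#     -d "client_secret=$CLIENT_SECRET" | token_status
```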
Phase 5: Document Findings
## Root Cause Analysis
**Failure**: [Test name / description]
**Cluster**: [cluster suffix]
**Date**: [timestamp]
### Symptoms
- [What failed]
- [Error messages]
### Investigation Steps
1. [Step taken] → [Finding]
2. [Step taken] → [Finding]
### Root Cause
[Clear statement with evidence]
### Evidence
[Relevant log lines or command output]
### Fix
[Proposed solution]
### Verification
[How to confirm fix works]
CVE Check Before Publishing Findings
Before posting RCA findings to any public destination (issue comment, PR comment, etc.):
If the root cause involves a dependency bug, unexpected behavior, or version issue:
- Invoke cve:scan to check if this is a known CVE
- If a CVE is found → invoke cve:brainstorm BEFORE documenting findings publicly
- Rewrite RCA documentation to use neutral language (no CVE IDs, no vulnerability descriptions)
- Report the CVE through proper channels (see cve:brainstorm)
Quick Reference
| Task | Command |
|---|---|
| Pod status | kubectl get pods -A |
| Pod logs | kubectl logs -n <ns> deployment/<name> |
| Describe pod | kubectl describe pod <name> -n <ns> |
| Get secret | kubectl get secret <name> -n <ns> -o yaml |
| Events | kubectl get events -A --sort-by='.lastTimestamp' |
| ConfigMap | kubectl get configmap <name> -n <ns> -o yaml |
Related Skills
- rca:ci - RCA from CI logs only
- tdd:hypershift - Fix iteration after RCA
- k8s:pods - Pod debugging
- k8s:logs - Log analysis
- k8s:health - Platform health check
- k8s:live-debugging - Iterative debugging
- auth:mlflow-oidc-auth - MLflow OIDC auth debugging
- openshift:debug - Debug OpenShift-specific operators, SCCs, builds
- openshift:routes - Debug route/ingress issues
- cve:scan - CVE scanning (check if root cause is a known CVE)
- cve:brainstorm - Disclosure planning (if CVE found during RCA)