rcahypershift - SKILL.md Agent Skill

name: rca:hypershift description: Root cause analysis with live cluster - full access to pods, logs, secrets, configs for deep investigation

RCA-HyperShift: Root Cause Analysis with Live Cluster

flowchart TD
    START(["/rca:hypershift"]) --> P1["Phase 1: Observe"]:::rca
    P1 --> P2["Phase 2: Inspect"]:::rca
    P2 --> P3["Phase 3: Reproduce"]:::rca
    P3 --> P4["Phase 4: Trace"]:::rca
    P4 --> ROOT{"Root cause found?"}
    ROOT -->|Yes| P5["Phase 5: Document"]:::rca
    ROOT -->|No| P2
    P5 --> TDD["tdd:hypershift"]:::tdd

    classDef rca fill:#FF5722,stroke:#333,color:white
    classDef tdd fill:#4CAF50,stroke:#333,color:white

Follow this diagram as the workflow.

Systematic root cause analysis with full cluster access for deep investigation.

Context-Safe Execution (MANDATORY)

RCA with cluster access generates massive context pollution from kubectl describe, logs, configmap dumps, and secret inspection. ALL output MUST go to files.

# Session-scoped log directory
export LOG_DIR=/tmp/kagenti/rca/${WORKTREE:-$CLUSTER}
mkdir -p $LOG_DIR

Rules:

ALL kubectl/oc commands redirect to $LOG_DIR/<name>.log
ALL analysis happens in subagents: Task(subagent_type='Explore')
Main context only sees: OK/FAIL status and subagent summaries
Use subagents for verification too — "check if traces appear in $LOG_DIR/otel.log"

rca:hypershift vs rca:ci

Aspect	`rca:hypershift`	`rca:ci`
Access	Full cluster (pods, logs, secrets, configs)	CI logs only
Debugging	Real-time with `k8s:*` skills	Static log analysis
State	Can inspect current state	Historical artifacts only
When	Have cluster, need deep investigation	No cluster available

When to Use

rca:ci was inconclusive
Need to inspect live pod state, secrets, or configs
Want to reproduce failure with debugging enabled
Complex multi-component issues

Prerequisites

Auto-approved: All read operations on hosted clusters are auto-approved. Run each command separately for auto-approve to work.

Set cluster context:

export CLUSTER=<suffix> MANAGED_BY_TAG=${MANAGED_BY_TAG:-kagenti-hypershift-custom}
export KUBECONFIG=~/clusters/hcp/$MANAGED_BY_TAG-$CLUSTER/auth/kubeconfig

Verify connection (small output, OK inline):

kubectl get nodes

RCA Workflow

┌─────────────────────────────────────────────────────────────────┐
│  1. OBSERVE                                                     │
│     ├─ Check pod status                                         │
│     ├─ Get recent events                                        │
│     └─ Review current logs                                      │
├─────────────────────────────────────────────────────────────────┤
│  2. INSPECT                                                     │
│     ├─ Examine failing component                                │
│     ├─ Check configs and secrets                                │
│     └─ Verify connectivity                                      │
├─────────────────────────────────────────────────────────────────┤
│  3. REPRODUCE                                                   │
│     ├─ Run failing test with verbose output                     │
│     ├─ Watch logs in real-time                                  │
│     └─ Capture exact failure                                    │
├─────────────────────────────────────────────────────────────────┤
│  4. TRACE                                                       │
│     ├─ Follow request through components                        │
│     ├─ Identify where failure occurs                            │
│     └─ Determine root cause                                     │
├─────────────────────────────────────────────────────────────────┤
│  5. DOCUMENT                                                    │
│     ├─ Root cause with evidence                                 │
│     ├─ Reproduction steps                                       │
│     └─ Fix and verification plan                                │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Observe Current State

Pod Status

Check all pods:

kubectl get pods -A

Find problem pods:

kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

Check kagenti-system namespace:

kubectl get pods -n kagenti-system

Check team1 namespace:

kubectl get pods -n team1

Recent Events

Cluster-wide events:

kubectl get events -A --sort-by='.lastTimestamp' | tail -30

Namespace events:

kubectl get events -n kagenti-system --sort-by='.lastTimestamp'

Current Logs

OTEL Collector logs:

kubectl logs -n kagenti-system deployment/otel-collector --tail=100

MLflow logs:

kubectl logs -n kagenti-system deployment/mlflow --tail=100

Agent logs:

kubectl logs -n team1 deployment/weather-service --tail=100

Filter for errors:

kubectl logs -n kagenti-system deployment/otel-collector --tail=500 | grep -iE "error|fail|warn"

Phase 2: Inspect Components

Check Pod Details

Describe failing pod:

kubectl describe pod <pod-name> -n <namespace>

Check container status:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

Examine Configuration

List ConfigMaps:

kubectl get configmap -n kagenti-system

View specific ConfigMap:

kubectl get configmap otel-collector-config -n kagenti-system -o yaml

List Secrets (check existence, not values):

kubectl get secrets -n kagenti-system

Check secret keys:

kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data}' | jq 'keys'

Decode specific secret value:

kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.<key>}' | base64 -d

Verify Connectivity

Service endpoints:

kubectl get endpoints -n kagenti-system

Test internal connectivity:

kubectl run -it --rm debug --image=curlimages/curl -- \
  curl -v http://mlflow.kagenti-system.svc.cluster.local:5000/health

Check routes (OpenShift):

kubectl get routes -A

Phase 3: Reproduce with Debugging

Run Failing Test

Set environment variables:

export CLUSTER=<suffix> WORKTREE=<worktree> MANAGED_BY_TAG=${MANAGED_BY_TAG:-kagenti-hypershift-custom}

Run specific test with verbose output:

KUBECONFIG=~/clusters/hcp/$MANAGED_BY_TAG-$CLUSTER/auth/kubeconfig \
  .worktrees/$WORKTREE/.github/scripts/local-setup/hypershift-full-test.sh $CLUSTER \
  --include-test --pytest-filter "<test_name>" --pytest-args "-v -s"

Watch Logs in Real-Time

Watch component logs (in separate terminal):

kubectl logs -f -n kagenti-system deployment/otel-collector

Or use stern for multiple pods:

stern -n kagenti-system .

Phase 4: Trace the Failure

Request Flow Analysis

For a typical agent request:

Client → Gateway → Agent → Tool → Agent → Gateway → Client
                     ↓
              OTEL Collector → MLflow

Check each hop:

Did the request reach the gateway?
Did the agent receive it?
Did the tool respond?
Were traces exported?
Did MLflow receive them?

Component-Specific Checks

OTEL Collector:

kubectl logs -n kagenti-system deployment/otel-collector | grep -i "span\|trace\|export"

MLflow:

kubectl logs -n kagenti-system deployment/mlflow | grep -i "trace\|experiment\|error"

Agent:

kubectl logs -n team1 deployment/weather-service | grep -i "request\|response\|error"

Auth Flow Analysis

Get Keycloak host:

KEYCLOAK_HOST=$(kubectl get route keycloak -n keycloak -o jsonpath='{.spec.host}')

Get OAuth client credentials:

CLIENT_ID=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_ID}' | base64 -d)
CLIENT_SECRET=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_SECRET}' | base64 -d)

Test OAuth token exchange:

curl -sk -X POST "https://$KEYCLOAK_HOST/realms/master/protocol/openid-connect/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=$CLIENT_ID" \
  -d "client_secret=$CLIENT_SECRET"

Phase 5: Document Findings

## Root Cause Analysis

**Failure**: [Test name / description]
**Cluster**: [cluster suffix]
**Date**: [timestamp]

### Symptoms
- [What failed]
- [Error messages]

### Investigation Steps
1. [Step taken] → [Finding]
2. [Step taken] → [Finding]

### Root Cause
[Clear statement with evidence]

### Evidence

[Relevant log lines or command output]


### Fix
[Proposed solution]

### Verification
[How to confirm fix works]

CVE Check Before Publishing Findings

Before posting RCA findings to any public destination (issue comment, PR comment, etc.):

If the root cause involves a dependency bug, unexpected behavior, or version issue:

Invoke cve:scan to check if this is a known CVE
If a CVE is found → invoke cve:brainstorm BEFORE documenting findings publicly
Rewrite RCA documentation to use neutral language (no CVE IDs, no vulnerability descriptions)
Report the CVE through proper channels (see cve:brainstorm)

Quick Reference

Task	Command
Pod status	`kubectl get pods -A`
Pod logs	`kubectl logs -n <ns> deployment/<name>`
Describe pod	`kubectl describe pod <name> -n <ns>`
Get secret	`kubectl get secret <name> -n <ns> -o yaml`
Events	`kubectl get events -A --sort-by='.lastTimestamp'`
ConfigMap	`kubectl get configmap <name> -n <ns> -o yaml`

Related Skills

rca:ci - RCA from CI logs only
tdd:hypershift - Fix iteration after RCA
k8s:pods - Pod debugging
k8s:logs - Log analysis
k8s:health - Platform health check
k8s:live-debugging - Iterative debugging
auth:mlflow-oidc-auth - MLflow OIDC auth debugging
openshift:debug - Debug OpenShift-specific operators, SCCs, builds
openshift:routes - Debug route/ingress issues
cve:scan - CVE scanning (check if root cause is a known CVE)
cve:brainstorm - Disclosure planning (if CVE found during RCA)