rcahypershift

star 244

Root cause analysis with live cluster - full access to pods, logs, secrets, configs for deep investigation

kagenti By kagenti schedule Updated 2/25/2026

name: rca:hypershift description: Root cause analysis with live cluster - full access to pods, logs, secrets, configs for deep investigation

RCA-HyperShift: Root Cause Analysis with Live Cluster

flowchart TD
    START(["/rca:hypershift"]) --> P1["Phase 1: Observe"]:::rca
    P1 --> P2["Phase 2: Inspect"]:::rca
    P2 --> P3["Phase 3: Reproduce"]:::rca
    P3 --> P4["Phase 4: Trace"]:::rca
    P4 --> ROOT{"Root cause found?"}
    ROOT -->|Yes| P5["Phase 5: Document"]:::rca
    ROOT -->|No| P2
    P5 --> TDD["tdd:hypershift"]:::tdd

    classDef rca fill:#FF5722,stroke:#333,color:white
    classDef tdd fill:#4CAF50,stroke:#333,color:white

Follow this diagram as the workflow.

Systematic root cause analysis with full cluster access for deep investigation.

Context-Safe Execution (MANDATORY)

RCA with cluster access generates massive context pollution from kubectl describe, logs, configmap dumps, and secret inspection. ALL output MUST go to files.

# Session-scoped log directory
export LOG_DIR=/tmp/kagenti/rca/${WORKTREE:-$CLUSTER}
mkdir -p $LOG_DIR

Rules:

  1. ALL kubectl/oc commands redirect to $LOG_DIR/<name>.log
  2. ALL analysis happens in subagents: Task(subagent_type='Explore')
  3. Main context only sees: OK/FAIL status and subagent summaries
  4. Use subagents for verification too — "check if traces appear in $LOG_DIR/otel.log"

rca:hypershift vs rca:ci

Aspect rca:hypershift rca:ci
Access Full cluster (pods, logs, secrets, configs) CI logs only
Debugging Real-time with k8s:* skills Static log analysis
State Can inspect current state Historical artifacts only
When Have cluster, need deep investigation No cluster available

When to Use

  • rca:ci was inconclusive
  • Need to inspect live pod state, secrets, or configs
  • Want to reproduce failure with debugging enabled
  • Complex multi-component issues

Prerequisites

Auto-approved: All read operations on hosted clusters are auto-approved. Run each command separately for auto-approve to work.

Set cluster context:

export CLUSTER=<suffix> MANAGED_BY_TAG=${MANAGED_BY_TAG:-kagenti-hypershift-custom}
export KUBECONFIG=~/clusters/hcp/$MANAGED_BY_TAG-$CLUSTER/auth/kubeconfig

Verify connection (small output, OK inline):

kubectl get nodes

RCA Workflow

┌─────────────────────────────────────────────────────────────────┐
│  1. OBSERVE                                                     │
│     ├─ Check pod status                                         │
│     ├─ Get recent events                                        │
│     └─ Review current logs                                      │
├─────────────────────────────────────────────────────────────────┤
│  2. INSPECT                                                     │
│     ├─ Examine failing component                                │
│     ├─ Check configs and secrets                                │
│     └─ Verify connectivity                                      │
├─────────────────────────────────────────────────────────────────┤
│  3. REPRODUCE                                                   │
│     ├─ Run failing test with verbose output                     │
│     ├─ Watch logs in real-time                                  │
│     └─ Capture exact failure                                    │
├─────────────────────────────────────────────────────────────────┤
│  4. TRACE                                                       │
│     ├─ Follow request through components                        │
│     ├─ Identify where failure occurs                            │
│     └─ Determine root cause                                     │
├─────────────────────────────────────────────────────────────────┤
│  5. DOCUMENT                                                    │
│     ├─ Root cause with evidence                                 │
│     ├─ Reproduction steps                                       │
│     └─ Fix and verification plan                                │
└─────────────────────────────────────────────────────────────────┘

Phase 1: Observe Current State

Pod Status

Check all pods:

kubectl get pods -A

Find problem pods:

kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

Check kagenti-system namespace:

kubectl get pods -n kagenti-system

Check team1 namespace:

kubectl get pods -n team1

Recent Events

Cluster-wide events:

kubectl get events -A --sort-by='.lastTimestamp' | tail -30

Namespace events:

kubectl get events -n kagenti-system --sort-by='.lastTimestamp'

Current Logs

OTEL Collector logs:

kubectl logs -n kagenti-system deployment/otel-collector --tail=100

MLflow logs:

kubectl logs -n kagenti-system deployment/mlflow --tail=100

Agent logs:

kubectl logs -n team1 deployment/weather-service --tail=100

Filter for errors:

kubectl logs -n kagenti-system deployment/otel-collector --tail=500 | grep -iE "error|fail|warn"

Phase 2: Inspect Components

Check Pod Details

Describe failing pod:

kubectl describe pod <pod-name> -n <namespace>

Check container status:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

Examine Configuration

List ConfigMaps:

kubectl get configmap -n kagenti-system

View specific ConfigMap:

kubectl get configmap otel-collector-config -n kagenti-system -o yaml

List Secrets (check existence, not values):

kubectl get secrets -n kagenti-system

Check secret keys:

kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data}' | jq 'keys'

Decode specific secret value:

kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.<key>}' | base64 -d

Verify Connectivity

Service endpoints:

kubectl get endpoints -n kagenti-system

Test internal connectivity:

kubectl run -it --rm debug --image=curlimages/curl -- \
  curl -v http://mlflow.kagenti-system.svc.cluster.local:5000/health

Check routes (OpenShift):

kubectl get routes -A

Phase 3: Reproduce with Debugging

Run Failing Test

Set environment variables:

export CLUSTER=<suffix> WORKTREE=<worktree> MANAGED_BY_TAG=${MANAGED_BY_TAG:-kagenti-hypershift-custom}

Run specific test with verbose output:

KUBECONFIG=~/clusters/hcp/$MANAGED_BY_TAG-$CLUSTER/auth/kubeconfig \
  .worktrees/$WORKTREE/.github/scripts/local-setup/hypershift-full-test.sh $CLUSTER \
  --include-test --pytest-filter "<test_name>" --pytest-args "-v -s"

Watch Logs in Real-Time

Watch component logs (in separate terminal):

kubectl logs -f -n kagenti-system deployment/otel-collector

Or use stern for multiple pods:

stern -n kagenti-system .

Phase 4: Trace the Failure

Request Flow Analysis

For a typical agent request:

Client → Gateway → Agent → Tool → Agent → Gateway → Client
                     ↓
              OTEL Collector → MLflow

Check each hop:

  1. Did the request reach the gateway?
  2. Did the agent receive it?
  3. Did the tool respond?
  4. Were traces exported?
  5. Did MLflow receive them?

Component-Specific Checks

OTEL Collector:

kubectl logs -n kagenti-system deployment/otel-collector | grep -i "span\|trace\|export"

MLflow:

kubectl logs -n kagenti-system deployment/mlflow | grep -i "trace\|experiment\|error"

Agent:

kubectl logs -n team1 deployment/weather-service | grep -i "request\|response\|error"

Auth Flow Analysis

Get Keycloak host:

KEYCLOAK_HOST=$(kubectl get route keycloak -n keycloak -o jsonpath='{.spec.host}')

Get OAuth client credentials:

CLIENT_ID=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_ID}' | base64 -d)
CLIENT_SECRET=$(kubectl get secret mlflow-oauth-secret -n kagenti-system -o jsonpath='{.data.OIDC_CLIENT_SECRET}' | base64 -d)

Test OAuth token exchange:

curl -sk -X POST "https://$KEYCLOAK_HOST/realms/master/protocol/openid-connect/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=$CLIENT_ID" \
  -d "client_secret=$CLIENT_SECRET"

Phase 5: Document Findings

## Root Cause Analysis

**Failure**: [Test name / description]
**Cluster**: [cluster suffix]
**Date**: [timestamp]

### Symptoms
- [What failed]
- [Error messages]

### Investigation Steps
1. [Step taken] → [Finding]
2. [Step taken] → [Finding]

### Root Cause
[Clear statement with evidence]

### Evidence

[Relevant log lines or command output]


### Fix
[Proposed solution]

### Verification
[How to confirm fix works]

CVE Check Before Publishing Findings

Before posting RCA findings to any public destination (issue comment, PR comment, etc.):

If the root cause involves a dependency bug, unexpected behavior, or version issue:

  1. Invoke cve:scan to check if this is a known CVE
  2. If a CVE is found → invoke cve:brainstorm BEFORE documenting findings publicly
  3. Rewrite RCA documentation to use neutral language (no CVE IDs, no vulnerability descriptions)
  4. Report the CVE through proper channels (see cve:brainstorm)

Quick Reference

Task Command
Pod status kubectl get pods -A
Pod logs kubectl logs -n <ns> deployment/<name>
Describe pod kubectl describe pod <name> -n <ns>
Get secret kubectl get secret <name> -n <ns> -o yaml
Events kubectl get events -A --sort-by='.lastTimestamp'
ConfigMap kubectl get configmap <name> -n <ns> -o yaml

Related Skills

  • rca:ci - RCA from CI logs only
  • tdd:hypershift - Fix iteration after RCA
  • k8s:pods - Pod debugging
  • k8s:logs - Log analysis
  • k8s:health - Platform health check
  • k8s:live-debugging - Iterative debugging
  • auth:mlflow-oidc-auth - MLflow OIDC auth debugging
  • openshift:debug - Debug OpenShift-specific operators, SCCs, builds
  • openshift:routes - Debug route/ingress issues
  • cve:scan - CVE scanning (check if root cause is a known CVE)
  • cve:brainstorm - Disclosure planning (if CVE found during RCA)
Install via CLI
npx skills add https://github.com/kagenti/kagenti --skill rcahypershift
Repository Details
star Stars 244
call_split Forks 89
navigation Branch main
article Path SKILL.md
More from Creator