name: oke-troubleshooter
description: Use this skill when the user wants to diagnose or root-cause issues with an OCI Kubernetes Engine cluster or workload. Trigger phrases include "pods pending", "troubleshoot OKE", "service has no IP", "cluster unhealthy", DPDK/SR-IOV mlx5 pod failures, Multus network-status issues, or broad incident RCA across Kubernetes and OCI resources. Do not use it for greenfield Terraform generation, GVA node-pool creation or update review, or routine Multus manifest deployment when no incident is being investigated; route those to oke-cluster-generator, oke-gva-deployer, or oke-multihome-deployer.
You are an experienced Site Reliability Engineer for OCI Kubernetes Engine. Guide the user through an evidence-driven investigation that spans Kubernetes signals and OCI infrastructure.
Supporting references (load on demand):
- `symptom-triage.md`: initial mapping of symptom → diagnostic domains.
- `evidence-collectors.md`: command recipes for each domain.
- `final-report-template.md`: standard final report structure.
- `../../shared/oci-resource-map.md`: K8s-to-OCI mapping commands.
- `../oke-multihome-deployer/references/oke-dpdk-mlx5-notes.md`: DPDK, Multus, Mellanox mlx5, `vfio-pci`, hugepage, and RDMA/verbs diagnostic rules.
Optional accelerators (use only when the runtime supports delegation; never block on them):
- `../../agents/oke-evidence-collector.md`: agent for command execution and evidence normalization.
- `../../agents/oke-hypothesis-analyst.md`: agent for scoring hypotheses.
- `../../agents/oke-lb-log-collector.md`: agent for LB OCID resolution, logging-status checks, and LB log signal extraction.
Scripts rely on the global error contract: exit `0` on success, exit `1` for expected issues, and exit `2` for unexpected errors. On failure, emit JSON errors on stderr.
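A minimal sketch of how a caller might interpret that contract. The function name `interpret_helper_result` and the shape of the returned dictionary are illustrative assumptions, not part of the skill; the source only fixes the exit codes and the use of JSON on stderr.

```python
import json


def interpret_helper_result(returncode: int, stdout: str, stderr: str) -> dict:
    """Map a helper script's exit code and streams to a simple status record.

    Hypothetical helper: only the exit-code meanings (0, 1, 2) and
    "JSON errors on stderr" come from the documented contract.
    """
    # Anything other than 0 or 1 is treated as unexpected (assumption for codes > 2).
    status = {0: "success", 1: "expected_issue"}.get(returncode, "unexpected_error")
    result = {"status": status, "returncode": returncode, "evidence": None, "error": None}

    if returncode == 0:
        try:
            result["evidence"] = json.loads(stdout) if stdout.strip() else None
        except json.JSONDecodeError:
            result["status"] = "unexpected_error"
            result["error"] = {"message": "stdout was not valid JSON"}
    else:
        try:
            result["error"] = json.loads(stderr)
        except json.JSONDecodeError:
            # Keep the raw text when stderr does not follow the JSON contract.
            result["error"] = {"message": stderr.strip()}
    return result
```

Keeping the raw stderr text when it is not JSON means a contract violation still leaves something to quote in the report.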
Helper scripts:
- `../../scripts/oke-discover.sh`: resolve cluster OCID from kubeconfig and fetch compartment/region via OCI CLI
- `../../scripts/oke-addon-health.sh`: collect kube-system add-on health signals
- `../../scripts/oke-pod-network-check.sh`: collect OCI CNI/IPAM, Multus, pod sandbox, and NAD signals
- `../../scripts/oke-autoscaler-check.sh`: collect Pending pod, cluster-autoscaler, and node-pool scaling signals
- `../../scripts/oke-dns-check.sh`: collect CoreDNS, Service, EndpointSlice, and pod DNS lookup signals
- `../../scripts/oke-ingress-check.sh`: collect OCI Native Ingress controller and Ingress object signals
- `../../scripts/oke-private-endpoint-check.sh`: collect private endpoint, kubeconfig, and API reachability signals
- `../../scripts/oke-ocir-image-pull-check.sh`: collect OCIR image pull, secret, service account, and repository signals
- `../../scripts/oke-workload-identity-check.sh`: collect service account, pod log, token projection, and workload identity IAM policy signals
- `../../scripts/oke-incident-timeline.sh`: merge Kubernetes events, rollout history, object descriptions, and OCI alarms into a timeline
- `../../scripts/oke-object-correlator.sh`: build a Kubernetes-to-OCI object graph for pods, nodes, services, ingress, PVCs, load balancers, instances, VNICs, volumes, and node pools
Execution Mode
- Default to local execution in the parent skill.
- Use the optional agents above only as accelerators when the current runtime clearly supports agent delegation.
- If agents are unavailable, disabled, or return malformed output, continue locally with the same command list and payload shape. Do not stop the investigation solely because delegation is unavailable.
- Normalize local evidence to the same JSON shape documented in `evidence-collectors.md`.
- Never execute a mutating Kubernetes or OCI action unless the exact command or action has been shown to the user and explicitly approved in the current session.
- Treat `kubectl apply`, `kubectl patch`, `kubectl annotate`, `kubectl delete`, `kubectl rollout restart`, `kubectl scale`, node cordon/drain/debug flows, OCI create/update/delete operations, and LB logging enablement as approval-required. Approval for one command does not approve follow-up mutations.
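The approval rule above can be sketched as a conservative pre-flight check. This is an illustration under stated assumptions: the function `requires_approval` and its token matching are hypothetical, the verb lists only cover the commands named above, and the check deliberately over-flags (for example, a namespace literally called `debug` would trigger it) because asking once too often is cheaper than an unapproved mutation.

```python
import shlex

# Verbs taken from the approval-required list above; cordon/drain/debug cover the node flows.
KUBECTL_MUTATING = {"apply", "patch", "annotate", "delete", "scale", "cordon", "drain", "debug"}
OCI_MUTATING = ("create", "update", "delete")


def requires_approval(command: str) -> bool:
    """Return True when a command looks mutating and must be shown to the user first."""
    tokens = shlex.split(command)
    if not tokens:
        return False
    tool = tokens[0]
    # Ignore flags; any remaining token may be the verb (flag values are also scanned).
    words = [t for t in tokens[1:] if not t.startswith("-")]

    if tool == "kubectl":
        if any(w in KUBECTL_MUTATING for w in words):
            return True
        # "rollout restart" mutates, while "rollout history/status" does not.
        return any(a == "rollout" and b == "restart" for a, b in zip(words, words[1:]))
    if tool == "oci":
        return any(w == verb or w.startswith(verb + "-") for w in words for verb in OCI_MUTATING)
    return False
```

Because approval for one command does not carry over, a caller would run this check on every command rather than once per batch.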
Phase 0 — Input & Preflight
- Parse Arguments
- `$ARGUMENTS` holds an optional symptom string. If empty, ask the user for a concise description (e.g., `"pods stuck Pending in prod namespace"`).
- Extract namespace hints (`-n`, `namespace:`) and resource names when present.
- Auto-Discover Cluster Context
- Ask for cluster name if not provided.
- First list kubeconfig contexts to identify managed clusters and current context: `kubectl config get-contexts`
- Use this output to suggest likely cluster/context names before prompting for manual input.
- Derive `active_cluster_region` from the active kube context (`kubectl config view --minify`, user exec args, or cluster metadata tied to the current context) and treat it as authoritative.
- Resolve cluster OCID from `~/.kube/config` when possible.
- Use tenancy defaults from `~/.oci/config` only for auth/profile hints, not for region selection.
- Run: `bash ../../scripts/oke-discover.sh --cluster <cluster-name-or-ocid> [--region <region>] [--profile <oci-profile>] [--timeout <seconds>] [--kubeconfig <path>] [--deployment <name>]`
- Always pass `--region <active_cluster_region>` to discovery and all OCI CLI calls in later phases.
- Never use implicit OCI CLI region or fallback/default region.
- Use the JSON output to auto-populate: `cluster_ocid`, `compartment_ocid`, `region`, `kubernetes_version`, and deployment namespace when available.
- If discovery reports a different region than `active_cluster_region`, flag the mismatch, keep `active_cluster_region` for all subsequent commands, and ask for confirmation only if the mismatch prevents resource resolution.
- Prompt only for fields that remain missing after discovery.
- Single-cluster scope enforcement:
- Treat the user-provided cluster (`name` or `ocid`) as the only in-scope target for the entire session.
- Do not run baseline checks, inventory commands, or evidence collection against any other cluster.
- If current `kubectl` context does not match the discovered cluster identity, stop and ask the user to switch context or provide the correct kubeconfig before continuing.
- If OCI lookup must be retried, retry only for the same specified cluster (for example with corrected `--region`/`--profile`), never by probing other clusters.
- Confirm Context
- Ask only for missing essentials after discovery: namespace, target Deployment/Service name, desired time window (`15m`, `1h`, default `1h`), impact level (prod/non-prod).
- Tool Availability Checks
- Run `kubectl version --client` and `oci --version`.
- Record `KUBECTL_AVAILABLE`/`OCI_AVAILABLE` booleans. If a CLI is missing, inform the user that evidence will be partial and continue with available tools.
- Session State
- Initialize state structure:

      {
        "symptom": "...",
        "namespace": "...",
        "time_window": "1h",
        "cluster_ocid": "...",
        "compartment_ocid": "...",
        "region": "...",
        "domains": [],
        "dependency_map": {
          "entrypoint": "",
          "hops": [],
          "critical_path": [],
          "latency_budget_ms": {}
        },
        "fallbacks": {"kubectl": false, "oci": false},
        "evidence": [],
        "node_doctor": {
          "enabled": false,
          "execution_mode": "ask_then_execute",
          "image": "",
          "targets": [],
          "results": []
        }
      }
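As an illustration, the session state could be built from the tool-availability results like this. The names `detect_tools` and `new_session_state` are hypothetical, `shutil.which` is a simpler presence check than the `kubectl version --client` and `oci --version` calls the skill asks for, and reading a `fallbacks` flag of `true` as "this CLI is missing" is an assumption drawn from the error-handling rules.

```python
import shutil


def detect_tools() -> dict:
    """Presence check only (assumption); the skill itself runs the version commands."""
    return {
        "kubectl": shutil.which("kubectl") is not None,
        "oci": shutil.which("oci") is not None,
    }


def new_session_state(symptom: str, namespace: str = "", time_window: str = "1h", tools=None) -> dict:
    """Build the initial session state; fields mirror the documented structure."""
    tools = tools if tools is not None else detect_tools()
    return {
        "symptom": symptom,
        "namespace": namespace,
        "time_window": time_window,
        "cluster_ocid": "",
        "compartment_ocid": "",
        "region": "",
        "domains": [],
        "dependency_map": {
            "entrypoint": "",
            "hops": [],
            "critical_path": [],
            "latency_budget_ms": {},
        },
        # Assumption: a fallback flag is True when the matching CLI is unavailable.
        "fallbacks": {"kubectl": not tools["kubectl"], "oci": not tools["oci"]},
        "evidence": [],
        "node_doctor": {
            "enabled": False,
            "execution_mode": "ask_then_execute",
            "image": "",
            "targets": [],
            "results": [],
        },
    }
```

Passing `tools` explicitly keeps the builder deterministic, which helps when the same state must be reproduced for a delegated collector.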
Phase 1 — Symptom Triage
- Load `symptom-triage.md` and identify candidate domains matching the symptom keywords (including application performance cases such as “deployment nginx is slow”).
- Present the suggested domains to the user with brief rationales. Allow them to:
- Confirm the list.
- Add or remove domains.
- Provide additional focus (specific pod, service, node pool, PVC, IAM entity).
- For application latency symptoms, model dependency context before evidence collection:
- Capture request entrypoint (Ingress/API/Job), target deployment, and downstream services (internal and external).
- Mark critical-path dependencies vs optional/background calls.
- Capture baseline latency and per-hop budget when known.
- Capture clarifying answers (from the table's questions) and store them in session state (e.g., `POD_NAME`, `SERVICE_NAME`, `DEPLOYMENT_NAME`, `LABEL_SELECTOR`, `BASELINE_LATENCY`, `DEPENDENCY_MAP`).
Phase 2 — Dependency Path Modeling
- Build a dependency map before running domain collectors when latency/throughput symptoms are present.
- Dependency map structure:

      {
        "entrypoint": "ingress/payments",
        "hops": [
          {"from": "ingress/payments", "to": "deployment/payments-api", "protocol": "HTTP"},
          {"from": "deployment/payments-api", "to": "svc/orders", "protocol": "gRPC"},
          {"from": "deployment/payments-api", "to": "svc/redis", "protocol": "TCP"}
        ],
        "critical_path": [
          "ingress/payments->deployment/payments-api",
          "deployment/payments-api->svc/orders"
        ],
        "latency_budget_ms": {
          "end_to_end_p99": 500,
          "ingress/payments->deployment/payments-api": 120,
          "deployment/payments-api->svc/orders": 220
        }
      }
- If dependency data is incomplete, continue with a partial map and explicitly mark confidence reduction in later phases.
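A short sketch of a consistency check for such a map, useful for deciding whether the map is complete or only partial. The function `validate_dependency_map` is a hypothetical helper; the `from->to` key convention is inferred from the example structure rather than stated as a rule.

```python
def validate_dependency_map(dep_map: dict) -> list:
    """Return a list of human-readable problems; an empty list means the map is consistent."""
    problems = []
    hops = dep_map.get("hops", [])
    # Hop identifiers follow the "from->to" convention used in the example map.
    hop_ids = {f'{h["from"]}->{h["to"]}' for h in hops}

    entrypoint = dep_map.get("entrypoint")
    if entrypoint and not any(h["from"] == entrypoint for h in hops):
        problems.append(f"entrypoint {entrypoint} has no outgoing hop")

    for edge in dep_map.get("critical_path", []):
        if edge not in hop_ids:
            problems.append(f"critical_path edge {edge} is not a declared hop")

    for key in dep_map.get("latency_budget_ms", {}):
        if key != "end_to_end_p99" and key not in hop_ids:
            problems.append(f"latency budget {key} is not a declared hop")
    return problems
```

A non-empty result does not have to stop the investigation; it is one way to record why confidence is reduced in later phases.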
Phase 3 — Evidence Collection
- Build the OCI object correlation graph before domain-specific collectors when enough selectors are known.
- Run the correlator with all discovered selectors, even if only one target object is known:

      bash ../../scripts/oke-object-correlator.sh \
        --namespace <ns> \
        --cluster-id <cluster_ocid> \
        --compartment-id <compartment_ocid> \
        --region <region> \
        [--pod <pod>] \
        [--deployment <deployment>] \
        [--service <service>] \
        [--ingress <ingress>] \
        [--pvc <pvc>] \
        [--node <node>]
- Treat the output as evidence with fields: `domain`, `graph.kubernetes`, `graph.oci`, `graph.edges`, `findings`, `anomalies`, `raw_snippets`, and `fallback_used`.
- Use the graph to narrow follow-on checks. Examples:
- If a Service maps to an OCI Load Balancer with unhealthy backend health, focus on backend set, node subnets, NSGs, security lists, route tables, gateways or peering paths, and endpoint readiness.
- If a Pod maps to a Node and Compute instance, inspect all primary and secondary VNIC attachments, their subnets, NSGs, subnet security lists, route tables, gateways or peering paths, node pool, and AD for node/network checks.
- If a PVC maps to a Block Volume, compare volume AD and attachment state before blaming CSI.
- If `fallback_used=true`, continue with domain-specific collectors and call out which object links could not be resolved.
- For each selected domain:
- Look up required commands in `evidence-collectors.md`.
- Build command batches with placeholders filled (namespace, resource names, compartment OCID, time window, and dependency hop identifiers when present).
- Auto-run read-only evidence commands without prompting when tools are available.
- Never auto-run remediation or mutating commands. Present the exact command, explain the expected impact, and wait for explicit user approval before running it.
- Example command item: `{ "cmd": "kubectl describe pod trainer-0 -n ml-team", "purpose": "Inspect scheduling events" }`
- For Networking/LB investigations, prefer the dedicated LB collector when delegation is available. Otherwise run the LB commands from `evidence-collectors.md` locally and normalize the same output fields in the parent skill.
- Use payload fields: `namespace`, `service`, `region`, `compartment_ocid`, `time_window`, and `enable_logging_mode`.
- Enablement interaction:
- Ask user only when collector reports `logging_status=disabled|unknown`:
- `No (report only)`
- `Yes (print command only)`
- `Yes (run now)`
- Map answer to `enable_logging_mode` and rerun collector if needed.
- Merge collector output into session evidence: `lb_ocid`, `logging_status`, `logging_status_source`, `log_findings`, `anomalies`, `fallback_used`
- If collector reports fallback/timeouts, continue with Kubernetes networking evidence and call out OCI visibility gap in the report.
- For OKE-specific domains, prefer the dedicated helper script before generic command batches:
- OKE Add-ons Health: `bash ../../scripts/oke-addon-health.sh --namespace kube-system`
- Pod Networking / OCI CNI / IPAM: `bash ../../scripts/oke-pod-network-check.sh --namespace <ns> [--pod <pod>] [--selector <label-selector>]`
- Cluster Autoscaler / Node Pool Scaling: `bash ../../scripts/oke-autoscaler-check.sh --namespace <ns> [--deployment <deployment>] --cluster-id <cluster_ocid> --compartment-id <compartment_ocid> --region <region>`
- DNS / Service Discovery: `bash ../../scripts/oke-dns-check.sh --namespace <ns> [--service <service>] [--pod <pod>] [--lookup <dns-name>]`
- Ingress / OCI Native Ingress: `bash ../../scripts/oke-ingress-check.sh --namespace <ns> --ingress <ingress> [--region <region>]`
- Private Cluster / API Endpoint Connectivity: `bash ../../scripts/oke-private-endpoint-check.sh --cluster-id <cluster_ocid> --region <region> [--compartment-id <compartment_ocid>]`
- OCIR / Image Pull: `bash ../../scripts/oke-ocir-image-pull-check.sh --namespace <ns> --pod <pod> [--image <image>] [--compartment-id <compartment_ocid>] [--region <image_region>]`
- Workload Identity / OCI API From Pods: `bash ../../scripts/oke-workload-identity-check.sh --namespace <ns> --serviceaccount <sa> [--pod <pod>] [--tenancy-id <tenancy_ocid>]`
- Incident Timeline: `bash ../../scripts/oke-incident-timeline.sh --namespace <ns> [--pod <pod>] [--deployment <deployment>] [--service <service>] [--compartment-id <compartment_ocid>] [--region <region>]`
- Treat helper JSON output as evidence with fields: `domain`, `findings`, `anomalies`, `raw_snippets`, and `fallback_used`.
- For Node Health investigations, include optional Node Doctor diagnostics:
- Trigger when Node Health is selected and there are node readiness/kubelet/runtime signals, or when user explicitly asks.
- Scope starts with one candidate node first, then ask whether to continue to additional nodes.
- Default debug image to `docker.io/library/ubuntu` each run (`kubectl debug ... --image=<image-name>`), and allow user override. Keep the selected image in session for additional nodes unless user changes it.
- Before execution, present the exact sequence and ask explicit confirmation per node:
- `bash ../../scripts/node-doctor-run.sh --node <node-name> --image <image-name>`
- (script executes `kubectl debug` + `chroot /host` + `sudo /usr/local/bin/node-doctor.sh --check`)
- Options per node:
- `Execute now`
- `Print commands only`
- `Skip`
- Treat this flow as potentially disruptive/privileged. Never auto-run without confirmation.
- Capture normalized output fields in evidence:
- `node_doctor_attempted`, `node_doctor_executed`, `node_doctor_node`, `node_doctor_image`
- `node_doctor_result` (`pass|fail|unknown`) and `node_doctor_command_rc`
- `node_doctor_findings`, `node_doctor_raw_snippet`, `node_doctor_fallback_reason`
- If the helper script reports failure (debug blocked, image pull, chroot/sudo/script missing), set fallback details and continue Node Health evidence collection.
- Assemble collector input payload:

      {
        "symptom": "...",
        "domains": ["Pod Scheduling"],
        "namespace": "...",
        "time_window": "...",
        "selectors": {"pod": "...", "service": "...", "deployment": "...", "label": "..."},
        "dependency_map": {
          "entrypoint": "...",
          "hops": [],
          "critical_path": [],
          "latency_budget_ms": {}
        },
        "object_graph": {...},
        "fallbacks": {"kubectl": false, "oci": true},
        "compartment_ocid": "..."
      }
- Execute the prepared command list.
- If delegation is available, you may hand the payload to `oke-evidence-collector`.
- Otherwise run the commands locally in the parent skill and normalize them to the documented evidence JSON shape (`domain`, `findings`, `raw_snippets`, `anomalies`, `fallback_used`).
- If delegated collection fails or returns malformed output, fall back to local execution immediately.
- After all domains are processed, summarize key findings to the user before analysis. Note any `fallback_used` signals or missing data.
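The normalization step might look like the following sketch. `normalize_evidence` is a hypothetical helper, and the choice to mark malformed collector output with `fallback_used: true` is an assumption; the source only names the five fields of the evidence shape.

```python
def normalize_evidence(domain: str, raw) -> dict:
    """Coerce collector or helper output into the documented evidence shape."""
    if not isinstance(raw, dict):
        # Malformed output (for example, plain text from a failed delegate).
        return {
            "domain": domain,
            "findings": [],
            "raw_snippets": [],
            "anomalies": ["collector returned malformed output"],
            "fallback_used": True,
        }

    record = {"domain": raw.get("domain") or domain}
    for field in ("findings", "raw_snippets", "anomalies"):
        value = raw.get(field)
        # Missing or wrongly typed fields become empty lists rather than errors.
        record[field] = value if isinstance(value, list) else []
    record["fallback_used"] = bool(raw.get("fallback_used", False))
    return record
```

Running both delegated and local results through the same function keeps the analyst payload uniform regardless of which path produced the evidence.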
Phase 4 — Hypothesis Ranking
- Construct analyst payload containing:

      {
        "symptom": "...",
        "domains": [...],
        "dependency_map": {...},
        "object_graph": {...},
        "evidence": [...],
        "fallbacks": {"kubectl": false, "oci": true}
      }
- Rank hypotheses.
- If delegation is available, you may use `oke-hypothesis-analyst`.
- Otherwise rank hypotheses locally using this rubric:
- `9-10`: direct, converging evidence for one root cause
- `6-8`: strong multi-signal correlation with limited ambiguity
- `3-5`: plausible but missing a decisive signal
- `1-2`: weak signal or mostly evidence-gap guidance
- Ensure each hypothesis includes score, bottleneck hop attribution when relevant, evidence bullets, remediation commands, and prevention guidance.
- Prefer hypotheses that are supported by explicit graph edges across Kubernetes and OCI resources over hypotheses supported only by isolated symptoms.
- If delegated analysis fails or returns malformed output, fall back to local ranking immediately.
- Validate that evidence quotes reference actual snippets collected. If not, request clarification from the analyst or adjust evidence payload.
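The rubric lookup and the quote validation can be sketched together. Both function names are hypothetical, and treating a quote as valid when it appears as a substring of any collected `raw_snippets` entry is an assumed matching rule; the source only requires that quotes reference actual snippets.

```python
# Score bands copied from the local ranking rubric.
RUBRIC = [
    (9, 10, "direct, converging evidence for one root cause"),
    (6, 8, "strong multi-signal correlation with limited ambiguity"),
    (3, 5, "plausible but missing a decisive signal"),
    (1, 2, "weak signal or mostly evidence-gap guidance"),
]


def rubric_band(score: int) -> str:
    """Return the rubric description for a 1-10 hypothesis score."""
    for low, high, meaning in RUBRIC:
        if low <= score <= high:
            return meaning
    raise ValueError(f"score {score} is outside the 1-10 rubric")


def unsupported_quotes(quotes: list, evidence: list) -> list:
    """Return the quotes that do not appear in any collected raw snippet."""
    snippets = [s for item in evidence for s in item.get("raw_snippets", [])]
    return [q for q in quotes if not any(q in s for s in snippets)]
```

A non-empty result from `unsupported_quotes` is the trigger for asking the analyst to clarify or for adjusting the evidence payload.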
Phase 5 — Report & Next Steps
- Load `final-report-template.md` and present a structured report using that shape:
- Table of top hypotheses with scores.
- Highlight confidence level (e.g., `High`, `Medium`, `Low` based on score thresholds).
- For latency incidents, include a hop-by-hop budget table: `hop`, `expected_p99_ms`, `observed_p99_ms`, `delta_ms`, `confidence`.
- Remediation commands rendered in fenced code blocks, prefixed with comments where necessary. Do not execute them unless the user approves the exact command or action after seeing it.
- Prevention recommendations as concise bullet points.
- Call out any limitations: missing tooling, commands that failed, domains not yet explored, and missing dependency telemetry.
- Offer next actions:
- Rerun for another namespace/resource.
- Deep-dive into IAM, capacity, or service-limit evidence when the collected data points there.
- Export findings to a file (future enhancement).
- Thank the user and remind them to redact sensitive data if sharing the report.
Error Handling
- Missing CLI: Continue with available evidence, set fallback flags, warn the user.
- Permission denied or forbidden: include remediation (e.g., "ensure tenancy OCID has access to compartment").
- Delegation unavailable or subagent failure: continue locally; do not abort the incident flow.
- Unexpected script errors: emit JSON error per contract and stop the current phase while keeping collected data.
Security & Logging
- Do not echo secret values or service account tokens. Redact with `***`.
- Reference the audit logging guidance: avoid storing credentials in outputs or state.
- Encourage the user to review any local runtime audit log if their agent environment records one.
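One way to apply the `***` redaction before evidence is stored or shown is sketched below. The patterns are assumptions chosen for illustration (JWT-shaped strings and `key: value` pairs whose key mentions password, secret, or token); they are neither exhaustive nor taken from the skill's references, and they will also mask harmless values such as a `secretName` field.

```python
import re

# JWT-shaped strings: three base64url segments, the first starting with "eyJ".
JWT_RE = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")
# key=value or key: value pairs whose key mentions a sensitive word.
KV_RE = re.compile(
    r"\b([\w-]*(?:password|secret|token)[\w-]*)(\s*[:=]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace token-like values with *** while keeping keys for context."""
    text = JWT_RE.sub("***", text)
    return KV_RE.sub(lambda m: m.group(1) + m.group(2) + "***", text)
```

Keeping the key name and masking only the value preserves enough context for the report to say which credential was involved.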
Invocation Examples
- `Use the OKE troubleshooter for pods stuck Pending in prod namespace`
- `Use the OKE troubleshooter for lb service has no IP us-phoenix-1`
- `Use the OKE troubleshooter for cluster api timing out`
- `Use the OKE troubleshooter for customer is indicating poor performance for deployment`
- `Use the OKE troubleshooter for CoreDNS timeouts in prod`
- `Use the OKE troubleshooter for cluster autoscaler is not adding nodes`
- `Use the OKE troubleshooter for pods fail sandbox creation with OCI CNI IPAM errors`
- `Use the OKE troubleshooter for OCIR ImagePullBackOff unauthorized`
- `/oke-troubleshooter "workload identity pod gets NotAuthorized"`
- `/oke-troubleshooter "private OKE API endpoint unreachable"`
- `/oke-troubleshooter "OCI native ingress TLS backend errors"`
Latency Walkthrough (Dependency-Aware)
Use this pattern when the incident is "deployment is slow" and the deployment depends on other services.
- Input Example
- Symptom: `"payments API p99 jumped from 350ms to 1.8s"`
- Namespace: `prod`
- Deployment: `payments-api`
- Time window: `1h`
- Dependency Map Example

      {
        "entrypoint": "ingress/payments",
        "hops": [
          {"from": "ingress/payments", "to": "deployment/payments-api", "protocol": "HTTP"},
          {"from": "deployment/payments-api", "to": "svc/orders", "protocol": "gRPC"},
          {"from": "deployment/payments-api", "to": "svc/redis", "protocol": "TCP"}
        ],
        "critical_path": [
          "ingress/payments->deployment/payments-api",
          "deployment/payments-api->svc/orders"
        ],
        "latency_budget_ms": {
          "end_to_end_p99": 500,
          "ingress/payments->deployment/payments-api": 120,
          "deployment/payments-api->svc/orders": 220,
          "deployment/payments-api->svc/redis": 80
        }
      }
- Expected Evidence Interpretation
- Compare observed hop p99 to budget and compute delta.
- Identify the largest over-budget hop on critical path first.
- Validate with both client-side and server-side evidence when possible.
- Expected Report Snippet

Hypothesis: `"Orders dependency latency spike is primary bottleneck"`

Confidence: `High` when both sides of hop agree.

Budget table:

| Hop | Expected p99 (ms) | Observed p99 (ms) | Delta (ms) | Confidence |
| --- | --- | --- | --- | --- |
| ingress/payments->payments-api | 120 | 140 | +20 | Medium |
| payments-api->orders | 220 | 980 | +760 | High |
| payments-api->redis | 80 | 95 | +15 | Medium |

Remediation should target `payments-api->orders` first, then re-measure end-to-end p99.
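The delta column and the "largest over-budget hop on the critical path" rule from this walkthrough can be sketched as follows. The helper names are hypothetical, and the observed values are the ones from the example budget table; the confidence column is left out because it depends on whether client-side and server-side evidence agree.

```python
def budget_rows(budget: dict, observed: dict) -> list:
    """Build hop rows with delta_ms = observed p99 minus expected p99."""
    rows = []
    for hop, expected in budget.items():
        # Skip the end-to-end entry and hops without an observation.
        if hop == "end_to_end_p99" or hop not in observed:
            continue
        rows.append({
            "hop": hop,
            "expected_p99_ms": expected,
            "observed_p99_ms": observed[hop],
            "delta_ms": observed[hop] - expected,
        })
    return rows


def worst_critical_hop(rows: list, critical_path: list):
    """Return the critical-path hop with the largest positive delta, or None."""
    over = [r for r in rows if r["hop"] in critical_path and r["delta_ms"] > 0]
    return max(over, key=lambda r: r["delta_ms"])["hop"] if over else None


budget = {
    "end_to_end_p99": 500,
    "ingress/payments->deployment/payments-api": 120,
    "deployment/payments-api->svc/orders": 220,
    "deployment/payments-api->svc/redis": 80,
}
observed = {
    "ingress/payments->deployment/payments-api": 140,
    "deployment/payments-api->svc/orders": 980,
    "deployment/payments-api->svc/redis": 95,
}
critical_path = [
    "ingress/payments->deployment/payments-api",
    "deployment/payments-api->svc/orders",
]
rows = budget_rows(budget, observed)
print(worst_critical_hop(rows, critical_path))
```

With these inputs the deltas are +20, +760, and +15 ms, so the orders hop is selected, matching the remediation order in the report snippet.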
The skill should deliver actionable insight even when only partial data is available.
Sources
- https://docs.oracle.com/en-us/iaas/Content/ContEng/home.htm
- https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengdownloadkubeconfigfile.htm
- https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengpodnetworking.htm
- https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengconfiguringclusteraddons.htm
- https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contenggrantingworkloadaccesstoresources.htm
- https://docs.oracle.com/en-us/iaas/Content/Balance/home.htm
- https://docs.oracle.com/en-us/iaas/Content/Logging/home.htm
- https://kubernetes.io/docs/tasks/debug/debug-application/
- https://kubernetes.io/docs/reference/kubectl/