debug


Diagnose problems with an installed LlamaCloud release — pods crashing, services not reachable, login broken, parses failing, OCR queue stuck, etc. Use when the user says "something's wrong", "pods are stuck", "X is failing", "the install came up but something is broken", or pastes a kubectl error. Performs a structured symptom-to-cause walk using read-only kubectl and cloud CLI commands, then matches against the LlamaCloud-specific pattern library in `patterns.md`. Produces a `debug-report.md` safe to share with LlamaIndex support.

By run-llama · Updated 4/30/2026


Debug a running LlamaCloud release

Goal: take "something is broken" and produce a structured diagnosis: which pod, what symptom, which values.yaml field (or external dependency) is the root cause, and what to change. Read-only against the cluster.

For known error signatures, the catalog is in patterns.md next to this file. Read it lazily — only when you've narrowed down to a candidate symptom — to keep this skill's context budget small.

Step 0 — Confirm context

kubectl config current-context
helm list -A | grep -i llama

Read the context back to the user and confirm. Identify the release(s). If multiple, ask which one.
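
The release-identification step can be sketched as a small filter. This is a minimal sketch, assuming the default table output of `helm list -A` (NAME in column 1, NAMESPACE in column 2); the function name `llama_releases` is hypothetical and not part of the chart or of Helm.

```shell
# Hypothetical helper: read `helm list -A` table output on stdin and print
# "<namespace>/<release>" for every row that mentions "llama" (case-insensitive).
llama_releases() {
  awk 'NR > 1 && tolower($0) ~ /llama/ { print $2 "/" $1 }'
}

# Usage (read-only):
#   helm list -A | llama_releases
```

If more than one line is printed, that is the cue to ask the user which release to debug.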

Step 1 — Health snapshot

helm status <release> -n <ns>
helm history <release> -n <ns> | tail -10
kubectl get pods -n <ns> -o wide
kubectl get events -n <ns> --sort-by=.lastTimestamp | tail -50
kubectl get pdb,hpa,ingress,svc,pvc -n <ns>

Tabulate the pod list. Note the deployment naming convention: the LlamaCloud components have fixed deployment names (the chart does not prefix them with the release name), while subchart components do use the <release>- prefix.

| Group | Deployment / pod prefix | What's normal |
| --- | --- | --- |
| Backend | `llamacloud-*` (the bare name, not `llamacloud-backend`) | 1+ Ready |
| Frontend | `llamacloud-web-*` | 1+ Ready |
| Jobs service | `llamacloud-operator-*` | 1+ Ready |
| Jobs worker | `llamacloud-worker-*` | 1+ Ready |
| Usage / telemetry | `llamacloud-telemetry-*` | 1 Ready |
| Parse | `llamacloud-parse-*` | 1+ Ready |
| Parse OCR (GPU) | `llamacloud-ocr-*` | 0+ Ready, depending on `config.parseOcr.enabled` |
| Layout detection (GPU) | `llamacloud-layout-*` | 0+ Ready, depending on `config.parseLayoutDetection.enabled` |
| Layout V3 (GPU) | `llamacloud-layout-v3-*` | 0+ Ready, depending on `config.parseLayoutDetectionV3.enabled` |
| Parse Temporal worker | `llamacloud-temporal-parse-*` | 1+ Ready when temporal workloads enabled |
| Jobs Temporal worker | `temporal-jobs-worker-*` | 1+ Ready when temporal workloads enabled |
| Temporal subchart | `<release>-temporal-subchart-*` | frontend / history / matching / worker / web / admintools all Ready |
| llama-agents subchart | varies | optional |

When the user gives a pod name, treat the bare llamacloud-<hash>-<random> as the backend unambiguously. This trips up newcomers because the obvious guess is "frontend".
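
As a triage aid, the prefix table can be turned into a lookup. This is a minimal sketch: `component_for_pod` and the group labels it prints are hypothetical, and it relies only on the prefixes listed in the table. More specific prefixes are matched first so that the bare `llamacloud-*` case falls through to the backend.

```shell
# Hypothetical helper: map a pod name to its component group using the
# prefixes from the table. Order matters: specific prefixes come first.
component_for_pod() {
  case "$1" in
    llamacloud-web-*)            echo "frontend" ;;
    llamacloud-operator-*)       echo "jobs-service" ;;
    llamacloud-worker-*)         echo "jobs-worker" ;;
    llamacloud-telemetry-*)      echo "usage-telemetry" ;;
    llamacloud-temporal-parse-*) echo "parse-temporal-worker" ;;
    llamacloud-parse-*)          echo "parse" ;;
    llamacloud-ocr-*)            echo "parse-ocr" ;;
    llamacloud-layout-v3-*)      echo "layout-v3" ;;
    llamacloud-layout-*)         echo "layout-detection" ;;
    temporal-jobs-worker-*)      echo "jobs-temporal-worker" ;;
    *-temporal-subchart-*)       echo "temporal-subchart" ;;
    llamacloud-*)                echo "backend" ;;
    *)                           echo "unknown" ;;
  esac
}
```

Name matching like this is only a first guess for a pasted pod name; it does not replace checking the pod's labels.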

To unambiguously isolate one component, use label selectors instead of name globs:

kubectl get pods -n <ns> -l app.kubernetes.io/name=llamacloud           # backend only
kubectl get pods -n <ns> -l app.kubernetes.io/name=llamacloud-web        # frontend
kubectl get pods -n <ns> -l app.kubernetes.io/name=llamacloud-worker     # jobs worker
kubectl get pods -n <ns> -l app.kubernetes.io/name=llamacloud-parse      # parse
# ...etc — every component has matching app.kubernetes.io/name

When <component>.horizontalPodAutoscalerSpec is set on a component (default: empty, no HPA), the chart creates a standard HorizontalPodAutoscaler named after the Deployment (e.g. llamacloud, llamacloud-web). If you see HPAs with different naming (e.g. keda-hpa-*-scaledobject), that's the customer's KEDA / external autoscaler managing the same Deployments — the Deployment/<name> reference is authoritative regardless of how the HPA was named.
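
To see which autoscaler is attached to which Deployment, the HPA list can be keyed on its scale target rather than its name. A minimal sketch, assuming the `keda-hpa-*` prefix from the example above is the only external naming in play; `hpa_origin` and its output labels are hypothetical.

```shell
# Hypothetical helper: read "<hpa-name> <target-deployment>" pairs on stdin
# and print "<target> <hpa-name> <origin>" for each one.
hpa_origin() {
  while read -r name target; do
    case "$name" in
      keda-hpa-*) origin="keda-or-external" ;;  # prefix taken from the example above
      "$target")  origin="chart" ;;             # HPA named after its Deployment
      *)          origin="other" ;;
    esac
    echo "$target $name $origin"
  done
}

# Usage (read-only):
#   kubectl get hpa -n <ns> --no-headers \
#     -o custom-columns=NAME:.metadata.name,TARGET:.spec.scaleTargetRef.name | hpa_origin
```

Two rows with the same target and different origins would suggest that a chart HPA and an external autoscaler are both managing one Deployment.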

Standard Secrets the chart creates (look for these when debugging missing-secret errors). Each is generated only when the corresponding values block is configured inline; if the user sets the matching *.secret: reference instead, the chart skips generating its own Secret and reads from the user-provided one.

Always created (when the corresponding values block is set inline):

| Secret | Triggered by | Purpose |
| --- | --- | --- |
| `llamacloud-license-key` | `license.key` | License |
| `postgresql-secret` | `postgresql.password` | Postgres credentials |
| `mongodb-secret` | `mongodb.password` or `mongodb.mongodb_url` | MongoDB credentials / connection URL |
| `rabbitmq-secret` | `rabbitmq.password` or `rabbitmq.connectionString` | RabbitMQ credentials |
| `redis-secret` | `redis.password` | Redis credentials |
| `bucket-secret` | `config.storageBuckets.*` credentials | Object storage credentials (for non-IRSA setups, e.g. static AWS keys, GCS HMAC, Azure SAS) |

Conditional — only when the matching feature is enabled:

| Secret | Triggered by | Purpose |
| --- | --- | --- |
| `oidc-secret` | `config.authentication.oidc.enabled: true` | OIDC client credentials |
| `basic-auth-secret` | basic auth enabled (mutually exclusive with OIDC) | Basic auth credentials |
| `temporal-postgresql-secret` | `temporal.deploy: true` | Separate Postgres credential copy for the Temporal subchart |
| `qdrant-secret` | `qdrant.enabled: true` | Qdrant API key |
| `s3proxy-secret` | `config.storageBuckets.s3proxy.enabled: true` | s3proxy upstream credentials (e.g. GCS HMAC, Azure storage key) |

LLM provider Secrets — one per configured provider:

| Secret | Triggered by | Purpose |
| --- | --- | --- |
| `openai-api-key-secret` | `config.llms.openAi.apiKey` set | OpenAI API key |
| `anthropic-api-key-secret` | `config.llms.anthropic.apiKey` set | Anthropic API key |
| `gemini-api-key-secret` | `config.llms.gemini.apiKey` set | Google Gemini API key |
| `azure-open-ai-api-key-secret` | `config.llms.azureOpenAi.deployments[]` configured | Azure OpenAI deployment keys (one Secret holding keys for all configured deployments) |
| `aws-bedrock-api-key-secret` | `config.llms.awsBedrock` block present | AWS Bedrock config (model versions; auth via IRSA, not a static key, but the Secret still carries config values) |
| `google-vertex-ai-api-key-secret` | `config.llms.googleVertexAi.credentialsJson` set | Vertex AI service-account JSON |
| `llm-provider-configs-secret` | always (when any LLM provider configured) | Aggregated provider config (base URLs, model names, deployment names, etc.) referenced by backend env vars |

Ingress TLS (NOT created by the chart):

| Secret | Triggered by | Purpose |
| --- | --- | --- |
| `<ingress.tlsSecretName>` | `ingress.enabled: true` references it | Ingress TLS. The chart does not create this Secret; the user provides it (cert-manager, externally-managed cert, or `kubectl create secret tls`). |
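
When chasing a missing-secret error, the Secret names from these tables can be compared against what exists in the namespace. A minimal sketch that lists names only and never reads Secret values; `missing_secrets` is a hypothetical helper, and which names to pass depends on what the release's values actually configure.

```shell
# Hypothetical helper: print every expected Secret name (given as arguments)
# that does not appear in the list of existing names read from stdin.
missing_secrets() {
  existing=$(cat)
  for name in "$@"; do
    printf '%s\n' "$existing" | grep -qxF -- "$name" || echo "$name"
  done
}

# Usage (read-only):
#   kubectl get secrets -n <ns> -o name | sed 's|^secret/||' \
#     | missing_secrets llamacloud-license-key postgresql-secret redis-secret
```

A name reported as missing is not automatically a fault: if the matching `*.secret:` reference is set, the chart skips generating its own Secret and the user-provided one should be checked instead.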

ConfigMaps the chart creates:

Centralized config (always present when chart deployed):

| ConfigMap | Purpose |
| --- | --- |
| `common-config` | Env shared across most deployments |
| `urls-config` | Intra-cluster Service URLs each component talks to |
| `bucket-config` | Object storage bucket names and provider |
| `feature-config` | Feature flags |
| `extract-config` | Extract-specific tuning |
| `concurrency-config` | Concurrency limits per workload |
| `rate-limits-config` | Rate limits per LLM / API |
| `temporal-connection-config` | Temporal frontend host/port |

Conditional:

| ConfigMap | Triggered by | Purpose |
| --- | --- | --- |
| `llama-agents-config` | `llamaAgents.deploy: true` or `llamaAgents.controlPlaneUrl` set | llama-agents wiring |
| `s3proxy-config` | `config.storageBuckets.s3proxy.enabled: true` | s3proxy JCLOUDS configuration (provider, region, endpoint) |

Per-deployment ConfigMaps (one per component, named after the Deployment):

llamacloud-web, llamacloud-worker, llamacloud-parse, llamacloud-temporal-parse, temporal-jobs-worker, etc. — these hold component-specific env. When debugging a single component's environment, this is the ConfigMap to read.

Temporal subchart ConfigMaps (when temporal.deploy: true):

<release>-temporal-subchart-dockerize-config, <release>-temporal-subchart-dynamic-config — Temporal server configuration.

For each pod that is not Ready or has restarted, capture the snapshot row.
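
The "not Ready or has restarted" filter can be applied to the pod listing directly. A minimal sketch, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, ...), which `-o wide` keeps for the first four columns; `unhealthy_pods` is a hypothetical name.

```shell
# Hypothetical helper: read `kubectl get pods` table output on stdin and print
# the rows where READY is not n/n or RESTARTS is greater than zero.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")                      # READY looks like "1/1"
    if (ready[1] != ready[2] || $4 + 0 > 0) print
  }'
}

# Usage (read-only):
#   kubectl get pods -n <ns> -o wide | unhealthy_pods
```

Completed Job pods show up as `0/1` and will be listed too, so read the STATUS column before treating a row as a failure.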

Step 2 — Symptom routing

For each unhealthy pod, classify by status.phase + containerStatuses[].state + last event:

| Symptom | First action | Then |
| --- | --- | --- |
| Pending, FailedScheduling | `kubectl describe pod` → read Events | check node capacity, taints, GPU availability |
| Pending, Unschedulable due to PVC | `kubectl get pvc -n <ns>` and `kubectl get pv` | StorageClass / PV provisioner issue |
| Waiting: ImagePullBackOff / ErrImagePull | `kubectl describe pod` → read pull error | imagePullSecrets, image tag existence, registry reachability |
| Waiting: CreateContainerConfigError | `kubectl describe pod` → secret/configmap reference | named secret or key missing; check `<release>-*-env` Secrets |
| Waiting: CrashLoopBackOff | `kubectl logs <pod> --previous` | match against patterns.md |
| Running, 0/1 Ready, restartCount stable | readiness probe failing | `kubectl describe pod` → readiness probe + check Service endpoints |
| Running, 0/1 Ready, restartCount climbing | liveness probe killing it | `kubectl describe pod` → liveness probe + recent logs |
| Terminated: OOMKilled | check `resources.limits.memory` vs `kubectl top pod` | bump limit or downsize work |
| Terminated: Error, exit code 137 | similar to OOM (signal 9) | check node memory pressure |
| Terminated: Error, exit code 143 | SIGTERM during graceful shutdown; usually rollout, not a bug | only worry if it's an active loop |
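
The routing table can be read as a lookup from container state reason to first action. A minimal sketch covering only the rows that carry a Kubernetes reason string; `first_action` and its output wording are hypothetical. Exit codes above 128 encode a signal as 128 + signal number, which is why 137 is SIGKILL (9) and 143 is SIGTERM (15).

```shell
# Hypothetical helper: given a state reason (and optionally an exit code),
# print the first diagnostic action from the routing table.
first_action() {
  reason=$1
  exit_code=${2:-}
  case "$reason" in
    FailedScheduling)              echo "describe pod: read Events" ;;
    ImagePullBackOff|ErrImagePull) echo "describe pod: read pull error" ;;
    CreateContainerConfigError)    echo "describe pod: find missing secret/configmap reference" ;;
    CrashLoopBackOff)              echo "logs --previous: match against patterns.md" ;;
    OOMKilled)                     echo "compare resources.limits.memory with kubectl top pod" ;;
    Error)
      case "$exit_code" in
        137) echo "SIGKILL (128+9): check node memory pressure" ;;
        143) echo "SIGTERM (128+15): usually rollout, check for an active loop" ;;
        *)   echo "logs --previous: read exit reason" ;;
      esac ;;
    *) echo "describe pod: read Events" ;;
  esac
}

# Usage (read-only); the reason comes from the pod status, for example:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
```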

For pods in CrashLoopBackOff, always grab:

kubectl logs <pod> -n <ns> --all-containers --previous --tail=200
kubectl logs <pod> -n <ns> --all-containers --tail=100
kubectl describe pod <pod> -n <ns>

Then read the first 5–10 ERROR/FATAL lines and look them up in patterns.md.
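
Pulling the first few error lines out of a captured log can be sketched as follows. This assumes the components log the uppercase level strings `ERROR` and `FATAL`; `first_errors` is a hypothetical helper.

```shell
# Hypothetical helper: print the first N (default 10) lines on stdin that
# contain ERROR or FATAL, prefixed with their line number in the capture.
first_errors() {
  grep -nE 'ERROR|FATAL' | head -n "${1:-10}"
}

# Usage (read-only):
#   kubectl logs <pod> -n <ns> --all-containers --previous --tail=200 | first_errors
```

The line numbers make it easy to go back to the surrounding context once a candidate signature is found.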

Step 3 — Cross-pod / network checks

Some failures aren't a single bad pod, they're a routing problem between pods. Common ones:

# Service endpoints — if empty, no Pods match the Service selector or none are Ready
kubectl get endpoints -n <ns>

# DNS from inside a healthy pod (if any exist)
# Pick a Running pod; nslookup of dependency hostnames and sibling Services.
# NOTE: the LlamaCloud backend Service is named `llamacloud` (not
# `<release>-backend`); see the deployment-naming table above. Subchart
# Services like temporal DO use the `<release>-` prefix.
kubectl exec -n <ns> <healthy-pod> -- nslookup llamacloud
kubectl exec -n <ns> <healthy-pod> -- nslookup llamacloud-web
kubectl exec -n <ns> <healthy-pod> -- nslookup <release>-temporal-subchart-frontend
# (kubectl exec is read-only-equivalent for diagnosis; ask user before running this)

# NetworkPolicy egress
kubectl get networkpolicy -A
# The chart itself ships only ONE NetworkPolicy (`<release>-agent-backend-egress`,
# in the namespace configured by `llama-agents-subchart.apps.namespace` (default
# `llama-agents`), only when llamaAgents.deploy is true and
# llamaAgents.allowBackendEgress is true). It allows egress from agent pods TO
# the backend, not the other way around. Any other NetworkPolicy you see is
# customer-managed and may be blocking pod-to-pod traffic.

If kubectl exec is sensitive in the customer's environment (it usually is in production), ask before running and explain what you're checking. If declined, continue with describe/logs only.
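
The endpoints check needs no exec at all, so it is a reasonable first pass when exec is declined. A minimal sketch, assuming the default table output of `kubectl get endpoints`, where an empty ENDPOINTS column is printed as `<none>`; `empty_endpoints` is a hypothetical name.

```shell
# Hypothetical helper: read `kubectl get endpoints` table output on stdin and
# print the names of Services that currently have no endpoints.
empty_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Usage (read-only):
#   kubectl get endpoints -n <ns> | empty_endpoints
```

Each name printed is a Service whose selector matches no Pods or whose Pods are not Ready.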

Step 4 — Cloud-side checks (only if relevant)

If the symptom points at a managed dependency (DB connection refused, S3 access denied, OIDC discovery failing), and the cluster is on AWS/Azure/GCP, run the same read-only cloud checks the preinstall-check skill uses (Phase 4). Specifically:

  • DB unreachable → aws ec2 describe-security-groups cross-check between cluster SG and DB SG
  • S3 access denied → aws iam get-role --role-name <irsa-role> and check the bucket policy / IRSA trust relationship
  • OIDC failures → curl <discoveryUrl> from the user's laptop and from a debug pod (with consent)

Ask before each cloud CLI call, even though every check in this step is read-only.
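
For the OIDC case, a fetched discovery document can be sanity-checked for the metadata an identity provider is expected to publish. A minimal sketch that does a plain-text key search rather than JSON parsing; `missing_oidc_keys` is hypothetical, and the three keys checked are a small subset of the OpenID Connect Discovery metadata.

```shell
# Hypothetical helper: read an OIDC discovery document (JSON) on stdin and
# print any of the listed metadata keys that are absent.
missing_oidc_keys() {
  doc=$(cat)
  for key in issuer authorization_endpoint jwks_uri; do
    printf '%s' "$doc" | grep -q "\"$key\"" || echo "$key"
  done
}

# Usage (with consent):
#   curl -fsS <discoveryUrl> | missing_oidc_keys
```

If all three keys are printed, the fetch most likely failed or returned something other than a discovery document.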

Step 5 — Match against patterns.md

patterns.md (next to this file) catalogs the highest-frequency LlamaCloud-specific failure signatures and maps each to:

  • The error message you'll see in logs
  • The likely root cause
  • The values.yaml field to change (or external action to take)
  • A pointer to the relevant examples/*.yaml

Read it now if any captured log line matches a pattern entry. Only read the relevant section, not the whole file — patterns.md is organized by component to make this efficient.

Step 6 — Write debug-report.md

Always write the report. This is what the user sends to support if you can't fix it.

# LlamaCloud debug report — <ISO timestamp UTC>

## Summary
<one-sentence statement of what's wrong, e.g. "3 backend pods in CrashLoopBackOff, all failing on Postgres connection refused. Most likely cause: cluster SG missing from RDS inbound rules.">

## Inputs
- Release: <name>
- Namespace: <ns>
- Chart version: <`helm get metadata <release>` output>
- Cluster context: <ctx>
- Cluster cloud: <aws|azure|gcp|other>

## Pod state
<table of all pods with phase / ready / restarts / age>

## Failing pods — evidence
<for each failing pod>:
  ### <pod-name>
  Symptom: <one-line>
  Container state: <Waiting/CrashLoopBackOff/etc>
  Recent events: <bulleted, sorted by time>
  Last log lines (previous container): <fenced code block, redacted>
  Matched pattern: <patterns.md anchor or "no match">
  Suggested fix: <concrete>

## Recent helm history
<helm history output>

## Cluster events (last hour)
<filtered event list>

## Cloud routing checks
<if run; what was confirmed / what was off>

## Recommended next steps
1. <highest-priority action>
2. <second-priority action>
3. ...

## What to send to support
This report (after a final secret review). Specifically:
- The above
- The output of `kubectl logs <pod> --previous --tail=500` for each failing pod (already captured above; re-run if more context needed)
- `helm get values <release> -n <ns> --revision <rev>` with secrets masked

Mask all secrets before saving. Run a final regex sweep over the report content for known secret patterns: password, apiKey, clientSecret, key, token, connectionString, credentialsJson, JWT-shaped strings, AWS access key ids, RSA/EC PEM headers. If unsure, mask.
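
The regex sweep could look something like the following. This is a minimal sketch, not a complete secret scanner: `mask_secrets`, the marker strings, and the exact patterns are all illustrative assumptions, and the output still needs a manual review.

```shell
# Hypothetical helper: mask common secret shapes in text read from stdin.
#  1. values after key names such as password / apiKey / token (": " or "=")
#  2. AWS access key ids (AKIA followed by 16 characters)
#  3. JWT-shaped strings (three base64url segments starting with eyJ)
#  4. every line from a PEM private key header through its footer
mask_secrets() {
  sed -E \
    -e 's/(([Pp]assword|[Aa]pi[Kk]ey|[Cc]lient[Ss]ecret|[Kk]ey|[Tt]oken|[Cc]onnection[Ss]tring|[Cc]redentials[Jj]son)[A-Za-z_]*"?[[:space:]]*[:=][[:space:]]*).*/\1***MASKED***/' \
    -e 's/AKIA[0-9A-Z]{16}/***MASKED-AWS-KEY-ID***/g' \
    -e 's/eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/***MASKED-JWT***/g' \
    -e '/-----BEGIN [A-Z ]*PRIVATE KEY-----/,/-----END [A-Z ]*PRIVATE KEY-----/s/.*/***MASKED-PEM***/'
}

# Usage:
#   mask_secrets < debug-report.draft.md > debug-report.md
```

The patterns deliberately over-match (any key name containing "key" is masked), in line with the "if unsure, mask" rule. The file name `debug-report.draft.md` in the usage line is only an example.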

Critical rules

  • Read-only. No kubectl edit, kubectl patch, kubectl delete, kubectl rollout restart, helm upgrade, helm rollback. Diagnose and recommend; do not "just try restarting the pod."
  • Ask before kubectl exec. It's technically read-only, but customers are sensitive about it. Explain what you'd check and let them say no.
  • Mask secrets in every file you write. Run a regex sweep before declaring a report written.
  • Don't re-run preinstall checks aggressively. If the install is up but unhealthy, the relevant subset of preinstall checks is fine — but full Phase 4 cloud surveys can flood the customer's CloudTrail with surprising activity.
  • One symptom at a time. If multiple pods are unhealthy with different errors, work through them in priority order (backend > workers > optional services), not in parallel — they often have a shared root cause.
  • Stop on cancel. If the user says "stop" or "that's enough", stop and write whatever is in the report so far.
Install via CLI
npx skills add https://github.com/run-llama/helm-charts --skill debug