name: troubleshoot-zymtrace-profiler
description: |
  Use when the zymtrace profiler agent is misbehaving: pods crash-looping, image pull errors, OOMKilled, CPU profiles working but no GPU profiles, NVML library not found, PC sampling not producing SASS-level data, or license errors on the profiler side. Walks symptom → diagnosis → fix. Focused on the agent itself; for "no data anywhere" use troubleshoot-zymtrace-backend. Trigger phrases: "profiler not working", "profiler pods crashing", "profiler CrashLoopBackOff", "profiler ImagePullBackOff", "agent OOMKilled", "no GPU profiles", "CPU profiles work but GPU doesn't", "NVML library not found", "PC sampling not working", "profiler license rejected", "agent restart cycle", "zymtrace agent unhealthy", "fix the profiler".
Troubleshoot zymtrace Profiler
Helps the user diagnose problems with the profiler agent — the DaemonSet (or Docker / binary) that ships profiles to the backend. Backend / ClickHouse / ingest problems live in troubleshoot-zymtrace-backend. If the customer's symptom is "the UI is empty" without a clearer signal, route there instead — it covers the cross-cutting "no data" diagnostic across both sides.
Greet the user
Open warmly. Diagnostic conversations start frustrated.
👋 Sorry the profiler isn't behaving — let's track it down. Tell me what you're seeing (e.g. "agent pods CrashLoopBackOff", "CPU profiles arrive but no GPU traces", "OOMKilled every 5 min", "license rejected") and I'll walk the diagnosis with you.
Need a human?
- Community Slack: https://join.slack.com/t/zymtrace/shared_invite/zt-3fdidjufl-q~NHxDzQlzal2B9mujfaoQ
- Email: support@zymtrace.com
Once the agent is healthy again, connect the zymtrace MCP to your agent (Claude Code, Codex, or Cursor) to analyze GPU + CPU flamegraphs in natural language; see configure-zymtrace-mcp. Docs: https://docs.zymtrace.com/mcp
If the user already named a symptom, skip the prompt and jump to the matching section below.
Pre-flight
Claude runs
helm version --short && kubectl version --client
kubectl cluster-info | head -2
helm list -A | grep -iE 'profiler.*zymtrace|zymtrace.*profiler'
If the profiler release isn't installed at all, that's not a "troubleshoot" — route to install-zymtrace-profiler.
Pre-resolve what you can
Recommend defaults: zymtrace/profiler. Resolve the customer's actual values first. Full policy: shared/conventions.md. Commands below use <NS>/<REL>/<PREFIX> placeholders.
| Variable | Resolve by |
|---|---|
| Profiler release + namespace | helm list -A | grep -iE 'profiler.*zymtrace|zymtrace.*profiler' |
| global.namePrefix (drives DaemonSet name) | helm get values <REL> -n <NS> | awk '/^[ \t]*namePrefix:/ {print $2}' (defaults to zymtrace) |
| Backend release (for cross-cluster context) | helm list -A | grep -iE 'backend.*zymtrace' |
| GPU profiling enabled? | helm get values <REL> -n <NS> | grep -A1 cudaProfiler | grep enabled |
| Current values (so any fix re-applies cleanly) | helm get values <REL> -n <NS> |
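The namePrefix lookup in the table can be wrapped so that the documented default is applied automatically. This is a minimal sketch: the function name name_prefix is hypothetical, and it assumes namePrefix appears as a simple one-line scalar in the helm values output. It matches leading whitespace with [ \t] rather than \s, since not every awk implementation understands \s.

```shell
# Read `helm get values` YAML on stdin and print the namePrefix value,
# falling back to the documented default "zymtrace" when the key is absent.
# name_prefix is a hypothetical helper name, not part of any zymtrace tooling.
name_prefix() {
  prefix=$(awk '/^[ \t]*namePrefix:/ {print $2; exit}')
  # Strip optional quotes around the YAML scalar.
  prefix=$(printf '%s' "$prefix" | tr -d "\"'")
  printf '%s\n' "${prefix:-zymtrace}"
}

# Usage (needs helm access to the cluster):
#   helm get values <REL> -n <NS> | name_prefix
```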
Things you may need to ask:
- The exact symptom (CrashLoopBackOff vs OOMKilled vs no-GPU-profiles are different paths).
- Recent changes: helm upgrade, node OS update, NVIDIA driver upgrade.
Symptom router
| Symptom | Section |
|---|---|
| Agent pods CrashLoopBackOff / not Ready | § CrashLoopBackOff |
| Agent pods ImagePullBackOff | § ImagePullBackOff |
| Agent OOMKilled / pod restart cycle | § OOMKilled / restart cycle |
| CPU profiles flow, but no GPU profiles | § No GPU profiles |
| NVML library not found in agent logs | § NVML library not found |
| PC sampling enabled but no SASS-level data | § PC sampling produces nothing |
| License rejected on the profiler side | § License rejected |
| UI is empty entirely (could be agent OR backend) | → troubleshoot-zymtrace-backend |
If the symptom isn't listed, ask the user to describe what they're seeing and pick the closest section.
A one-shot script that gathers the most common signals:
Claude runs
bash ${CLAUDE_PLUGIN_ROOT}/skills/troubleshoot-zymtrace-profiler/scripts/diagnose-agent-not-reporting.sh <NS> <REL>
It dumps DaemonSet readiness, pod status, recent logs, license/connection signals, and a describe for any non-Running pod. Use it as the opening move for most scenarios below.
CrashLoopBackOff
Most agent CrashLoopBackOff failures are one of four causes. Walk them in order.
Claude runs
kubectl describe pod -n <NS> -l app.kubernetes.io/component=profiler | tail -60
kubectl logs -n <NS> -l app.kubernetes.io/component=profiler --tail=100 --previous 2>/dev/null
Cause A — Kernel too old for eBPF
The eBPF unwinder needs a recent enough kernel (~5.4+ is safe; 4.18 is the minimum for some features). On the failed pod's node:
NODE=$(kubectl get pod -n <NS> -l app.kubernetes.io/component=profiler -o jsonpath='{.items[0].spec.nodeName}')
kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.kernelVersion}'
If the kernel is older than 5.4, recommend a node OS upgrade. There's no in-skill workaround.
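The comparison against 5.4 can be done mechanically on the string that kubectl returns. The sketch below is illustrative only: kernel_check is a hypothetical helper, and it assumes the version string starts with major.minor as in typical distro kernels (for example 5.15.0-91-generic).

```shell
# Print "ok" if the kernel version is at least 5.4, otherwise "too-old".
# Accepts strings as reported in .status.nodeInfo.kernelVersion,
# e.g. "5.15.0-91-generic" or "4.18.0-513.el8.x86_64".
# kernel_check is a hypothetical helper name.
kernel_check() {
  ver=${1%%-*}              # drop the distro suffix after the first "-"
  major=${ver%%.*}          # text before the first "."
  rest=${ver#*.}
  minor=${rest%%.*}         # text between the first and second "."
  minor=${minor%%[!0-9]*}   # tolerate trailing non-digits such as "+"
  if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "${minor:-0}" -ge 4 ]; }; then
    echo ok
  else
    echo too-old
  fi
}

# Usage (needs kubectl access):
#   kernel_check "$(kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.kernelVersion}')"
```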
Cause B — CAP_SYS_ADMIN stripped
Pod Security Admission, OPA Gatekeeper, Kyverno, or PSP-style admission controllers can strip required capabilities. Look in describe output for:
- pod has invalid spec: capabilities not allowed
- Admission webhook denials mentioning SYS_ADMIN
- seccomp denials
Fix: get the cluster admin to add an exception for the profiler namespace, or run the agent in a less-restricted namespace.
Cause C — /sys/kernel/debug not mounted on the host
Some hardened distros (Talos, Bottlerocket, some GKE Sandbox configurations) don't expose /sys/kernel/debug to privileged pods. Logs will mention failed to open /sys/kernel/debug/tracing or similar.
Fix: either remount debugfs on the host (sudo mount -t debugfs none /sys/kernel/debug — node-side, not in skill), or accept that CPU profiling won't work on those nodes and nodeSelector-exclude them.
Cause D — SELinux / AppArmor denying BPF
RHEL/CentOS with SELinux in enforcing mode can block BPF syscalls. Logs may show Operation not permitted on syscalls.
Fix: ask the cluster admin to add a SELinux policy exception, or relabel the profiler with system_u:object_r:container_runtime_exec_t:s0. This is org-specific — surface and stop.
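Causes B through D can be told apart by the strings quoted above. The following is a rough triage sketch, not an exhaustive classifier: classify_crash is a hypothetical helper, it only matches the phrases mentioned in this section, and real agent output may word things differently. Anything it cannot match falls back to the kernel check in Cause A.

```shell
# Read describe/log output on stdin and print the most likely cause letter.
# Only the phrases quoted in this section are matched; classify_crash is a
# hypothetical helper name.
classify_crash() {
  text=$(cat)
  case $text in
    *"capabilities not allowed"*|*SYS_ADMIN*|*seccomp*)
      echo "B" ;;   # capability or seccomp denial by an admission controller
    *"/sys/kernel/debug"*)
      echo "C" ;;   # debugfs not exposed to the pod
    *"Operation not permitted"*)
      echo "D" ;;   # SELinux / AppArmor blocking BPF syscalls
    *)
      echo "unknown" ;;   # nothing matched: check the kernel version (Cause A)
  esac
}

# Usage (needs kubectl access):
#   kubectl logs -n <NS> -l app.kubernetes.io/component=profiler --previous | classify_crash
```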
ImagePullBackOff
Claude runs
kubectl describe pod -n <NS> -l app.kubernetes.io/component=profiler 2>/dev/null \
| grep -A2 -iE 'failed.*pull|imagepullbackoff|errimagepull|unauthorized'
Three usual causes:
| Sign in events | Cause | Fix |
|---|---|---|
| unauthorized: authentication required | Pull secret missing or wrong | kubectl get secret -n <NS> to find pull secret; set global.registry.username + password via --set on helm upgrade --install --reset-then-reuse-values --atomic |
| not found / manifest unknown | Tag doesn't exist | Verify at https://hub.docker.com/u/zystemio or https://github.com/orgs/zystem-io/packages; fix profiler.image.tag or services.common.imageTag |
| connection refused / DNS failure / private registry hostname | Air-gapped, image not mirrored | Mirror ghcr.io/zystem-io/zymtrace-pub-profiler:<VERSION> into the customer's registry; set global.imageRegistry. See install-zymtrace-profiler reference.md § Air-gapped install. |
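The table can be applied to pod events with a small matcher. This is a sketch under assumptions: classify_pull_error is a hypothetical helper, and "no such host" is assumed to be how a DNS failure shows up in the event text; adjust the patterns to whatever the container runtime actually prints.

```shell
# Read pod events on stdin and print the likely ImagePullBackOff cause.
# classify_pull_error is a hypothetical helper; "no such host" is an assumed
# wording for DNS failures.
classify_pull_error() {
  text=$(cat)
  case $text in
    *unauthorized*)
      echo "pull secret missing or wrong" ;;
    *"manifest unknown"*|*"not found"*)
      echo "tag does not exist" ;;
    *"connection refused"*|*"no such host"*)
      echo "image not mirrored or registry unreachable" ;;
    *)
      echo "unclassified" ;;
  esac
}

# Usage (needs kubectl access):
#   kubectl describe pod -n <NS> -l app.kubernetes.io/component=profiler | classify_pull_error
```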
OOMKilled / restart cycle
Claude runs
kubectl describe pod -n <NS> -l app.kubernetes.io/component=profiler \
| grep -A2 -iE 'oomkilled|exit code: 137|last state'
kubectl top pods -n <NS> -l app.kubernetes.io/component=profiler 2>/dev/null
If Last State: Terminated with Reason: OOMKilled (or exit code 137), the agent's memory limit is too low for the workload it's seeing. Common drivers:
- High GPU kernel-launch rate — every CUDA kernel costs ~25 µs of profiler CPU + heap. >10k kernels/s sustained needs more memory.
- PC sampling enabled on a hot workload — SASS-level data is much larger than aggregated kernel stats.
- --samples-per-second too aggressive: default is 19 Hz; bumping to 99 Hz raises CPU profile volume roughly fivefold (99/19 ≈ 5.2).
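To put the kernel-launch figure in perspective, the per-kernel cost can be turned into a core count. This is back-of-the-envelope arithmetic only: it takes the ~25 µs figure above at face value as CPU time per kernel, and profiler_cpu_cores is a hypothetical helper name.

```shell
# Estimate how many CPU cores the profiler spends on kernel-launch tracking,
# assuming ~25 microseconds of profiler CPU per CUDA kernel (figure quoted
# in the text above). profiler_cpu_cores is a hypothetical helper name.
profiler_cpu_cores() {
  # cores = kernels per second * 25 µs, expressed in seconds of CPU per second
  awk -v rate="$1" 'BEGIN { printf "%.2f\n", rate * 25 / 1000000 }'
}

# Usage:
#   profiler_cpu_cores 10000
```

Under that assumption, 10,000 kernels/s works out to about a quarter of a core, and 40,000 kernels/s to a full core, before any memory overhead is counted.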
Fix: bump resource limits in the canonical values file, then re-apply.
profiler:
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2000m"
      memory: "2Gi"
Back up the canonical file first per shared/conventions.md, then:
helm upgrade --install <REL> zymtrace/profiler \
--namespace <NS> -f <values-file> \
--reset-then-reuse-values --atomic --debug
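The backup step might look like the sketch below. The actual backup policy lives in shared/conventions.md, which is not reproduced here, so the timestamped .bak naming scheme and the backup_values helper are assumptions for illustration only.

```shell
# Copy a values file to a timestamped sibling and print the backup path.
# The naming scheme (<file>.bak.<timestamp>) is an assumption; follow
# shared/conventions.md if it specifies something different.
backup_values() {
  src=$1
  [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
  dest="$src.bak.$(date +%Y%m%d-%H%M%S)"
  cp -p "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage:
#   backup_values <values-file>
```

Printing the backup path makes it easy to record in the conversation which file to restore if the upgrade has to be rolled back.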
If OOMKills persist after bumping to 2Gi/2 CPU, ask: is PC sampling on? Recommend turning it off in steady state and enabling on-demand for debugging only.
Full sizing reference: install-zymtrace-profiler reference.md § Resource sizing.
No GPU profiles
Agent is running, CPU profiles appear in the UI, but no GPU/CUDA traces. The agent is healthy — the workload isn't loading the CUDA injection library.
Claude runs
POD=$(kubectl get pods -n <NS> -l app.kubernetes.io/component=profiler -o name | head -1)
kubectl logs -n <NS> "$POD" --tail=2000 | grep -E 'Intercepted.*zymtrace.*implant' | head -3
If the search returns lines like:
level=info msg="Intercepted zymtrace implant at /proc/576236/root//var/lib/zymtrace/profiler/libzymtracecudaprofiler.so"
…the agent is intercepting workloads. The issue is downstream (backend ingest / ClickHouse). Hand off to troubleshoot-zymtrace-backend.
If the search is empty, no workload is being GPU-profiled. Walk the workload-side checks:
- cudaProfiler.enabled: true in the agent's values? helm get values <REL> -n <NS> | grep -A1 cudaProfiler. If not, enable it and helm upgrade --install ... --reset-then-reuse-values --atomic.
- Library on the host? kubectl debug node/<gpu-node> -it --image=busybox -- ls /host/var/lib/zymtrace/profiler (needs node-debugger RBAC). Should contain libzymtracecudaprofiler.so. Absent → the agent didn't extract it; restart the DaemonSet pod on that node.
- Workload has the env var? kubectl get pod <gpu-workload> -n <ws-ns> -o jsonpath='{.spec.containers[*].env[?(@.name=="CUDA_INJECTION64_PATH")]}'. Empty → patch the workload's pod spec.
- Workload has the volume mount? Same pod, look for mountPath: /var/lib/zymtrace/profiler or /opt/zymtrace/profiler. Missing → patch the workload spec.
- Workload started before the agent? Restart the workload after the DaemonSet is Ready on its node.
Workload-side recipes (vLLM, SGLang, Triton, PyTorch / training jobs): install-zymtrace-profiler reference.md § Profiling real workloads.
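When the Intercepted lines do appear, the /proc/<pid> component shows which processes were hooked, which helps confirm that the expected workload is among them. A minimal sketch, assuming the log format matches the sample line in this section; intercepted_pids is a hypothetical helper, and the PIDs are presumably in whatever PID namespace the agent sees.

```shell
# Read agent logs on stdin and print the unique PIDs from
# "Intercepted zymtrace implant at /proc/<pid>/..." lines.
# intercepted_pids is a hypothetical helper name.
intercepted_pids() {
  sed -n 's|.*Intercepted zymtrace implant at /proc/\([0-9][0-9]*\)/.*|\1|p' | sort -u
}

# Usage (needs kubectl access):
#   kubectl logs -n <NS> "$POD" --tail=2000 | intercepted_pids
```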
NVML library not found
In the agent logs:
Failed to load NVML library
Could not find libnvidia-ml.so
The profiler can't find libnvidia-ml.so to collect GPU metrics. (CUDA profiling itself doesn't depend on NVML — kernel-level profiling will still work; only the GPU metrics path is broken.)
Claude runs
POD=$(kubectl get pods -n <NS> -l app.kubernetes.io/component=profiler -o name | head -1)
kubectl exec -n <NS> "$POD" -- find / -name 'libnvidia-ml*' 2>/dev/null | head -5
If the search returns nothing, the host doesn't have the NVIDIA driver mounted into the pod — install / fix the NVIDIA device plugin or GPU operator first.
If the search returns a path, set it explicitly in the values file:
profiler:
  args:
    - "-collection-agent=..."
    - "--enable-gpu-metrics"
    - "--nvml-path=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1"  # use the path found above
Drop --nvml-auto-scan once you have the path — it's only useful for first-install discovery. Re-apply via helm upgrade --install ... --reset-then-reuse-values --atomic.
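The find output often lists several files (an unversioned symlink, the .so.1 soname, and a fully versioned library). The sketch below picks one and formats it as the --nvml-path argument. Preferring the .so.1 entry is an assumption based on the example path above, not documented agent behavior, and nvml_arg is a hypothetical helper.

```shell
# Read candidate library paths on stdin (one per line) and print a
# --nvml-path argument. Prefers a path ending in libnvidia-ml.so.1 and
# otherwise falls back to the first path; returns 1 if there is no input.
# nvml_arg is a hypothetical helper name.
nvml_arg() {
  path=$(awk '/libnvidia-ml\.so\.1$/ { print; found = 1; exit }
              NR == 1 { first = $0 }
              END { if (!found && first != "") print first }')
  [ -n "$path" ] || return 1
  printf '%s\n' "--nvml-path=$path"
}

# Usage (needs kubectl access):
#   kubectl exec -n <NS> "$POD" -- find / -name 'libnvidia-ml*' 2>/dev/null | nvml_arg
```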
PC sampling produces nothing
The customer enabled PC sampling on a workload but the UI shows no SASS / stall data.
Claude runs
WORKLOAD_NS=<ws-ns>
WORKLOAD_POD=<workload-pod>
kubectl get pod "$WORKLOAD_POD" -n "$WORKLOAD_NS" -o jsonpath='{.spec.containers[*].securityContext.privileged}'
kubectl get pod "$WORKLOAD_POD" -n "$WORKLOAD_NS" -o jsonpath='{range .spec.containers[*].env[?(@.name=="ZYMTRACE_CUDAPROFILER__ENABLE_PC_SAMPLING")]}{.value}{"\n"}{end}'
Two requirements that must both hold:
- The workload container must have securityContext.privileged: true, or be started under sudo if it's bare-metal.
- ZYMTRACE_CUDAPROFILER__ENABLE_PC_SAMPLING=true must be set on the workload (not the agent).
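The two jsonpath results can be combined into a single verdict. This sketch assumes the literal value true is what the workload expects for the environment variable, and it treats the privileged requirement as met if any container in the pod reports true; pc_sampling_check is a hypothetical helper.

```shell
# Given the outputs of the two jsonpath queries, print "ok" or a
# comma-separated list of what is missing ("privileged", "env").
#   $1: space-separated securityContext.privileged values (may be empty)
#   $2: value of ZYMTRACE_CUDAPROFILER__ENABLE_PC_SAMPLING (may be empty)
# pc_sampling_check is a hypothetical helper name.
pc_sampling_check() {
  privileged=$1
  env_value=$2
  missing=""
  case " $privileged " in
    *" true "*) ;;                 # at least one container is privileged
    *) missing="privileged" ;;
  esac
  [ "$env_value" = "true" ] || missing="${missing:+$missing,}env"
  printf '%s\n' "${missing:-ok}"
}

# Usage: pass the outputs of the two kubectl commands above, e.g.
#   pc_sampling_check "$PRIV_OUTPUT" "$ENV_OUTPUT"
```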
If both look correct and PC sampling still doesn't work, check the NVIDIA driver setting on the node — ask the user to run:
# Node-side, hand to user
cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
If RmProfilingAdminOnly=1, NVIDIA's admin-only restriction is in effect. Options:
- Run the workload as root inside the privileged container (usually already the case).
- Lift the restriction host-wide via /etc/modprobe.d/ (driver reload required). Full procedure: install-zymtrace-profiler reference.md § PC sampling.
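If the user pastes back the contents of the params file, the setting can be read mechanically. A minimal sketch, assuming the file lists one parameter per line with the name and value separated by a colon or equals sign; admin_only_state is a hypothetical helper.

```shell
# Read /proc/driver/nvidia/params content on stdin and print
# "restricted", "unrestricted", or "unknown" (parameter not present).
# Accepts either "Name: value" or "Name=value" lines (format assumed).
# admin_only_state is a hypothetical helper name.
admin_only_state() {
  awk -F'[:=]' '
    /RmProfilingAdminOnly/ {
      v = $2
      gsub(/[ \t]/, "", v)
      if (v == "1") state = "restricted"; else state = "unrestricted"
    }
    END { print (state == "" ? "unknown" : state) }'
}

# Usage (node-side, run by the user):
#   admin_only_state < /proc/driver/nvidia/params
```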
License rejected
Agent logs report:
- license invalid
- license expired
- forbidden
- unauthorized
Claude runs
POD=$(kubectl get pods -n <NS> -l app.kubernetes.io/component=profiler -o name | head -1)
kubectl logs -n <NS> "$POD" --tail=200 | grep -iE 'license|forbidden|unauthorized' | head -10
Three usual causes:
| Symptom | Cause | Fix |
|---|---|---|
| license expired | License is past expiry | Renew via support@zymtrace.com or new GPU trial at https://zymtrace.com/getstarted/; update the backend values file (license lives there, not on the agent) and helm upgrade --install <backend-REL> ... --reset-then-reuse-values --atomic |
| forbidden / unauthorized from gateway | Backend has auth.serviceToken.enabled: true; agent missing --auth-token | Add profiler.args: - --auth-token=$ZYMTRACE_AUTH_TOKEN to the agent values; back up the canonical file first; re-apply |
| license invalid immediately | Wrong license format (truncated JWT, copy-paste corruption) | Re-fetch the license, update backend values, re-apply |
The license is validated at the backend gateway, not on the agent — the agent only sees the result. If license fixes don't take effect, also check the backend pods (troubleshoot-zymtrace-backend § License errors).
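The table above maps log phrases to fixes, which can be sketched as a matcher over the grep output. It only knows the four phrases listed in this section and matches them case-insensitively, as the grep does; classify_license_error is a hypothetical helper and the messages it prints are shorthand for the table's Fix column.

```shell
# Read agent log lines on stdin and print where the fix belongs.
# Only the phrases listed in this section are recognized;
# classify_license_error is a hypothetical helper name.
classify_license_error() {
  text=$(tr '[:upper:]' '[:lower:]')   # match case-insensitively, like grep -i
  case $text in
    *"license expired"*)
      echo "backend: renew license" ;;
    *"license invalid"*)
      echo "backend: re-fetch license" ;;
    *forbidden*|*unauthorized*)
      echo "agent: add --auth-token" ;;
    *)
      echo "unclassified" ;;
  esac
}

# Usage (needs kubectl access):
#   kubectl logs -n <NS> "$POD" --tail=200 | classify_license_error
```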
Done
Exit when:
- Agent pods are Running on every targeted node (kubectl get ds -n <NS> <PREFIX>-profiler shows DESIRED == READY).
- Recent agent logs (kubectl logs ... --tail=50) show either license is valid until ... OR streaming connection established and no connection refused / forbidden / OOM-related errors.
- If GPU profiling enabled: the Intercepted zymtrace implant line appears in agent logs after a representative workload starts.
- Customer confirms profiles are visible in the UI (or, if backend-side issue is suspected, hand off to troubleshoot-zymtrace-backend).
Recap which step fixed it so future-them can pattern-match.
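The DESIRED == READY check from the exit criteria can be scripted against the DaemonSet listing. This sketch assumes the default kubectl column order (NAME, DESIRED, CURRENT, READY, ...) with headers suppressed; ds_ready is a hypothetical helper.

```shell
# Read one `kubectl get ds --no-headers` line on stdin and print "ready"
# when DESIRED (column 2) equals READY (column 4) and is non-zero,
# otherwise "not-ready: <ready>/<desired>".
# ds_ready is a hypothetical helper name.
ds_ready() {
  awk '{
    if ($2 == $4 && $2 > 0) print "ready"
    else print "not-ready: " $4 "/" $2
  }'
}

# Usage (needs kubectl access):
#   kubectl get ds -n <NS> <PREFIX>-profiler --no-headers | ds_ready
```

A DaemonSet with zero desired pods is reported as not ready on purpose, since that usually means a nodeSelector or toleration is excluding every node.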
Security constraints
- Never disable PC sampling, license validation, or service-token auth as a "fix" — those are symptoms, not the problem. Surface the underlying issue.
- Never modify the canonical values file without taking a backup first per shared/conventions.md.
- Never run kubectl delete pod / kubectl delete ds without explicit user confirmation, even if it would "force a restart".
- Never helm upgrade without --reset-then-reuse-values --atomic.
- Never assume node-shell / SSH access. Use kubectl exec into pods, kubectl debug node/... (with appropriate RBAC), or hand the command to the user. Same rule as the install skills.
- Never declare "fixed" without re-running the relevant check in the Done checklist.