name: install-zymtrace-backend
description: |
  Use when installing the zymtrace backend (the AI optimization platform that ingests CPU/GPU profiling data). Covers Kubernetes (Helm) and single-node Docker Compose. Handles license setup, choosing in-cluster vs external ClickHouse/Postgres/object storage, ingress with gRPC and TLS, and air-gapped installs via a custom image registry. Trigger phrases: "install zymtrace", "install zymtrace backend", "deploy zymtrace", "set up zymtrace on kubernetes", "set up zymtrace on EKS / GKE / AKS / on-prem", "helm install zymtrace", "docker compose zymtrace", "stand up the zymtrace platform", "first zymtrace install", "deploy the backend services".
Install zymtrace Backend
Helps the user install the zymtrace backend — gateway, ingest, web, symdb, identity, UI, migrate, plus data stores (ClickHouse, Postgres, S3-compatible object storage).
The profiler agent is a separate install (install-zymtrace-profiler skill). This skill covers only the backend that receives data.
Deep details, secret-creation commands, Docker Compose, and air-gapped live in ${CLAUDE_PLUGIN_ROOT}/skills/install-zymtrace-backend/reference.md — read it when the decision tree points you there.
Greet the user (start here)
Before any commands or questions, open with a warm welcome. Adapt to context, but always cover: a thank-you, the support channels, and a quick map of what's coming. Sample:
👋 Thanks for choosing zymtrace! I'll walk you through installing the backend — the platform that ingests your CPU/GPU profiling data, surfaces optimization insights, and serves the UI.
If you get stuck at any point, reach out:
- Community Slack: https://join.slack.com/t/zymtrace/shared_invite/zt-3fdidjufl-q~NHxDzQlzal2B9mujfaoQ
- Email: support@zymtrace.com
- Sign up / GPU trial license: https://zymtrace.com/getstarted/
Tip: analyze GPU and CPU flamegraphs via MCP. Once zymtrace is running, connect the zymtrace MCP to your agent (Claude Code, Codex, or Cursor; see configure-zymtrace-mcp) and analyze GPU + CPU flamegraphs in natural language. Docs: https://docs.zymtrace.com/mcp
Here's the plan:
- Verify your tools (helm, kubectl) and resolve cluster / namespace.
- Check whether you already have a values file. If not, we'll build one together.
- Install, verify, then start a quick port-forward so you can see the UI right away.
- Decide on long-term exposure (NodePort / ALB / NGINX / Cloud LB).
- Hand off to the profiler install so you have data to look at.
Ready when you are — let's start.
Trim the greeting if the user has already given you specifics (cluster, values file, target version). Always include the support links once per session.
Sources of truth (never invent keys)
- Live chart values.yaml (every key is documented inline): https://raw.githubusercontent.com/zystem-io/zymtrace-charts/main/charts/backend/values.yaml
- Docs: https://docs.zymtrace.com/install/backend/helm-docker (no public source repo; fetch URLs)
- Full URL map: shared/references.md
Pre-flight: verify the tools
Do not run any install command until both binaries are confirmed installed.
Claude runs
helm version --short && kubectl version --client
kubectl cluster-info | head -2
If helm is missing → point user to https://helm.sh/docs/intro/install/. Do not offer to install it. Same rule for kubectl → https://kubernetes.io/docs/tasks/tools/. If kubectl cluster-info fails, ask the user to set their kubeconfig.
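The pre-flight rule above can be sketched as a small shell helper. The function names missing_tools and preflight are illustrative, not part of the skill; the sketch only reports what is missing and never attempts an install, in line with the rule above.

```shell
# Print the name of every required tool that is not on PATH, one per line.
# Prints nothing when all tools are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report missing tools on stderr and return 1 so the caller can stop
# before running any install command.
preflight() {
  missing=$(missing_tools helm kubectl)
  if [ -n "$missing" ]; then
    printf 'Missing required tools:\n%s\n' "$missing" >&2
    return 1
  fi
}

# Usage:
#   preflight || echo "resolve the missing tools before continuing"
```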
Check for a customer-provided values file
Before walking the decision tree, ask:
Did zymtrace send you a values file (often named custom-values.yaml, backend-values.yaml, or <company>-values.yaml)? It usually contains your license, DB modes, and other pre-agreed settings.
If yes → read it, skip any decision-tree question whose answer is already set in the file, and go directly to Step 4 with -f <their-file>. Full policy: shared/conventions.md § Customer-provided values file.
If no → walk the decision tree below; at the end, write the result to ./zymtrace-custom-values.yaml (the default canonical filename used in Steps 2 and 6) and tell them to commit it to source control.
Pre-resolve what you can
Before asking the user any question, resolve from the environment. Only ask when a check fails or returns ambiguous output.
Namespace + release name. Recommend zymtrace / backend. These are the defaults in every doc and example; keep them unless the user has a specific reason to deviate. If a release already exists on the cluster, use its namespace and name. Full policy: shared/conventions.md. Commands below use <NS> and <REL> as placeholders; resolve before running.
| Variable | Resolve by |
|---|---|
| Current cluster context | kubectl config current-context |
| Existing zymtrace release? | helm list -A \| grep -i zymtrace |
| Latest chart version | helm search repo zymtrace/backend --versions \| head -3 (after helm repo add) |
| Ingress controllers present | kubectl get ingressclass |
| Metrics-server (HPA prereq) | kubectl top nodes — returns metrics if installed, errors Metrics API not available if not. On EKS, do not confuse with v1.metrics.eks.amazonaws.com (EKS extension API, doesn't satisfy HPA). |
| Default storage class | kubectl get sc |
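As a sketch of the "existing release" check in the table above, the helper below turns helm list -A output into namespace/release pairs. The function name find_zymtrace_releases is illustrative, and it assumes Helm's default table layout, where NAME is the first column and NAMESPACE the second. Like the grep it mirrors, it only finds rows that mention zymtrace somewhere on the line.

```shell
# Read `helm list -A` table output on stdin and print "<namespace>/<release>"
# for each data row that mentions zymtrace (case-insensitive).
# Assumes the default column order: NAME is column 1, NAMESPACE is column 2.
# The first line is skipped because it is the table header.
find_zymtrace_releases() {
  awk 'NR > 1 && tolower($0) ~ /zymtrace/ { print $2 "/" $1 }'
}

# Usage:
#   helm list -A | find_zymtrace_releases
```

If this prints a pair, reuse that namespace and release name rather than the recommended defaults.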
Things you must ask:
- Tier: free CPU-only / GPU trial / paid (licensing).
- DB modes for each of ClickHouse / Postgres / object storage.
- Domain + TLS arrangement (if production).
- Auth type and IdP details if OIDC.
- Scale tier (agent count + retention).
- Private/air-gapped registry?
Blockers vs recommendations (don't conflate)
Tone matters. Frame findings precisely so the customer doesn't think they have an outage when they don't.
Blockers (stop the install, surface, ask the user to resolve first):
- helm or kubectl not installed.
- kubectl cluster-info fails.
- Referenced Kubernetes secret missing (chart will fail).
- Values file fails helm template (schema error / invalid YAML).
- Previous release is stuck in an in-progress state.
Recommendations (note in one short line, proceed):
- Metrics-server not installed with hpa.enabled: true → install succeeds, HPAs just won't scale. Customer can install metrics-server later.
- License key inline in values file → fine for dev/PoC. Recommend licenseKeySecretName for prod, don't block.
- auth.type: none → trusted-network installs run this way on purpose.
- No ingress configured → ClusterIP + port-forward is a valid temporary state; the expose skill adds ingress whenever the customer is ready.
Rule of thumb: if the operation will succeed but something is suboptimal, that's a recommendation, not a blocker. Phrase it as one short line, not a multi-bullet alarm.
Decision tree
Walk these in order. Don't guess defaults — fetch the chart values.yaml when uncertain.
1. Platform
| Answer | Path |
|---|---|
| Single node / eval / laptop | → see reference.md § Docker Compose install. Stop here, don't continue this tree. |
| Kubernetes (prod, staging, on-prem, GKE/EKS/AKS) | → continue. |
If unsure, default to Kubernetes — Docker Compose has no HA/ingress story.
2. License
- Free CPU-only tier needs no key.
- GPU trial / Paid: ask where to put it (inline vs Kubernetes secret). Recommend secret for prod.
- Where to get a trial key, tier table, and placement details: reference.md § License placement.
3. Databases
zymtrace needs ClickHouse, Postgres, and S3-compatible object storage. Each has a mode:
| Mode | When to use |
|---|---|
| create (default) | In-cluster, chart-managed. Fastest path. Fine for dev/PoC, viable for small prod. |
| use_existing | External (ClickHouse Cloud, on-prem CH, RDS, AWS S3, GCS, MinIO). Pick this if the org already runs managed DBs. |
| aws_aurora / gcp_cloudsql | Postgres only. IAM-authenticated. Requires IRSA (Aurora) or Workload Identity (CloudSQL). |
Ask which mode per service. Assemble from values/k8s-external-dbs.yaml.
⚠️ Postgres secure: true is needed for TLS-enforced managed Postgres. ClickHouse use_existing.host MUST be the scheme + HTTP port (https://host:8443), not native 9000. Only the HTTP interface is supported.
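The ClickHouse host rule above lends itself to a quick sanity check before the value goes into the values file. This is a minimal sketch: valid_clickhouse_host is a hypothetical helper, and it checks only what the warning states (a scheme, an explicit port, and not the native port 9000). It does not contact the server.

```shell
# Return 0 when a ClickHouse host value looks like scheme + host + HTTP port,
# 1 otherwise. Rejects a missing scheme, a missing port, and native port 9000.
valid_clickhouse_host() {
  case "$1" in
    # Native-protocol port: rejected even when a scheme is present.
    http://*:9000|https://*:9000|http://*:9000/*|https://*:9000/*) return 1 ;;
    # Scheme, at least one host character, then ":" and a port digit.
    http://?*:[0-9]*|https://?*:[0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   valid_clickhouse_host "https://host:8443" || echo "fix use_existing.host first"
```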
4. Network exposure
Two paths:
- Install without exposure → leave ingress.enabled: false (default). Gateway becomes ClusterIP-only; the user can reach it via kubectl port-forward for verify, then add exposure later.
- Install with exposure already configured → hand off to the expose-zymtrace-backend skill for the exposure decision (NodePort / LoadBalancer / NGINX Ingress / ALB Ingress). That skill will edit the same canonical values file in place before this install proceeds.
Either way, exposure can be added or changed at any time via the expose skill — it doesn't have to be locked in at install.
5. Auth
- none: open access; trusted networks only.
- local: built-in user/password + admin user; requires Ed25519 signing keys.
- oidc: Google / Okta / Auth0 / Azure AD. Need clientId, clientSecret, issuerUri, registered redirect URI.
Don't propose basic: it is deprecated and being removed.
Prod default: oidc if they have an IdP, else local with admin password from a secret.
6. Scale
How many agents will report, and what retention? Drives ClickHouse sizing more than anything else.
| Scale | Template |
|---|---|
| < 20 agents, < 14d | k8s-minimal.yaml. Defaults fine. |
| 20–100 agents, mixed CPU/GPU, 30d | values/k8s-large-scale.yaml. Bumped CH (500Gi, 2–6 CPU, up to 16Gi mem), tuned probes, HPA scale-down. |
| 100+ agents, multi-region, long retention | Combine k8s-large-scale.yaml + k8s-external-dbs.yaml. Externalize ClickHouse. |
Storage rule of thumb: ~5Gi per agent per 30 days for mixed CPU+GPU. GPU-heavy agents produce ~5–8× the events of CPU-only.
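The storage rule of thumb can be turned into a quick estimate. The sketch below applies only the mixed CPU+GPU figure (about 5Gi per agent per 30 days) and rounds up; estimate_storage_gi is an illustrative name, and the result is a planning number, not a sizing guarantee. It does not try to model the GPU-heavy multiplier, since that figure describes event counts rather than bytes.

```shell
# Estimate ClickHouse storage in Gi from the rule of thumb:
# about 5 Gi per agent per 30 days of retention (mixed CPU+GPU).
# Integer arithmetic, rounded up to the next whole Gi.
estimate_storage_gi() {
  agents=$1
  retention_days=$2
  echo $(( (agents * 5 * retention_days + 29) / 30 ))
}

estimate_storage_gi 100 30   # prints 500
estimate_storage_gi 20 14    # prints 47
```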
7. Air-gapped / private registry
If mentioned: see reference.md § Air-gapped install. Mirror zymtrace-pub-backend, zymtrace-pub-ui (plus DB images if mode: create), then set global.imageRegistry + global.appImageRegistry.
Kubernetes install (Helm)
Prerequisites
- Kubernetes 1.20+, Helm 3.x.
- Metrics Server (only if HPA enabled; verify with kubectl top nodes, NOT just by listing APIServices on EKS).
- CNI with NetworkPolicy enforcement (Calico/Cilium/Weave). Plain Flannel → set services.activateNetworkPolicies: false.
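Version prerequisites are easy to get wrong with string comparison (3.9 sorts after 3.14 as text), so a numeric check may help. This is a sketch; version_at_least is a hypothetical helper that compares only MAJOR.MINOR. One assumption worth flagging: to my knowledge the --reset-then-reuse-values flag used in Step 4 was introduced in Helm 3.14, so checking Helm against 3.14 rather than 3.0 is likely the safer floor.

```shell
# Return 0 when version "$1" (MAJOR.MINOR[.PATCH...], optional leading "v")
# is at least MAJOR.MINOR "$2". Only major and minor are compared.
version_at_least() {
  have=${1#v}; want=${2#v}
  have_major=${have%%.*}; have_rest=${have#*.}; have_minor=${have_rest%%.*}
  want_major=${want%%.*}; want_rest=${want#*.}; want_minor=${want_rest%%.*}
  [ "$have_major" -gt "$want_major" ] && return 0
  [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# Usage (version strings as printed by the tools, e.g. v1.27.4 or v3.14.2):
#   version_at_least "v1.27.4" "1.20" || echo "Kubernetes too old"
#   version_at_least "v3.14.2" "3.14" || echo "Helm too old for --reset-then-reuse-values"
```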
Step 1: Add the Helm repo
Claude runs
helm repo add zymtrace https://helm.zystem.io
helm repo update
helm search repo zymtrace/backend --versions | head -5
ERROR: not a valid chart repository → network/proxy issue, or air-gapped. Switch to the air-gapped path.
Step 2: Pick a values template and customize
Pick one of these install bundles as the starting point for the customer's canonical values file:
- values/k8s-minimal.yaml: chart-managed DBs, NodePort, no auth. Fastest path for dev.
- values/k8s-external-dbs.yaml: external ClickHouse + Postgres + S3 (incl. Aurora / CloudSQL).
- values/k8s-large-scale.yaml: 100+ agents, 30d retention, bumped CH, tuned probes.
Copy to the canonical filename (e.g. zymtrace-custom-values.yaml if the customer doesn't have one) and edit placeholders. Need ingress now? Run the expose-zymtrace-backend skill — it'll edit the same canonical file in place — then come back to Step 3 here.
Step 3: Create secrets
What you need to do in a terminal
Secrets must be created by the user — never write license keys, OIDC client secrets, admin passwords, or DB passwords into the conversation or values files.
Create only the secrets your values file references. Command reference: reference.md § Creating secrets.
Typical minimum for prod: zymtrace-license, plus one of oidc-creds / zymtrace-admin, plus zymtrace-signing-keys (auth=local/oidc).
The chart does not pre-check that referenced secrets exist — missing secrets surface as CrashLoopBackOff in Step 5.
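Because the chart does not pre-check secrets, a quick existence check before Step 4 can catch the problem earlier. The sketch below is an assumption-level helper (missing_secrets is not part of the skill); it only reads secret metadata with kubectl get and never prints or creates secret values. The secret names in the usage line are the typical ones listed above and should be replaced with whatever the values file actually references.

```shell
# Print each secret name that does not exist in the namespace, one per line.
# Usage: missing_secrets <namespace> <secret-name>...
missing_secrets() {
  ns=$1; shift
  for name in "$@"; do
    kubectl get secret "$name" -n "$ns" >/dev/null 2>&1 || printf '%s\n' "$name"
  done
}

# Usage:
#   missing_secrets <NS> zymtrace-license oidc-creds zymtrace-signing-keys
```

Any name printed here is a blocker: ask the user to create that secret before running the install.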
Step 4: Install
<REL> and <NS> are placeholders for the resolved release name + namespace. Recommended defaults: backend / zymtrace. Use them unless the user has a specific reason to deviate. Substitute before running.
Claude runs
helm upgrade --install <REL> zymtrace/backend \
--namespace <NS> --create-namespace \
-f <values-file>.yaml \
--reset-then-reuse-values \
--atomic --debug
--reset-then-reuse-values is mandatory on every run for this chart, including first install. Why: reference.md § Why always reset-then-reuse-values.
--atomic rolls back on failure (avoids half-deployed state). --debug prints rendered manifests.
ERROR: failed pre-install: timed out → usually a missing secret. Run kubectl get events -n <NS> --sort-by=.lastTimestamp | tail -20, fix the secret per Step 3, re-run.
ERROR: another operation … in progress → a previous helm op is stuck. helm history <REL> -n <NS>; resolve with helm rollback or helm uninstall (destructive — confirm with the user first).
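To tell a stuck release from a merely failed one, the release status can be classified before choosing between rollback and uninstall. This is a minimal sketch: release_in_progress is an illustrative helper, and it treats the three pending-* statuses as "in progress", which as far as I know are the states behind the "another operation … in progress" error.

```shell
# Return 0 when a Helm release status means an operation is still pending,
# 1 for any other status (deployed, failed, superseded, ...).
release_in_progress() {
  case "$1" in
    pending-install|pending-upgrade|pending-rollback) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: pass the STATUS value of the newest row from `helm history <REL> -n <NS>`.
#   release_in_progress "pending-upgrade" && echo "stuck: confirm rollback with the user"
```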
Step 5: Verify
Claude runs
bash ${CLAUDE_PLUGIN_ROOT}/skills/install-zymtrace-backend/scripts/verify-backend.sh <NS> <REL>
If the user has overridden global.namePrefix, also pass it: PREFIX=<value> bash ${CLAUDE_PLUGIN_ROOT}/skills/install-zymtrace-backend/scripts/verify-backend.sh <NS> <REL>.
Runs helm status, dumps kubectl get for pods/jobs/svc/ingress/hpa, dumps logs for each backend service, describes any non-Running pod. Use the Done checklist below as exit criteria — do not declare success until every box checks.
Step 6: Persist the canonical values file
Claude runs
# Back up first if the file already exists (won't on a true greenfield install)
[ -f <values-file> ] && cp <values-file> <values-file>.bak.$(date +%Y%m%d-%H%M%S)
# -o yaml keeps the "USER-SUPPLIED VALUES:" header line out of the file
helm get values <REL> -n <NS> -o yaml > <values-file>
<values-file> is whatever name the customer gave you in §0 / Pre-resolve (e.g. acme-zymtrace.yaml). If they didn't provide one, default to zymtrace-custom-values.yaml — that's now their canonical file. See shared/conventions.md for the rules on respecting customer filenames and backing up before writing.
If a backup was created, tell the user the path. Recommend they commit the new canonical file: git add <values-file> && git commit -m "zymtrace: capture <NS>/<REL> values after install".
Step 7: Quick-win port-forward (let them see the UI now)
Once Done is satisfied, offer to port-forward immediately — this is the fastest way to give the customer a visible result before they commit to a long-term exposure decision.
Want me to start a port-forward right now so you can open the UI? It runs in the background; you can switch to a permanent exposure (NodePort / LoadBalancer / Ingress) afterwards.
If yes:
Claude runs
kubectl port-forward -n <NS> svc/<PREFIX>-gateway 8080:80 > /tmp/zymtrace-pf.log 2>&1 &
echo "port-forward PID: $!"
sleep 2
curl -fsI http://localhost:8080 | head -1 # sanity check
Then tell them:
The UI is at http://localhost:8080. To stop the port-forward later: pkill -f 'kubectl port-forward.*<PREFIX>-gateway'. If port 8080 is taken, change it to 8081:80 (or any free local port).
If no, skip to Step 8.
ERROR: Unable to listen on port 8080: bind: address already in use → pick another local port. Re-run with 8081:80 etc.
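A fixed sleep before the sanity check can be too short on a slow cluster. One possible alternative is a small retry loop, sketched below; retry is a generic illustrative helper, not something the skill ships.

```shell
# Run a command until it succeeds, up to <attempts> times, sleeping <delay>
# seconds between tries. Returns 0 on the first success, 1 if every try failed.
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage: wait up to about 10 seconds for the port-forward to answer.
#   retry 10 1 curl -sI -o /dev/null http://localhost:8080
```

The usage line leaves out curl's -f flag on purpose, so any HTTP response counts as "the tunnel is up".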
Step 8: Ask how they want to expose the backend long-term
Always ask — don't assume port-forward is good enough.
| Option | When it fits |
|---|---|
| 1. NodePort | PoC, on-prem, no cloud LB available |
| 2. AWS ALB Ingress (HTTPS via ACM) | EKS — internal or internet-facing |
| 3. NGINX Ingress (TLS via cert-manager) | On-prem / non-AWS, or already running NGINX |
| 4. Cloud LoadBalancer (direct NLB/CLB) | Rare; ask what they want before suggesting |
If they prefer to stick with port-forward for now → skip to Step 9.
For any of 1–4 → hand off to the expose-zymtrace-backend skill, which edits the canonical values file in place and applies via helm upgrade --install. Ask them to provide the hostname themselves — don't suggest a specific pattern. Once that completes, return here for Step 9.
Step 9: Hand off to profiler install
The backend has no data until a profiler agent reports to it. Suggest the install-zymtrace-profiler skill next.
Done
Exit when ALL of the following are true (substitute <NS> / <REL> / <PREFIX>):
- helm status <REL> -n <NS> reports STATUS: deployed.
- All pods in the <NS> namespace are Running.
- Migration succeeded: either the <PREFIX>-migrate Job is at 1/1 succeeded, or the Job is absent (Helm pre-install hook auto-deletes on success). With helm status STATUS: deployed, absence is the expected state, not a failure.
- <PREFIX>-gateway service exists; if ingress.enabled=true, an Ingress object also exists.
- No license / auth / forbidden errors in kubectl logs deployment/<PREFIX>-ingest -n <NS> --tail=50.
- Gateway responds: curl -fsI http://<host> returns anything except connection-refused / timeout / 5xx.
If any box fails, hand off to the troubleshoot-zymtrace-backend skill, or use the verify-backend.sh output.
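The "gateway responds" criterion can be made explicit by classifying the HTTP status code. One caveat motivates the sketch: curl's -f flag exits non-zero on any response of 400 or above, so an auth-protected gateway answering 401 would look like a failure even though it is up. The helper below, with the illustrative name gateway_responding, uses curl's --write-out status code instead, where 000 means no HTTP response arrived at all.

```shell
# Return 0 when an HTTP status code shows the gateway answered.
# 000 (no response: refused / timeout) and 5xx count as failures;
# any 1xx-4xx code counts as "responding".
gateway_responding() {
  case "$1" in
    000|5??|'') return 1 ;;
    [1-4]??) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (<host> is whatever endpoint applies, e.g. localhost:8080 during a port-forward):
#   code=$(curl -sI -o /dev/null -w '%{http_code}' --max-time 5 "http://<host>")
#   gateway_responding "$code" && echo "gateway responds ($code)"
```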
Common pitfalls
- NGINX/Traefik missing backend-protocol: "GRPC" → agents can't push profiles; UI may still work.
- ALB set to GRPC backend → HTTP/1.1 clients get HTTP 464. Use HTTP + HTTP2. See reference.md.
- proxy-body-size unset / too small → symbol uploads fail silently.
- NodePort + OIDC → redirectUri must be set explicitly (chart can't auto-derive) and registered with the IdP.
- ClickHouse use_existing.host on native port 9000 → ingest crashes. Use HTTP 8123/8443.
- auth.admin.password=admin in prod → use passwordSecretName.
- HPA on without metrics-server (recommendation, not a blocker) → install succeeds; HPAs sit at <unknown>/80% and stay at minReplicas. Truth check: kubectl top nodes. To enable scaling later: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml. On EKS, v1.metrics.eks.amazonaws.com is not the metrics-server HPA needs; don't be fooled by it.
Security constraints
- Never write a raw license key, OIDC client secret, admin password, or DB password into any values file, session message, or commit. Use *SecretName/*SecretKey and create the secret imperatively with kubectl create secret.
- Never generate a Kubernetes Secret YAML manifest from this skill; always kubectl create secret generic … so values never land on disk.
- Never use namespace default for zymtrace resources. Always pass --namespace <NS> with an explicit (non-default) namespace.
- Never run helm uninstall, kubectl delete namespace, kubectl delete pvc, or any operation that drops persistent data without explicit user confirmation.
- Never propose auth.type: basic (deprecated).
- Never issue helm upgrade or helm upgrade --install for this chart without --reset-then-reuse-values, even on first install, even with -f. The flag is harmless on a fresh install and is the only guard against silent data loss on subsequent partial --set upgrades.
- Never disable TLS in production (--disable-tls is fine for dev/NodePort only).
- Never skip Step 5 verification: pods can be Running while ingest is silently rejecting profiles.