nvcf-self-managed-installation - SKILL.md Agent Skill

name: nvcf-self-managed-installation description: >- Install and deploy the nvcf-self-managed-stack helmfile bundle for NVCF self-hosted deployments. Covers clean control-plane installation, teardown, helm values overrides, image pull secrets, and debugging installation failures. Use when deploying, installing, reinstalling, tearing down, or configuring the NVCF self-managed control plane stack, or when the user mentions helmfile, self-managed, self-hosted, control plane installation, or nvcf-self-managed-stack. Do NOT use for local k3d development environments. For local NVCF self-hosted or self-managed cluster setup with k3d, use the local k3d development workflow instead. license: Apache-2.0 compatibility: Requires helmfile >= 1.1.0 < 1.2.0, helm >= 3.12, helm-diff plugin, kubectl matching cluster version author: "nvcf-core-eng nvcf-core-eng@exchange.nvidia.com" version: "1.0.0" tags: [nvcf, self-managed, helmfile, self-hosted, control-plane, installation, deployment, pull-secrets] tools: [Shell, Read, Edit, Grep, Glob] metadata: internal: false author: "nvcf-core-eng nvcf-core-eng@exchange.nvidia.com" version: "1.0" tags: [nvcf, self-managed, helmfile, self-hosted, control-plane, installation, deployment, pull-secrets] languages: [bash, yaml] frameworks: [helmfile, helm, kubectl] domain: cloud-infrastructure

NVCF Self-Managed Stack Operations

Operational guide for the nvcf-self-managed-stack helmfile bundle used to deploy the NVCF control plane.

Instructions

Use this skill for install, upgrade, or teardown work in nvcf-self-managed-stack; validate tooling first, follow the documented helmfile flow, and prefer targeted troubleshooting over ad hoc chart edits.

Prerequisites

Before any operation, ask the user for the path to their extracted nvcf-self-managed-stack directory. Verify it contains the expected structure:

ls <user-provided-path>/helmfile.d/ <user-provided-path>/environments/ <user-provided-path>/secrets/ <user-provided-path>/global.yaml.gotmpl

If the directory does not exist or is missing expected files, guide the user to download the stack package from NGC:

ngc registry resource download-version <org>/nvcf-self-managed-stack:<version>
tar xzf nvcf-self-managed-stack-<version>.tar.gz

All subsequent commands assume you are inside this directory.

Before You Start

Verify tooling and context before any operation:

helmfile --version   # Must be 1.1.x (1.2.0 removed sequential mode)
helm version         # Must be >= 3.12
helm plugin list     # Must include helm-diff >= 3.11
kubectl version      # Client must be within 1 minor version of cluster

All commands run from inside the extracted nvcf-self-managed-stack/ directory:

cd path/to/nvcf-self-managed-stack
ls helmfile.d/ environments/ secrets/ global.yaml.gotmpl

Identify your environment name -- it corresponds to environments/<name>.yaml and secrets/<name>-secrets.yaml.

For a commented example of an EKS environment file, see references/eks-example.yaml.

How Values Flow

Understanding value precedence prevents the most common configuration mistakes.

environments/base.yaml          (defaults)
    -> merged with
environments/<env>.yaml         (your overrides)
    -> consumed by
global.yaml.gotmpl              (Go template, constructs per-chart values)
    -> consumed by
secrets/<env>-secrets.yaml      (sensitive values)
    -> overridden by
release inline values: blocks   (highest precedence)

Critical: global.yaml.gotmpl only passes through specific keys to each chart (image registries, node selectors, storage, replica counts). Setting an arbitrary chart value in the environment file will be silently ignored if global.yaml.gotmpl does not propagate it.

To override arbitrary chart values, use a helmfile release values: block. See Overriding Helm Chart Values.

Clean Installation

1. Create namespaces and image pull secrets (if using a private registry)

for ns in cassandra-system nats-system nvcf api-keys ess sis vault-system cert-manager; do
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f -
done

# Only if pulling from a private registry (e.g., NGC nvcr.io)
export NGC_API_KEY="<your-key>"
for ns in cassandra-system nats-system nvcf api-keys ess sis vault-system cert-manager; do
  kubectl create secret docker-registry nvcr-creds \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY" \
    --namespace="$ns" \
    --dry-run=client -o yaml | kubectl apply -f -
done

If using pull secrets, you must also configure each helmfile release to reference them. See Image Pull Secrets.

2. Authenticate helm to your chart registry

Both docker login AND helm registry login are required for NGC. Helmfile uses helm (not docker) to pull OCI charts, so the helm credential store must be authenticated separately.

# NGC -- both commands are required
docker login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"
helm registry login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"

# ECR
aws ecr get-login-password --region <region> | \
  helm registry login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com

Note: helm registry login keeps only the last login per host (multi-org gotcha). The stack and the caches span several NGC orgs that need different keys, but they all live under the one host nvcr.io. A second helm registry login nvcr.io (or docker login) with a different key silently overwrites the first, so pulls for the first org then fail with a misleading 403 Access Denied (not a 401: the auth succeeded, it's the wrong-org key). If you pull charts/images from more than one org, either re-run the login with the correct key immediately before each pull, or pass --username '$oauthtoken' --password <key> per helm pull invocation instead of relying on the stored login. (Pulling everything from a single registry mirror avoids this entirely.)

2b. (Recommended) Pull from a registry the cluster authenticates to automatically

If you mirror the NVCF images into a registry that the cluster can pull from without an explicit image pull secret (for example, a registry where the node or workload identity is granted pull access automatically), point the stack at it and skip the credential steps above.

Point the stack at the mirror in environments/<env>.yaml:

global:
  image:
    registry: <your-registry>            # e.g. registry.example.com
    repository: <mirror-sub-path>
  helm:
    sources:                              # only if you also mirrored the helm charts
      registry: <your-registry>
      repository: <mirror-sub-path>

Skip the credential steps above. When the cluster authenticates to the registry automatically (the node or workload identity has pull access), the pull needs no Kubernetes secret, so you do not need: the per-namespace nvcr-pull-secret from step 1 (still create the namespaces, just not the secret), the global.imagePullSecrets entry, or the docker login/helm registry login nvcr.io from step 2.

One residual credential: nvcf-api still validates the user function image at function create via the account registry credential, so that credential must point at whichever registry holds the user image (set it to your mirror if the image is mirrored there).

3. Set the gateway address in your environment file

Your environment file (environments/<env>.yaml) requires global.domain to be set to the external address of your Envoy Gateway or load balancer. How to obtain it depends on timing:

If Envoy Gateway is already installed (recommended, install it before the NVCF stack):

GATEWAY_ADDRESS=$(kubectl get gateway -n envoy-gateway -o jsonpath='{.items[0].status.addresses[0].value}')
echo "$GATEWAY_ADDRESS"

Set this value in your environment file:

global:
  domain: "<GATEWAY_ADDRESS>"

If Envoy Gateway is not yet installed: Use a placeholder value (e.g., placeholder.local), complete the helmfile sync, then update the domain once the gateway address is available and re-sync. See Example 4 in examples.md for the update flow.

On AWS EKS: The gateway address is typically the DNS name of the AWS Network Load Balancer created by the Envoy Gateway controller (e.g., k8s-envoygateway-xxxxxxxx.us-west-2.elb.amazonaws.com).

4. Preview and deploy

HELMFILE_ENV=<env-name> helmfile template   # Preview rendered manifests
HELMFILE_ENV=<env-name> helmfile sync       # Deploy

Set BOTH Cassandra passwords in the secrets file. The secrets file documents DEFAULT_CASSANDRA_PASSWORD, but the cassandra chart has a second credential, cassandra.serviceRolePassword, that defaults to ch@ng3m3. If you set only DEFAULT_CASSANDRA_PASSWORD, the service role keeps that default and every DB-backed service crash-loops with Bad credentials once it tries to authenticate. Set cassandra.serviceRolePassword to the same value as DEFAULT_CASSANDRA_PASSWORD.

5. Deployment phases

Helmfile deploys in order with dependencies:

Phase	Selector	Services	Wait time
1	`release-group=dependencies`	NATS, OpenBao, Cassandra	5-10 min
2	`release-group=services`	API, SIS, ESS, invocation, grpc-proxy, notary, api-keys, optional LLM gateway/router, optional Vanity Gateway when the stack package includes the addon	5-10 min
3	`release-group=ingress`	Gateway routes	1-2 min
4	`release-group=observability`	Observability stack (if enabled)	1-2 min

Monitor in a separate terminal:

kubectl get pods -A -w

6. Enable Vanity Gateway (optional)

Vanity Gateway is disabled by default and is only needed for customer-facing hostnames or path mappings in front of the standard NVCF API and invocation routes. It is available only in stack packages that include the Vanity Gateway addon. If the extracted stack package does not contain a vanity-gateway release and vanityGateway route values, skip these commands and use a newer stack package that includes them.

If the stack package includes the addon, enable it in environments/<env-name>.yaml:

addons:
  vanityGateway:
    enabled: true
    mappingConfig: {}

The default route host is vanity.<domain> and the backend is vanity-gateway.nvcf:8080. Put deployment-specific host and path mappings under addons.vanityGateway.mappingConfig. If you need custom vanity hosts instead of vanity.<domain>, use the route hostname overrides supported by the stack package and create matching DNS records.

After confirming the package includes the vanity-gateway release, preview and apply the service and route:

HELMFILE_ENV=<env-name> helmfile --selector name=vanity-gateway template
HELMFILE_ENV=<env-name> helmfile --selector name=vanity-gateway sync
HELMFILE_ENV=<env-name> helmfile --selector release-group=ingress sync

Verify the route only when the addon is present and enabled:

kubectl get deploy,svc -n nvcf -l app.kubernetes.io/name=vanity-gateway
kubectl get httproute -A | grep -i vanity
curl -H "Host: vanity.<domain>" "http://<gateway-address>/health"

Do not enable Vanity Gateway for standard API, API Keys, invocation, LLM invocation, or gRPC traffic. Those routes are provided by the base gateway routes.

7. Install compute-plane components from the split stack

The control-plane stack no longer ships nvca-operator. Install compute-plane components from nvcf-compute-plane-stack after control-plane install completes.

Recommended path:

nvcf-cli self-hosted compute-plane register \
  --control-plane-profile deploy/stacks/self-managed/out/control-plane-profile.yaml \
  --cluster-name <cluster-name> \
  --kube-context <compute-kube-context> \
  --region <region> \
  --output deploy/stacks/nvcf-compute-plane/out/<cluster-name>-register-values.yaml

nvcf-cli self-hosted compute-plane install \
  --cluster-name <cluster-name> \
  --kube-context <compute-kube-context> \
  --values deploy/stacks/nvcf-compute-plane/out/<cluster-name>-register-values.yaml

For CLI-driven compute-plane lifecycle details, use the nvcf-self-managed-cli skill.

Clean Teardown

Scope: Only destroy releases managed by this control-plane helmfile stack. The NVCF releases are: nats, openbao-server, cassandra, api-keys, sis, api, invocation-service, grpc-proxy, ess-api, notary-service, optional vanity-gateway, admin-issuer-proxy, and ingress. The control-plane namespaces are: cassandra-system, nats-system, nvcf, api-keys, ess, sis, and vault-system. Do not delete other helm releases or namespaces on the cluster.

Standard teardown

Run from inside the nvcf-self-managed-stack/ directory:

HELMFILE_ENV=<env-name> helmfile destroy

Delete namespaces

for ns in cassandra-system nats-system nvcf api-keys ess sis vault-system cert-manager nvca-system nvcf-backend; do
  kubectl delete namespace "$ns" --ignore-not-found
done

Verify clean

kubectl get ns | grep -E '(cassandra|nats|vault|nvcf|api-keys|ess|sis)'
# Should return empty

Overriding Helm Chart Values

Environment file (limited)

Works only for keys that global.yaml.gotmpl propagates (e.g., cassandra.replicaCount, global.storageClass, global.image).

Release values block (any chart value)

Edit the release in helmfile.d/*.yaml.gotmpl and add a values: block. In the examples below, <private-values> refers to the secrets/ directory at the helmfile stack root.

- name: cassandra
  version: 0.8.0
  condition: cassandra.enabled
  namespace: cassandra-system
  <<: *dependency
  values:
    - ../global.yaml.gotmpl
    - ../<private-values>/{{ requiredEnv "HELMFILE_ENV" }}-secrets.yaml
    - cassandra:
        resources:
          limits:
            cpu: "8"
            memory: 8192Mi
          requests:
            cpu: "2"
            memory: 4096Mi

YAML merge gotcha: When a release uses <<: *dependency or inherit, specifying values: replaces the template's values list. You must re-include global.yaml.gotmpl and the secrets file.

Preview and apply a single release

HELMFILE_ENV=<env> helmfile --selector name=cassandra template  # Preview
HELMFILE_ENV=<env> helmfile --selector name=cassandra sync      # Apply

Image Pull Secrets

There are three distinct credential types. Do not conflate them, and mind the format rule below (two of them fail even with the correct key if the format is wrong):

	Control Plane Pull Secrets	API Bootstrap Registry Creds	Sidecar Image-Pull Secret
Purpose	K8s pulls NVCF service images	nvcf-api validates the user function image at `function create`	nvcf-api pulls the platform sidecars injected into every worker pod
Where	K8s `docker-registry` Secrets + Kyverno ClusterPolicy	`<private-values>/<env>-secrets.yaml` (account-bootstrap)	`NVCF_API_SIDECARS_IMAGE_PULL_SECRET` in `<env>-secrets.yaml` -> OpenBao `services/nvcf-api/kv/sidecars/image-pull-secret` (field `secret`)
Format	standard `.dockerconfigjson` (kubectl builds it for you)	`base64("$oauthtoken:<key>")`	`base64("$oauthtoken:<key>")`

Note: the account-bootstrap cred and the sidecar secret are NOT a dockerconfigjson. The secrets template ships the placeholder REPLACE_WITH_BASE64_DOCKER_CREDENTIAL, which reads as "base64 of a dockerconfigjson." It isn't. Both values are consumed as a raw docker auth string: nvcf-api takes the value verbatim and drops it into the auth field of the dockerconfig it builds. So the value must be base64("$oauthtoken:<key>"), e.g. printf '$oauthtoken:%s' "$KEY" | openssl base64 -A, not base64(<dockerconfigjson>).
Wrong account-bootstrap value: function create returns 400 "must be base64 encoded username:password format".
Wrong sidecar value: every worker's sidecar pull gets a malformed credential and 403s, regardless of which key is inside. The symptom looks like a bad or wrong-org key, but the key is fine and the encoding is the bug. To repair an already-installed stack without a full re-sync:
AUTH=$(printf '$oauthtoken:%s' "$KEY" | openssl base64 -A)
bao kv put services/nvcf-api/kv/sidecars/image-pull-secret secret="$AUTH"
# if NVIDIA Cloud Tasks is enabled, repeat for services/nvct-api/kv/sidecars/image-pull-secret
kubectl rollout restart -n nvcf deploy/nvcf-api   # picks up the new sidecar cred

Note (multi-org): when pulling from NGC (not a mirror), the user function image and the platform sidecars often live in different nvcr.io orgs, each needing a different key, so the account-bootstrap cred (function image) and the sidecar secret (sidecars) are frequently two different keys. Mirroring everything into a single registry the cluster authenticates to automatically collapses this to one credential path.

Configuring with Kyverno (recommended)

Use a Kyverno mutating admission policy to automatically inject imagePullSecrets into all pods in NVCF namespaces. This works uniformly for all charts -- no per-chart configuration or helmfile modifications needed.

# 1. Install Kyverno
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno -n kyverno --create-namespace

# 2. Create pull secret in each namespace
export NGC_API_KEY="<your-key>"
for ns in cassandra-system nats-system nvcf api-keys ess sis vault-system cert-manager; do
  kubectl create secret docker-registry nvcr-pull-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY" \
    --namespace="$ns" \
    --dry-run=client -o yaml | kubectl apply -f -
done

# 3. Apply the Kyverno ClusterPolicy
kubectl apply -f kyverno-imagepullsecret-policy.yaml

The policy mutates every pod at admission time, adding imagePullSecrets: [{name: nvcr-pull-secret}]. Verify with:

kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.imagePullSecrets}'
# Should show: [{"name":"nvcr-pull-secret"}]

Not needed if using a CSP built-in credential helper (e.g., ECR with IAM node roles).

For the Kyverno policy YAML and pull secret creation script, see references/pull-secrets.md.

Fake GPU Operator (compute-plane only)

Fake GPU setup and NVCA validation now belong to the split compute-plane stack. Use nvcf-cli self-hosted compute-plane register/install and follow the nvcf-self-managed-cli skill for compute-plane prerequisites, fake GPU workflows, and NVCA troubleshooting.

Debugging

Quick status check

kubectl get pods -A -o wide                        # All pods
helm list -A                                        # All helm releases
kubectl get events -n <ns> --sort-by='.lastTimestamp'  # Recent events

Common failure patterns

Symptom	Cause	Fix
`ImagePullBackOff` + `401 Unauthorized`	Missing or wrong pull secret	Check secret exists, check SA has imagePullSecrets
`Init:0/1` stuck on service pods	Vault-agent waiting for OpenBao	Check OpenBao pods + migration job status
`OOMKilled` on Cassandra	Default resources too small	Override `cassandra.resources` via values block
DB-backed services crash-loop `Bad credentials` connecting to Cassandra	Secrets file set only `DEFAULT_CASSANDRA_PASSWORD`; `cassandra.serviceRolePassword` still defaults to `ch@ng3m3`	Set `cassandra.serviceRolePassword` in the secrets file equal to `DEFAULT_CASSANDRA_PASSWORD`, then re-sync cassandra + the DB-backed services
`Pending` pods	Node selector mismatch or no storage class	`kubectl describe pod`, check labels and storage
Helm release in `failed` state	First install failed partway	`helmfile destroy` the release, then `sync` again
Account bootstrap timeout	Wrong base64 credentials in secrets file	Check `kubectl logs job/nvcf-api-account-bootstrap -n nvcf`
function / account calls return `404 Unknown client_id`	The account-bootstrap post-install hook never ran: the `api` release install failed, or was repaired with live patches instead of a clean re-sync, so the hook (and the account) was skipped	Verify and re-run the hook; see Verify the account bootstrap ran
NVCA agent `CrashLoopBackOff` + "no backend GPUs found"	No GPU operator or fake GPUs on cluster	Install fake-gpu-operator, see Fake GPU Operator
`ImagePullBackOff` in `nvca-system`	Pull secret missing in operator-created namespace	Create secret + update Kyverno policy to include `nvca-system`
Compute-plane install issues (`nvca-operator`, `nvcf-backend`, fake GPU setup)	Split compute-plane stack prerequisites or values are missing	Use `nvcf-self-managed-cli` skill workflows for `compute-plane register/install` diagnostics
Services fail to read vault secrets; `secrets.json` not found	Vault path hardcoded to `/home/app/vault/` in `_helpers.tpl`; the runtime resolves the mounted path relative to the working directory and drops the leading `/`	Override `podAnnotations` and set `JAVA_TOOL_OPTIONS: "-Duser.dir=/"` in release values
NATS connection fails at startup; placement tag mismatch	NATS server tags hardcoded to `dc:ncp`; app derives tag from `AWS_REGION` (e.g., `us-gov-west-1`)	Set `AWS_REGION=ncp` and `NVCF_AWS_REGION=ncp` in env config

Verify the account bootstrap ran

The default account is created by a post-install hook on the api release (job/nvcf-api-account-bootstrap in nvcf). Because it is a post-install hook, it only fires on a successful api install. If that install failed and you repaired it with live patches (rather than a clean re-sync), the hook never runs, no account is created, and every subsequent function/account call returns a cryptic 404 Unknown client_id with nothing obviously wrong in the running pods.

Verify it ran after the first sync:

kubectl get job -n nvcf nvcf-api-account-bootstrap \
  -o jsonpath='{.status.succeeded}'   # expect 1

If the job is absent or 0, re-run it by re-syncing the api release so the hook fires again (delete the prior job first if it lingers in a completed/failed state and blocks the hook):

kubectl delete job -n nvcf nvcf-api-account-bootstrap --ignore-not-found
HELMFILE_ENV=<env> helmfile --selector name=api sync
kubectl logs -n nvcf job/nvcf-api-account-bootstrap   # confirm the account was created

For expanded debugging recipes, see references/debugging.md.

Additional Resources

For worked examples, see examples.md
For a commented EKS environment file, see references/eks-example.yaml
For helmfile structure details, see references/helmfile-structure.md
For per-chart imagePullSecrets keys, see references/pull-secrets.md
For debugging recipes, see references/debugging.md

After deployment, use the nvcf-self-managed-cli skill to create functions, create tasks, manage API keys, and invoke endpoints.