nrp-k8s-batch

name: nrp-k8s-batch
description: >
  Run batch processing jobs on the NRP (National Research Platform) Nautilus Kubernetes cluster. Covers the mandatory requirements for CPU jobs: opportunistic priority class, resource requests/limits, and GPU node avoidance. Use when creating or managing Kubernetes jobs on the NRP Nautilus cluster, or when the user mentions NRP, Nautilus, or needs to run batch workloads on a shared academic cluster.
license: Apache-2.0
compatibility: >
  Requires kubectl configured for the NRP Nautilus cluster (namespace: biodiversity). Works with any agent that can run shell commands.
metadata:
  author: boettiger-lab
  version: "1.0"

NRP Kubernetes Batch Jobs

The NRP (National Research Platform) Nautilus cluster is a shared academic Kubernetes cluster primarily designed for GPU workloads. CPU-only batch jobs require specific configuration to coexist properly.

Namespace

All our jobs run in the biodiversity namespace:

kubectl -n biodiversity get jobs

Mandatory Requirements for CPU Jobs

1. Priority class (REQUIRED)

All CPU jobs must use the opportunistic priority class. This makes pods preemptible so they don't block GPU users. Without this, your job may be rejected or cause problems for other users.

spec:
  template:
    spec:
      priorityClassName: opportunistic

2. Resource requests and limits (REQUIRED)

The NRP cluster requires both requests and limits on every container. Jobs without resource specifications will not be scheduled.

Set requests equal to limits for CPU and memory so the pod gets the Guaranteed QoS (Quality of Service) class. Be respectful and request only what you need:

resources:
  requests:
    cpu: "4"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "8Gi"

If you need ephemeral scratch disk (e.g., for large temporary files), request it explicitly:

resources:
  requests:
    cpu: "4"
    memory: "32Gi"
    ephemeral-storage: "250Gi"
  limits:
    cpu: "4"
    memory: "32Gi"
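One way to use that scratch space is an emptyDir volume, whose contents count against the pod's ephemeral-storage usage. The fragment below is a sketch under assumptions, not a cluster requirement: the volume name `scratch`, the mount path `/scratch`, and the `sizeLimit` value are hypothetical choices.

```yaml
# Fragment of a Job's pod template (spec.template.spec)
containers:
  - name: worker
    image: ghcr.io/boettiger-lab/datasets:latest
    resources:
      requests:
        cpu: "4"
        memory: "32Gi"
        ephemeral-storage: "250Gi"
      limits:
        cpu: "4"
        memory: "32Gi"
    volumeMounts:
      - name: scratch          # hypothetical volume name
        mountPath: /scratch    # hypothetical path for temporary files
volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 250Gi         # assumed to match the ephemeral-storage request
```

Files written under `/scratch` are deleted when the pod goes away, so copy anything worth keeping to persistent storage before the job exits.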

3. GPU node avoidance (recommended)

To avoid wasting GPU node capacity on CPU-only work, add a node affinity rule that excludes nodes labeled as having an NVIDIA PCI device (vendor ID 10de). This is strongly recommended but not strictly enforced:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: feature.node.kubernetes.io/pci-10de.present
                    operator: NotIn
                    values: ["true"]

Minimal Job Example

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 10800
  template:
    spec:
      priorityClassName: opportunistic
      restartPolicy: Never
      containers:
        - name: worker
          image: ghcr.io/boettiger-lab/datasets:latest
          command: ["bash", "-c", "echo hello"]
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "8Gi"

Secrets

Two secrets are available in the biodiversity namespace. See the nrp-s3 skill for full details on S3 environment variables.

`aws` — S3 credentials (environment variables)

env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws
        key: AWS_SECRET_ACCESS_KEY

`rclone-config` — Rclone configuration (volume mount)

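# Under the container (same level as `resources`)
volumeMounts:
  - name: rclone-config
    mountPath: /root/.config/rclone
    readOnly: true
# Under the pod spec (spec.template.spec, same level as `containers`)
volumes:
  - name: rclone-config
    secret:
      secretName: rclone-config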
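The two secret fragments above attach at different levels of the manifest: `env` and `volumeMounts` belong to the container, while `volumes` belongs to the pod spec. The following sketch shows where each goes in a complete Job. The job name and the `rclone listremotes` command are hypothetical placeholders, and it assumes the image ships `rclone`.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-secrets-job              # hypothetical name
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 10800
  template:
    spec:
      priorityClassName: opportunistic
      restartPolicy: Never
      containers:
        - name: worker
          image: ghcr.io/boettiger-lab/datasets:latest
          # Placeholder command; assumes rclone is installed in the image
          command: ["bash", "-c", "rclone listremotes"]
          env:                      # container level: S3 credentials from the `aws` secret
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws
                  key: AWS_SECRET_ACCESS_KEY
          volumeMounts:             # container level: where the rclone config appears
            - name: rclone-config
              mountPath: /root/.config/rclone
              readOnly: true
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
      volumes:                      # pod level: backs the volumeMount above
        - name: rclone-config
          secret:
            secretName: rclone-config
```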

Useful Fields

Field	Purpose
`ttlSecondsAfterFinished: 10800`	Auto-deletes finished jobs (succeeded or failed) 3 hours after they finish, so stale jobs and pods do not accumulate
`completionMode: Indexed`	For parallel workloads: each pod gets a unique `JOB_COMPLETION_INDEX`
`backoffLimitPerIndex`	Retries per index (instead of global `backoffLimit`); requires `completionMode: Indexed`
`podFailurePolicy` rule with `action: Ignore` on the `DisruptionTarget` condition	Don't count preemptions as failures (important with `opportunistic` priority); requires `restartPolicy: Never`
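The fields in this table can be combined in one indexed Job, sketched below. The job name, the `completions` and `parallelism` values, and the `echo` command are hypothetical. The sketch also assumes the cluster runs a Kubernetes version in which `backoffLimitPerIndex` and `podFailurePolicy` are enabled.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-indexed-job              # hypothetical name
spec:
  completionMode: Indexed           # each pod receives JOB_COMPLETION_INDEX
  completions: 20                   # hypothetical: indexes 0..19
  parallelism: 5                    # hypothetical: at most 5 pods at once
  backoffLimitPerIndex: 2           # retries counted separately for each index
  ttlSecondsAfterFinished: 10800    # delete the Job 3 hours after it finishes
  podFailurePolicy:
    rules:
      - action: Ignore              # do not count this failure toward the backoff limit
        onPodConditions:
          - type: DisruptionTarget  # set when a pod is preempted or evicted
  template:
    spec:
      priorityClassName: opportunistic
      restartPolicy: Never          # required when podFailurePolicy is used
      containers:
        - name: worker
          image: ghcr.io/boettiger-lab/datasets:latest
          # Placeholder work; bash expands the index at run time
          command: ["bash", "-c", "echo processing chunk ${JOB_COMPLETION_INDEX}"]
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
```

With the `Ignore` rule in place, a pod preempted because of its `opportunistic` priority is replaced without using up one of the retries for its index.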

Common Pitfalls

Missing resource requests/limits — Jobs will not schedule without them. Always specify both, and keep requests = limits.
Forgetting priorityClassName: opportunistic — Required for all CPU jobs on this cluster.
Requesting too many resources — Be respectful. Don't request 64 CPUs if you only need 4.
Max 200 completions per indexed job — Hard limit to avoid overwhelming the cluster's etcd.
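When there are more work items than the 200-completion cap allows, one option is to have each index process several items by striding through the item list. The fragment below sketches this under assumptions: `TOTAL_ITEMS`, `NUM_INDEXES`, the `parallelism` value, and the `echo` placeholder are hypothetical, and `NUM_INDEXES` must be kept equal to `completions` by hand.

```yaml
# Fragment of a Job manifest (apiVersion, kind, and metadata omitted)
spec:
  completionMode: Indexed
  completions: 200                  # stays at the per-job cap
  parallelism: 20                   # hypothetical
  template:
    spec:
      priorityClassName: opportunistic
      restartPolicy: Never
      containers:
        - name: worker
          image: ghcr.io/boettiger-lab/datasets:latest
          env:
            - name: TOTAL_ITEMS     # hypothetical: number of work items
              value: "1000"
            - name: NUM_INDEXES     # hypothetical: must match `completions`
              value: "200"
          command:
            - bash
            - -c
            - |
              # Index i handles items i, i+200, i+400, ... below TOTAL_ITEMS
              for ((i = JOB_COMPLETION_INDEX; i < TOTAL_ITEMS; i += NUM_INDEXES)); do
                echo "processing item ${i}"
              done
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
```

With these values each pod handles five items, so 1000 items fit within 200 completions.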