exec-slurm-compile - SKILL.md Agent Skill

name: exec-slurm-compile
description: Compile TensorRT-LLM on a SLURM cluster. Covers submitting a batch job with a container image, monitoring the job, and verifying the build. Use when the user wants to compile TRT-LLM remotely via SLURM rather than on a local compute node.
license: Apache-2.0
metadata:
  author: NVIDIA Corporation

Compile TensorRT-LLM on SLURM Cluster

Submit, monitor, and verify a TensorRT-LLM compilation job on a SLURM cluster using enroot containers.

When to Use

Scenario	Use This Skill?
User wants to compile TRT-LLM on a SLURM cluster	Yes
User is already on a compute node and wants to compile	No — use `exec-local-compile` skill instead

Finding the Docker Image

The official Docker image tag for a given TensorRT-LLM version is recorded in the repo itself:

<repo_dir>/jenkins/current_image_tags.properties

Read this file to find the current image URL. The format is <registry-host>/<namespace>/tensorrt-llm:<tag>, for example <your-registry>/<your-namespace>/tensorrt-llm:pytorch-25.12-py3-aarch64-ubuntu24.04-trt10.14.1.48-skip-tritondevel-202602011118-10901. Substitute <your-registry> and <your-namespace> with your own container registry coordinates.
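
The lookup can be scripted. The following is a minimal sketch, assuming the properties file holds plain `KEY=value` lines; the function name `image_url_from_properties` is hypothetical, and the key to pass depends on what your checkout of the file actually contains.

```shell
# Print the value of KEY from a KEY=value properties file.
# Assumption: one "KEY=value" pair per line. The function name is
# hypothetical and not part of the TensorRT-LLM repository.
image_url_from_properties() (
    file=$1
    key=$2
    # Keep only lines that start with "KEY=", strip that prefix,
    # and stop at the first match.
    sed -n "s/^${key}=//p" "$file" | head -n 1
)
```

For example, `image_url_from_properties <repo_dir>/jenkins/current_image_tags.properties <key>` prints the image URL stored under `<key>`, or nothing if the key is absent.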

Pre-dumping the Container Image (enroot import)

SLURM clusters using enroot/pyxis require a .sqsh container image. To keep the image download out of the compile job, import the image ahead of time with the enroot-import companion script:

# Basic usage — submits a SLURM job on a CPU partition to import the image
enroot-import --partition cpu_datamover --debug <docker_image_url>

The script submits an sbatch job that runs enroot import docker://<image_url> and produces a .sqsh file in the current directory. The output on stdout is the SLURM job ID.

enroot-import flags

Flag	Description
`-p, --partition`	SLURM partition for the import job (use a CPU partition like `cpu_datamover`)
`-d, --debug`	Enable debug output and preserve the SLURM log (recommended)
`-o, --output`	Custom output path for the `.sqsh` file
`-A, --account`	SLURM account (defaults to user's first account)
`-t, --time`	Time limit for the import job (default: 1 hour)
`-n, --just-print`	Print the sbatch command without executing
`-J, --job-name`	Custom job name

enroot-import workflow

Read the image tag from jenkins/current_image_tags.properties in the TRT-LLM repo.
Run enroot-import to submit the import job:
```
cd <directory_where_sqsh_should_be_stored>
<path_to>/enroot-import --partition cpu_datamover --debug <image_url>
```
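IMPORTANT: Replace the first / after the registry host with #. enroot separates the registry from the image path with # (its URI form is docker://[USER@][REGISTRY#]IMAGE[:TAG]); if the / is left in place, the host is read as part of the image name and the import typically fails with credential errors. For example, <your-registry>/<your-namespace>/tensorrt-llm:xxx becomes <your-registry>#<your-namespace>/tensorrt-llm:xxx.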
Wait for the import job to complete (squeue -j <job_id>).
The resulting .sqsh file is the container_image used in the compile step.
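
The `/` to `#` rewrite in this workflow is easy to get wrong by hand. A small sketch of a helper that does it; the name `to_enroot_url` is hypothetical, and stripping a pasted `https://` or `docker://` prefix is an added convenience, not something the enroot-import script is known to require.

```shell
# Rewrite "<registry>/<namespace>/<image>:<tag>" into the form enroot
# expects, "<registry>#<namespace>/<image>:<tag>".
# The function name is hypothetical.
to_enroot_url() (
    url=$1
    # Drop a leading scheme if one was pasted in by mistake.
    url=${url#https://}
    url=${url#docker://}
    case $url in
        *'#'*) printf '%s\n' "$url" ;;   # already in enroot form
        */*)   printf '%s#%s\n' "${url%%/*}" "${url#*/}" ;;   # first "/" becomes "#"
        *)     printf '%s\n' "$url" ;;   # no registry host to separate
    esac
)
```

Usage: `<path_to>/enroot-import --partition cpu_datamover --debug "$(to_enroot_url "<image_url>")"`.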

Prerequisites

The user must provide (or you must ask for) these values:

Parameter	Description	Example
`container_image`	Path to `.sqsh` container image (see enroot import above)	`/path/to/pytorch.sqsh`
`repo_dir`	Path to the TensorRT-LLM repository	`/path/to/TensorRT-LLM`
`mount_dir`	Top-level directory to bind-mount into the container	`/shared/users`
`partition`	SLURM partition	`batch`
`account`	SLURM account	`my_account`

Optional parameters:

Parameter	Description	Default
`jobname`	SLURM job name	`trtllm-compile.<username>`
`gpu_count`	Number of GPUs to request	`4`
`time_limit`	Job time limit	`02:00:00`
`arch`	GPU architecture(s) for `-a` flag	`100-real`
`extra_build_args`	Extra flags for `build_wheel.py`	(none)
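
One way to apply the defaults from the table above is shell parameter expansion. This is a sketch only: the function name `apply_compile_defaults` is hypothetical, and it assumes the parameters are held in shell variables named exactly as in the table.

```shell
# Fill in the optional parameters with their documented defaults.
# Values that are already set and non-empty are left untouched.
apply_compile_defaults() {
    : "${jobname:=trtllm-compile.${USER:-unknown}}"
    : "${gpu_count:=4}"
    : "${time_limit:=02:00:00}"
    : "${arch:=100-real}"
    : "${extra_build_args:=}"
}
```

Calling `gpu_count=8; apply_compile_defaults` keeps `gpu_count` at 8 and fills in the rest.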

Companion Scripts

This skill includes four companion scripts in scripts/:

Script	Purpose
`enroot-import`	Pre-dump a Docker image to `.sqsh` via a SLURM batch job
`submit_compile.sh`	Template for submitting the SLURM job — copy and customize
`compile.slurm`	SLURM batch script — launches the container and calls `compile.sh`
`compile.sh`	Runs inside the container — executes `build_wheel.py`

Scripts directory: skills/exec-slurm-compile/scripts/

Instructions

Follow these steps in order:

Step 0: Resolve the Container Image (if needed)

If the user does not already have a .sqsh container image:

Read the Docker image tag from <repo_dir>/jenkins/current_image_tags.properties.

Use enroot-import to pre-dump it:

cd <directory_for_sqsh_files>
<scripts_dir>/enroot-import --partition cpu_datamover --debug <image_url>

Monitor the import job with squeue -j <job_id>.
Once complete, the .sqsh file path becomes the container_image parameter.

If the user already has a .sqsh file, skip this step.

Step 1: Gather Information

Ask the user for any missing prerequisite values listed above. At minimum you need:

container_image (or the Docker image URL — then run Step 0 first)
repo_dir
mount_dir
partition and account

If the user has used this workflow before, check if previous values are stored in memory files.
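
To decide what still needs to be asked, the required values can be checked in one pass. A sketch, assuming the five required parameters live in shell variables with the names used in the Prerequisites table; the function name `missing_required_params` is hypothetical.

```shell
# Print the name of every required parameter that is unset or empty,
# one per line. Prints nothing when all five values are present.
missing_required_params() (
    for name in container_image repo_dir mount_dir partition account; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
)
```

Each name printed corresponds to one question to put to the user before moving on to Step 2.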

Step 2: Prepare the Scripts Directory

The compile scripts must be accessible from inside the container (i.e., under mount_dir). Either:

Option A — Copy companion scripts to a location under mount_dir:

scripts_dir=<mount_dir>/<username>/workspace/tensorrt_llm_scripts
mkdir -p "${scripts_dir}/log"
cp skills/exec-slurm-compile/scripts/compile.sh "${scripts_dir}/"
cp skills/exec-slurm-compile/scripts/compile.slurm "${scripts_dir}/"
chmod +x "${scripts_dir}/compile.sh" "${scripts_dir}/compile.slurm"

Option B — If the user already has scripts at a known location, use those directly.
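
With either option, it is worth confirming that the scripts directory really sits under mount_dir before submitting. A minimal sketch of such a check; the helper name `is_under_mount` is hypothetical, and the comparison is purely textual (it does not resolve symlinks or `..` components).

```shell
# Succeed (exit status 0) when the first path lies at or below the second.
# Trailing slashes on either argument are ignored.
is_under_mount() (
    path=${1%/}
    mount=${2%/}
    case $path/ in
        "$mount"/*) exit 0 ;;
        *)          exit 1 ;;
    esac
)
```

Usage: `is_under_mount "$scripts_dir" "$mount_dir" || echo "scripts_dir is not visible inside the container" >&2`.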

Step 3: Submit the Job

Run sbatch from the login node (or a node with SLURM client access):

sbatch \
    --nodes=1 --ntasks=1 --ntasks-per-node=1 \
    --gres=gpu:<gpu_count> \
    --partition=<partition> \
    --account=<account> \
    --job-name=<jobname> \
    --time=<time_limit> \
    <scripts_dir>/compile.slurm \
    <container_image> <mount_dir> <scripts_dir> <repo_dir>

Capture and report the job ID from the sbatch output.
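
The submission and job ID capture can be sketched as two small helpers. Both names (`print_sbatch_command`, `parse_job_id`) are hypothetical. The first only prints the command shown above so it can be reviewed before running; it assumes paths contain no spaces. The second relies on sbatch's usual `Submitted batch job <id>` output line.

```shell
# Print the sbatch command line for the compile job without running it.
# Arguments: container_image mount_dir scripts_dir repo_dir partition account
# Optional variables: gpu_count, jobname, time_limit (documented defaults apply).
print_sbatch_command() (
    printf 'sbatch --nodes=1 --ntasks=1 --ntasks-per-node=1'
    printf ' --gres=gpu:%s' "${gpu_count:-4}"
    printf ' --partition=%s --account=%s' "$5" "$6"
    printf ' --job-name=%s' "${jobname:-trtllm-compile.${USER:-unknown}}"
    printf ' --time=%s' "${time_limit:-02:00:00}"
    printf ' %s/compile.slurm %s %s %s %s\n' "$3" "$1" "$2" "$3" "$4"
)

# Read sbatch output on stdin and print only the numeric job ID.
parse_job_id() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}
```

After reviewing the printed command, run it and pipe its output through `parse_job_id` to obtain the ID used in Steps 4 and 5.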

Step 4: Monitor the Job (Proactive — Do NOT Wait for User)

You MUST actively poll the job until it completes. Do not submit and walk away.

# Check job status (repeat every 30-60 seconds)
squeue -j <job_id> -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

# Once running, periodically read the end of the log (do NOT use tail -f, which never exits; use tail -n 30 instead)
tail -n 30 <scripts_dir>/log/compile_<job_id>.srun.log

Monitoring loop:

Poll squeue -j <job_id> to check state
If PD (pending) — report the reason, keep polling every 30-60s
If R (running) — tail the build log every 30-60s; look for [XX%] Building, errors, or completion
If the job disappears from squeue, it has finished — proceed to Step 5
If F (failed) — immediately read the full log and report the error
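
The decision made on each pass of the monitoring loop can be written as a lookup on the squeue state code. A sketch; the function name `next_action` and the action labels it prints are hypothetical.

```shell
# Map a squeue state code (the %t column) to the next step of the loop.
# An empty argument means squeue no longer lists the job.
next_action() {
    case $1 in
        PD) echo "poll" ;;       # pending: report the reason, keep polling
        R)  echo "tail-log" ;;   # running: read the end of the build log
        F)  echo "read-log" ;;   # failed: read the full log, report the error
        '') echo "verify" ;;     # gone from squeue: proceed to Step 5
        *)  echo "poll" ;;       # any other state: keep polling
    esac
}
```

The state code for a job can be read with `squeue -h -j <job_id> -o %t`, which prints nothing once the job has left the queue.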

Progress indicators to look for in the log:

[XX%] Building CXX object... — compilation progress
Linking CXX... — link phase
FAILED:, error:, fatal error: — build failure
Successfully built — success
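
The indicators above can be checked mechanically with grep. This is a heuristic sketch, not a parser for the real build output: the function name `build_phase` is hypothetical, and any line containing `error:` counts as a failure, so it can report false positives.

```shell
# Classify a build log by the indicators listed above.
# Prints one of: failed, succeeded, linking, compiling, unknown.
build_phase() (
    log=$1
    if grep -Eq 'FAILED:|error:' "$log"; then
        echo failed        # also matches "fatal error:"
    elif grep -q 'Successfully built' "$log"; then
        echo succeeded
    elif tail -n 30 "$log" | grep -q 'Linking CXX'; then
        echo linking       # link step seen in the most recent lines
    elif grep -Eq '\[ *[0-9]+%\] Building' "$log"; then
        echo compiling
    else
        echo unknown
    fi
)
```

Usage: `build_phase <scripts_dir>/log/compile_<job_id>.srun.log`.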

Step 5: Verify the Build

Once the job completes, check for success:

# Check SLURM exit code
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed

# Check the build log for errors
tail -n 50 <scripts_dir>/log/compile_<job_id>.srun.log

A successful build shows State COMPLETED with ExitCode 0:0 in the sacct output, and its log ends with a message like Successfully built tensorrt_llm, with no FAILED: or error: lines.
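
The sacct check can be reduced to a single test on machine-readable output. A sketch, assuming the input line comes from `sacct -j <job_id> -X --noheader --parsable2 --format=State,ExitCode`, which prints one `State|ExitCode` line for the job allocation; the function name `job_succeeded` is hypothetical.

```shell
# Succeed (exit status 0) only when a "State|ExitCode" line from sacct
# reports a clean finish. Any other state or exit code counts as failure.
job_succeeded() {
    case $1 in
        'COMPLETED|0:0') return 0 ;;
        *)               return 1 ;;
    esac
}
```

If the check fails, read the full build log before reporting the error to the user.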

Common Build Flags Reference

Flag	Description
`--trt_root /usr/local/tensorrt`	TensorRT installation path (standard in NVIDIA containers)
`--benchmarks`	Build the C++ benchmarks
`-a "100-real"`	Target architecture — `100` for Blackwell, `90` for Hopper, etc.
`--nvtx`	Enable NVTX markers for profiling
`--no-venv`	Skip virtual environment creation
`--use_ccache`	Use ccache to speed up recompilation
`--skip_building_wheel`	Build in-place without creating a wheel file
`-f`	Fast build — skip some kernels for faster dev compilation
`-c`	Clean build — wipe build directory before building

Common architecture values:

"100-real" — Blackwell (B200, GB200)
"90-real" — Hopper (H100, H200)
"89-real" — Ada Lovelace (L40S)
"80-real" — Ampere (A100)
"90;100-real" — Multiple architectures
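
When the user names a GPU model rather than an architecture, the list above can serve as a lookup table. A sketch covering only the models named in this document; the function name `arch_for_gpu` is hypothetical.

```shell
# Map a GPU model name to the -a value from the list above.
# Unknown models produce an error on stderr and a non-zero status.
arch_for_gpu() {
    case $1 in
        B200|GB200) echo "100-real" ;;
        H100|H200)  echo "90-real" ;;
        L40S)       echo "89-real" ;;
        A100)       echo "80-real" ;;
        *)          echo "unknown GPU model: $1" >&2; return 1 ;;
    esac
}
```

Usage: `arch=$(arch_for_gpu H100)` sets `arch` to `90-real`.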

Troubleshooting

Issue	Solution
`sbatch: error: invalid partition`	Verify partition name with `sinfo -s`
`sbatch: error: invalid account`	Check available accounts with `sacctmgr show assoc user=$USER`
Container image not found	Verify the `.sqsh` path exists and is readable
Build fails with missing TensorRT	Ensure `--trt_root` points to the correct path inside the container
Build OOM (out of memory)	Reduce parallelism with `-j <N>` flag to `build_wheel.py`
`srun: error: Unable to create step`	The node may lack enroot/pyxis — check with cluster admin
Job stuck in `PD` state	Check `squeue -j <id> -o %R` for the reason (e.g., resource limits, priority)
`enroot import` fails with auth error	Check `~/.config/enroot/.credentials` has the correct registry credentials
`enroot import` produces empty/corrupt `.sqsh`	Re-run with `--debug` and check the SLURM log; verify the image URL has no `https://` prefix
Unexplained or inconsistent compile errors	Retry with a clean build (`-c` flag)
`QOSGrpNodeLimit` shown in `NODELIST(REASON)`	Not a blocker; wait for the job to be scheduled

Example Interaction

User: "Compile TRT-LLM on the OCI cluster"

Agent actions:

Ask for container image path, repo path, mount dir (if not known)
Confirm partition/account for OCI cluster
Copy scripts to accessible location under mount_dir
Submit with sbatch
Report job ID
Monitor with squeue until complete
Check logs and report success/failure