name: jarvislabs
description: Agent guide for running and monitoring GPU experiments with the jl CLI on JarvisLabs.ai.
JarvisLabs GPU CLI (jl) — Agent Guide
Getting Started
Verify auth with jl status --json before doing anything. If not logged in, use jl setup --token <token> --yes. You can also authenticate via export JL_API_KEY="...".
Use --help on any command to discover flags (e.g., jl run --help, jl create --help). If something goes wrong, use jl run logs, jl run status, and jl exec to diagnose — don't guess.
Mental Model
- Machine commands (jl create/list/get/pause/resume/destroy/rename/ssh/exec/upload/download) = GPU instance lifecycle and access.
- jl run = managed job on an instance. Uploads code, sets up a Python environment, runs your script in the background with log tracking.
- jl exec = run any command on an instance. Use for system checks (nvidia-smi, ps, df), debugging failed runs, inspecting files, or any raw shell access. No environment setup, no tracking. This is your escape hatch when jl run doesn't cover your use case.
Instances
Creating
jl create --gpu L4 --storage 40 --yes --json
--gpu is required. Run jl create --help for all available flags.
Lifecycle
Instances have three states that matter: Running (billing active), Paused (compute billing stopped, storage billing continues, data persists), Destroyed (everything deleted).
jl pause <id> --yes --json # stop compute billing, keep data
jl resume <id> --yes --json # restart a paused instance
jl destroy <id> --yes --json # permanently delete
Resume caveats:
- Resume is region-locked — an instance always resumes in its original region. GPU swaps are only possible within that region.
- Resume may return a new machine_id. Always use the returned ID for subsequent operations.
- Run jl resume --help for all available flags (GPU swap, storage expansion, rename, etc.).
SSH, exec, upload, and download only work on Running instances.
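Because a resume may hand back a different ID, it can help to capture the returned value instead of reusing the old one. The following is a minimal sketch, not the CLI's documented contract: it assumes the resume JSON exposes the ID as a top-level machine_id field, and json_field is a hypothetical sed-based helper for flat scalar fields only (a real JSON parser such as jq is more robust where available).

```shell
# Hypothetical helper: print the first scalar value of a key from JSON on stdin.
# A sed-based sketch for flat fields only, not a real JSON parser.
json_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\{0,1\}\([^\",}[:space:]]*\).*/\1/p" | head -n 1
}

# Resume an instance and print the ID to use from now on.
# Assumption: the resume JSON carries the ID in a "machine_id" field.
resume_and_get_id() {
  local out new_id
  out=$(jl resume "$1" --yes --json) || return 1
  new_id=$(printf '%s\n' "$out" | json_field machine_id)
  [ -n "$new_id" ] || return 1
  printf '%s\n' "$new_id"
}
```

A caller would then store the result, for example machine_id=$(resume_and_get_id "$machine_id"), and use that value for every later command.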
Regions & GPUs
Valid region codes for new instances: IN2, EU1.
IN1 is winding down. New instances and filesystems can no longer be created in IN1. Existing IN1 instances can still be resumed, paused, destroyed, and renamed; existing IN1 filesystems can still be listed, resized, and removed. Guide users with IN1 resources to the migration doc: https://docs.jarvislabs.ai/in1-migration.
If --region is omitted, the CLI picks a region based on GPU availability.
| Constraint | Detail |
|---|---|
| EU1 | H100 and H200 only, single-GPU launches only right now, 100 GB minimum storage (auto-bumped) |
| VM template | IN2 and EU1 only, requires at least one SSH key, 100 GB minimum storage |
Run jl gpus to check current availability and pricing. Output shows two tables — Containers and VMs — with separate availability for each. The --json output includes a workload_type field ("container", "vm", or null for regions that support both).
Ports & Services
Container instances expose default HTTP ports (each gets its own HTTPS URL):
| Port | Service |
|---|---|
| 8889 | JupyterLab (url field) |
| 7007 | IDE (vs_url field) |
| 6006 | Available on generic templates like pytorch (endpoints[0]) |
VM instances (jl create --gpu ... --vm) get SSH-only access. VMs require at least one SSH key registered (jl ssh-key add). Use ssh_command from jl get <id> --json.
To expose a service (FastAPI, Gradio, etc.), bind to 0.0.0.0:6006 — it's accessible via endpoints[0] on generic templates. Use --http-ports "7860,8080" at creation or resume to expose custom ports. Custom port URLs appear in endpoints after the default 6006 entry.
Run jl get <id> --json to find all service URLs (url, vs_url, endpoints).
Managed Runs
How jl run works
jl run uploads your code to an instance, sets up a Python environment, and runs your script in the background with log and exit code tracking. You need either --on <machine_id> (existing instance) or --gpu <type> (creates a fresh instance).
run_id is tracked locally under ~/.jl/runs/. All run management commands (logs, status, stop, list) depend on these local records. Start and monitor runs from the same machine.
Run targets
| Target | What happens |
|---|---|
| train.py | Uploads to <home>/train.py, runs in <home>/ with shared venv at $HOME/.venv |
| . or ./project with --script train.py | Rsyncs the directory to <home>/<dirname>/, runs inside it with project venv at <home>/<dirname>/.venv |
| No target, command after -- | No upload. Runs from ~. If $HOME/.venv exists (from a previous file run), its bin/ is prepended to PATH so python and pip resolve to venv versions. Otherwise uses system Python. |
Only .py and .sh file targets are supported. For other file types, use a directory target or jl upload + jl exec. Directory targets require rsync installed locally.
Note: File targets with the same basename overwrite each other on the remote (e.g., foo/train.py and bar/train.py both land at /home/train.py). Use directory targets for projects with nested structure.
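The basename collision described in the note can be caught locally before launching several single-file runs on one instance. A small sketch, where basename_collisions is a hypothetical helper name and the file paths are only examples:

```shell
# Print basenames that occur more than once among the given file targets.
basename_collisions() {
  for f in "$@"; do
    basename "$f"
  done | sort | uniq -d
}

# Usage: warn if two file targets would land on the same remote path.
dups=$(basename_collisions foo/train.py bar/train.py eval.py)
[ -z "$dups" ] || echo "would overwrite on remote: $dups"
```

If the check reports anything, switching to a directory target keeps the nested layout intact.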
Pass script arguments after --:
jl run train.py --on <id> --json --yes -- --epochs 50 --lr 0.001
Environment & setup
jl run manages a Python venv on the remote instance. Template packages (torch, etc.) are inherited via --system-site-packages — no need to install them. Venvs persist under the remote home directory across pause/resume.
Venv locations:
- File targets: shared instance-level venv at $HOME/.venv. All file runs share it, so deps installed for one script are available to all.
- Directory targets: per-project venv at <home>/<dirname>/.venv. Isolated per project.
- Command mode: no venv is created. If $HOME/.venv exists from a previous file run, python and pip automatically resolve to it via PATH prepend.
How dependencies get installed:
- Directory targets: auto-detected. If your directory has requirements.txt or pyproject.toml (with [project]), deps are installed automatically. No flag needed.
- File targets: no auto-detection. Pass --requirements requirements.txt if you need extra packages.
- --requirements <file>: overrides auto-detection. Uploads and installs the specified file instead.
- --setup <command>: runs a shell command before your script (e.g., --setup "pip install flash-attn"). Runs inside the venv for file/dir targets, raw for command mode.
# Directory — auto-detects requirements.txt
jl run . --script train.py --on <id> --json --yes
# Single file — pass requirements explicitly
jl run train.py --on <id> --requirements requirements.txt --json --yes
# Extra setup command
jl run . --script train.py --on <id> --setup "pip install flash-attn" --json --yes
Command mode — when you pass a raw command after -- with no file or directory target. Useful when code already exists on the instance (e.g., uploaded via jl upload, written via jl exec, or left by a previous run). If $HOME/.venv exists from a prior file run, its bin/ is prepended to PATH so python and pip resolve to venv versions. You still get jl run log tracking (logs, status, stop), which is the main advantage over jl exec. --requirements is not supported in command mode.
Important: Command mode runs from ~ (the remote shell home). Use absolute paths or cd explicitly for scripts in specific directories.
jl run --on <id> --json --yes -- python3 /home/train.py
jl run --on <id> --json --yes -- sh -lc 'cd /home && torchrun --nproc_per_node=2 train.py'
Running on an existing instance
jl run train.py --on <machine_id> --json --yes
jl run . --script train.py --on <machine_id> --requirements requirements.txt --json --yes
Lifecycle flags (--keep, --pause, --destroy) are not allowed with --on — the instance is not touched after the run.
Running on a fresh instance
jl run . --script train.py --gpu L4 --keep --json --yes
Creates a new instance, uploads code, runs the script. Additional flags: --vm (VM instead of container, auto-bumps storage to 100GB, disallows --template and --http-ports), --template (default: pytorch; run jl templates --json to list available), --storage (default: 40GB), --num-gpus (default: 1), --region, --http-ports.
Lifecycle rules for fresh instances:
- With --json or --no-follow: --keep is required. The CLI rejects --pause and --destroy because it returns immediately and cannot apply lifecycle actions later. Use --keep and have the agent pause or destroy the instance after the run completes.
- Without --json or --no-follow (human mode): the CLI stays attached, streams logs, and applies lifecycle when the run finishes. Default lifecycle is --pause.
Use separate jl create when you need to inspect GPU availability, reuse machines across runs, or attach filesystems/scripts.
Monitoring & Control
Reading logs
The primary monitoring command:
jl run logs <run_id> --tail 50
Always use --tail N — without it, the entire log file is returned and can be enormous.
The output includes a header and footer with run state (in non-follow, non-JSON mode):
--- run r_abc | machine 123 | running ---
step=100 loss=2.31
step=200 loss=2.11
--- still running | log: /home/jl-runs/r_abc/output.log ---
When done, the footer shows the final state:
--- succeeded | exit code: 0 | log: /home/jl-runs/r_abc/output.log ---
Or on failure:
--- failed | exit code: 1 | log: /home/jl-runs/r_abc/output.log ---
If the instance is paused, missing, or SSH is unavailable, jl run logs fails before printing any output. Use jl run status <run_id> --json to check those states.
The agent monitoring loop
1. Start detached: jl run ... --json --yes, then extract run_id and machine_id from the JSON
2. Early check (catch fast failures): sleep 15 && jl run logs <run_id> --tail 30. If the footer says failed, fix and retry immediately
3. Steady-state polling: sleep 120 && jl run logs <run_id> --tail 50
4. Read log body for loss values, errors, or progress
5. Check footer:
   - still running → repeat step 3
   - succeeded | exit code: 0 → download results
   - failed | exit code: N → read error, fix, start a new run
Cadence: 60-120s (short experiments), 180-300s (long training), 300-600s (very long runs).
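The polling steps above can be sketched as a pair of shell functions. This is an illustration under stated assumptions rather than a reference implementation: the footer patterns are copied from the example footers in this guide, footer_state and wait_for_run are hypothetical helper names, and the loop assumes jl run logs exits non-zero when logs cannot be read (paused or missing instance, SSH unavailable).

```shell
# Classify `jl run logs` output (stdin) by its last "--- ..." line.
# Footer shapes are taken from the example footers in this guide.
footer_state() {
  local footer
  footer=$(grep '^--- ' | tail -n 1)
  case "$footer" in
    "--- still running"*) echo running ;;
    "--- succeeded"*)     echo succeeded ;;
    "--- failed"*)        echo failed ;;
    *)                    echo unknown ;;
  esac
}

# Poll until the run finishes. Returns 0 on success, 1 on failure,
# 2 if logs cannot be read (assumes `jl run logs` exits non-zero then).
wait_for_run() {
  local run_id interval out
  run_id=$1
  interval=${2:-120}   # seconds between polls; pick per the cadence guidance
  while :; do
    out=$(jl run logs "$run_id" --tail 50) || return 2
    printf '%s\n' "$out"
    case $(printf '%s\n' "$out" | footer_state) in
      succeeded) return 0 ;;
      failed)    return 1 ;;
    esac
    sleep "$interval"
  done
}
```

On a return code of 2, jl run status <run_id> --json is the next thing to check, since the log command gives no state in that case.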
Checking status
jl run status <run_id> --json
Returns run state, machine_id, exit_code, lifecycle_policy, launch_command, and more. Without --refresh, jl run list shows state as "saved" (a sentinel, not a real run state). Use --refresh or --status to get live state.
Stopping a run
jl run stop <run_id> --json
Kills the entire process group (training script + all child processes). Escalates to SIGKILL if the process doesn't exit after SIGTERM.
System checks via exec
jl exec <id> -- nvidia-smi
jl exec <id> -- ps -ef
jl exec <id> -- df -h
Prefer raw output for jl exec and jl run logs — easier to read and parse. Use --json when you need machine-readable state: create, get, list, run start, run status.
Exit code of the remote command is propagated. For pipes or shell syntax, wrap in sh -lc:
jl exec <id> -- sh -lc 'grep "loss" /path/to/log | tail -5'
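Since the remote exit code is propagated, jl exec can be used directly in shell conditionals. A minimal sketch, where remote_path_exists is a hypothetical helper and the machine ID and paths in the usage comment are placeholders; note that a non-zero result may also mean the exec itself could not run (for example, on a paused instance):

```shell
# Return 0 if the path exists on the instance, non-zero otherwise.
# Relies on jl exec propagating the remote command's exit code.
remote_path_exists() {
  jl exec "$1" -- test -e "$2"
}

# Usage (placeholders):
# if remote_path_exists 123 /home/results; then
#   jl download 123 /home/results ./results -r
# fi
```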
File Transfer & Persistence
Upload and download
jl upload <id> ./local /remote # upload file or directory
jl download <id> /remote ./local # download file
jl download <id> /remote ./local -r # download directory
Default destinations: upload without dest → remote home directory. Download without dest → ./<basename> in current local directory.
What persists across pause/resume
The remote home directory (/home/ on containers, /home/<user>/ on VMs) persists. Everything else is ephemeral.
Persists:
- Files and directories under the home directory
- $HOME/.venv (shared venv for file runs) and <project>/.venv (per-project venv for directory runs)
- Attached filesystems (mounted at /home/jl_fs/)
- Run metadata under <home>/jl-runs/<run_id>/
Lost on pause:
- System-level installs (apt-get, global pip packages outside the home directory)
- Files outside the home directory (/tmp, /root, etc.)
Use --setup for system-level reinstalls (e.g., apt-get). Python packages in the venv persist across pause/resume. For recurring system setup, use startup scripts (jl scripts add).
Remote file paths
<home> is /home/ on containers, /home/<user>/ on VMs.
- Uploaded files (via jl run): <home>/<filename> (e.g., train.py → /home/train.py)
- Uploaded directories (via jl run): <home>/<directory_name>/
- Uploaded files (via jl upload): <home>/<filename>
- Shared venv (file runs): <home>/.venv/
- Project venv (directory runs): <home>/<directory_name>/.venv/
- Run metadata: <home>/jl-runs/<run_id>/
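The path rules above can be encoded once so scripts do not hard-code /home/. A small sketch with hypothetical helper names; the VM user name is passed in by the caller because this guide does not say how to discover it:

```shell
# Print the remote home directory: /home/ on containers, /home/<user>/ on VMs.
remote_home() {
  case "$1" in
    container) printf '/home/\n' ;;
    vm)        printf '/home/%s/\n' "$2" ;;
    *)         return 1 ;;
  esac
}

# Print the run metadata directory for a run ID under a given home directory.
run_metadata_dir() {
  printf '%sjl-runs/%s/\n' "$1" "$2"
}
```

For example, run_metadata_dir "$(remote_home container)" r_abc prints /home/jl-runs/r_abc/, matching the log paths shown in the footer examples.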
Filesystems & supporting commands
Attach a filesystem at creation with --fs-id <id>. Attach a startup script with --script-id <id> (and --script-args). These flags work on both jl create and jl resume.
jl templates --json # list available templates
jl ssh-key list --json # list registered SSH keys
jl ssh-key add <pubkey-file> --name x # add SSH key (required for VMs)
jl scripts list --json # list startup scripts
jl filesystem list --json # list filesystems
jl filesystem create --name x --storage 100 --json # create filesystem
Filesystem caveats:
- Region-bound: A filesystem created in IN2 is only visible to IN2 instances.
- ID changes on edit: Expanding a filesystem (jl filesystem edit) may return a new fs_id. Always use the returned ID.
- The CLI validates that fs_id exists before creating/resuming, but does not validate region match. Ensure they match yourself.
Agent Workflow (End-to-End)
# 1. Check GPUs and create instance
jl gpus --json
jl create --gpu L4 --storage 50 --yes --json
# 2. Start detached run
jl run . --script train.py --on <machine_id> --requirements requirements.txt --json --yes
# 3. Early check (catch import/syntax/pip failures fast)
sleep 15 && jl run logs <run_id> --tail 30
# 4. Steady-state monitoring (repeat until footer shows succeeded or failed)
sleep 120 && jl run logs <run_id> --tail 50
# 5. Download results (use /home/<user>/ for VMs instead of /home/)
jl download <machine_id> /home/results ./results -r
# 6. Cleanup
jl pause <machine_id> --yes --json
For fresh instances without a pre-created instance:
# Creates instance inline, runs detached — agent must clean up after
jl run . --script train.py --gpu L4 --keep --json --yes
# ... monitor with jl run logs ...
jl pause <machine_id> --yes --json
Error Handling
When --json is active, CLI validation and API failures are emitted as {"error": "..."} to stdout.
Not all non-zero exits use that shape. jl exec --json returns its own structured payload with stdout, stderr, and exit_code fields.
Agent rule:
- First inspect the JSON shape
- If it has an error key, treat it as a CLI failure
- Otherwise inspect command-specific fields (exit_code, state, run_exit_code)
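The agent rule above can be sketched as a first-pass check on any --json payload. This is a rough illustration: classify_jl_json is a hypothetical helper, and it assumes the error payload is an object whose first key is error, as in the {"error": "..."} shape described above. A real JSON parser such as jq is preferable where available.

```shell
# Classify a --json payload read from stdin.
# Prints "cli_error" when the top-level object starts with an "error" key
# (assumed shape), otherwise "command_payload", in which case the caller
# should inspect exit_code, state, or run_exit_code next.
classify_jl_json() {
  if tr -d '\n' | grep -q '^[[:space:]]*{[[:space:]]*"error"[[:space:]]*:'; then
    echo cli_error
  else
    echo command_payload
  fi
}
```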
Anti-Patterns
- Do not use jl run logs --follow: it blocks forever and will time out. --json is also incompatible with --follow.
- Always use --json when starting runs: it returns immediately. Without --json, the CLI streams logs and blocks.
- Do not read full logs without --tail N: this can return megabytes of output.
- Do not poll every few seconds: use 60-600s intervals based on expected run duration.
- Do not use lifecycle flags (--keep, --pause, --destroy) with --on: they are rejected. Only for fresh instances.
- Do not use --pause or --destroy with --json for fresh instances: rejected. Use --keep --json and clean up yourself.
- Do not use jl exec for long-running tasks: it blocks until the command finishes. Use jl run, which runs in the background with log tracking.
- Do not trust jl run list without --refresh: state shows as "saved" (stale). Use --refresh or --status for live state.
- Do not assume machine_id is stable after jl resume: it may return a new ID. Always use the returned ID.
- Do not forget to pause/destroy instances after experiments: they cost money.
Command Discovery
Every command supports --help for full flag details:
jl create --help jl run --help jl ssh-key --help
jl resume --help jl run logs --help jl filesystem --help