name: nemotron-3-ultra-text2sql-lora description: >- Run the Nemotron-3 Ultra Text2SQL LoRA fine-tuning tutorial (NeMo Megatron-Bridge) end-to-end for the user on their SLURM cluster: data prep, distributed checkpoint conversion, and packed LoRA fine-tuning of the 550B hybrid Mamba-Transformer MoE, ending at a saved adapter. Use when the user wants to run this cookbook, fine-tune Nemotron-3 Ultra with LoRA, or adapt the notebook to their own cluster.
Nemotron-3 Ultra Text2SQL LoRA — runbook for a coding agent
This skill helps you run the cookbook in this directory (mbridge_lora_cookbook.ipynb) on the
user's behalf. The notebook is generic and ships with placeholders; your job is to gather the
user's environment details, fill them in, launch the SLURM jobs, watch them, and report results.
What the tutorial does
Three steps, in order, each a SLURM job:
- Data prep — builds a BIRD Text2SQL
training.jsonlfrom both the no-reasoning and reasoning splits, formatted with Ultra's tokenizer/chat template. Short CPU job. - Convert — distributed import of the Hugging Face base checkpoint into Megatron-Bridge format. A multi-node GPU job (CPU import is not feasible for a 550B model).
- LoRA fine-tune — packed-sequence LoRA training on the prepared data; saves a LoRA adapter. A multi-node GPU job.
What you must understand before running
- Ultra is a 550B-total / A55B-active hybrid Mamba-Transformer MoE. It does not fit on one
node, so every heavy step is a multi-node SLURM job submitted with
sbatchand run in a container via Pyxis/enroot. Run everything from a cluster login node wheresbatch/squeue/sacctare available. - Scale. At the shipped parallel settings, both convert and train need 48 GPUs. Node count
is derived automatically as
48 / GPUS_PER_NODE(e.g. 12 nodes at 4 GPUs/node). The user's QOS must permit a job of that size — an interactive or small-node-capped QOS will not work. - Single config. Everything is driven by one file,
config.env, which the notebook's setup cell generates from the values you fill in. Every step and everyslurm/*.sbatchscript sources it. You can run the notebook cell, or writeconfig.envdirectly with the same keys. - One output root.
WORKSPACEis the single output root; everything generated lands under$WORKSPACE/{base, dataprep, trained, cache/hf, logs}. The base checkpoint (HF_MODEL_PATH) is the only separate, read-only path. - The rhythm per step: a launch cell submits the job, a re-runnable check cell shows
status (
sacct/squeue), and a sanity cell confirms the expected output exists before you move on. Follow this loop; don't skip the sanity check.
Information to gather from the user
Before launching anything, ask the user for the following and confirm the prerequisites. Don't guess these — a wrong value wastes a large multi-node allocation. Prefer asking all of them up front in one batch.
How to reach the cluster
- How do you connect to the login node where SLURM jobs are submitted (e.g. the ssh host)?
- Is there a separate data-transfer host you prefer for large file moves?
SLURM settings
- SLURM account to charge.
- GPU partition and a QOS that allows a multi-node job of
48 / GPUS_PER_NODEnodes (not an interactive or small-node-capped QOS). Confirm the wall-clock limit is enough (convert is short; training is well under a couple of hours by default). - CPU partition and QOS for the short data-prep job.
- GPUs per node on the target nodes (the tutorial targets GB200 at 4 GPUs/node; the node count derives from this).
Paths (all on a shared filesystem the compute nodes can mount)
WORKSPACE— the output root to create/use.HF_MODEL_PATH— where the already-downloaded Ultra base checkpoint lives (read-only input). The tutorial does not download the base model; confirm it is present.- The shared-filesystem root to bind-mount into the container (must contain both
WORKSPACEandHF_MODEL_PATH).
Container & credentials
- The container image to use (path to a prepared image or a registry reference). The notebook ships a placeholder; this must be filled with a real Ultra-capable image.
- A Hugging Face token so BIRD can be downloaded during data prep. The tutorial expects it at
${WORKSPACE}/cache/hf/token; ask the user to place it there (or provide it so you can), and reference it by path — never print or echo a token.
If the user has an environment-reference document for their cluster, ask for it first and pull these values from there instead of asking one by one.
How to run it
- From the login node,
cdinto this cookbook directory (it must be on the shared filesystem). - Fill the config: either edit the notebook's Environment & SLURM Setup cell and run it, or
write
config.envdirectly with the values gathered above. The setup cell has a guard that refuses to proceed while any placeholder (<...>) remains — make sure none are left. - Run the three steps in order. For each: submit via the launch cell/
sbatch, poll the check cell until the job reachesCOMPLETED, then run the sanity cell. - Poll, don't block. These are long-running multi-node jobs. Submit, then check back
periodically with
sacct/squeue— do not hold an interactive session open waiting, and do not stream logs live.
Verifying success per step
- Data prep:
$WORKSPACE/dataprep/training.jsonlexists and has many rows; a sampled record shows the Nemotron-3 chat template. - Convert:
$WORKSPACE/base/latest_checkpointed_iteration.txtplus aniter_*checkpoint dir exist. - Train: under
$WORKSPACE/trained/<experiment-name>/there is alatest_checkpointed_iteration.txtand aniter_*adapter checkpoint; the training log shows the loss trending down and ends with aLORA_TRAIN_DONEmarker.
Report per-step status and elapsed time (from sacct) and the final training loss.
Things already handled — do not change them
- Synchronous checkpoint saving is set on purpose (
async_save=False). Under some container runtimes the async-save path can hang; leave it as configured. - Parallelism / resharding. Convert uses one tensor-parallel layout and train uses another; only
the tensor-parallel degree differs, so the converted checkpoint reshards cleanly on load. Don't
retune these unless you change
GPUS_PER_NODE, in which case keep the world size at 48 GPUs. - Packed sequences and the LoRA target modules (including the Mamba projections) come from the Ultra recipe — no need to configure them.
- Steps are idempotent: data prep skips if
training.jsonlexists; convert skips if the checkpoint already exists. Safe to re-run.
Expected friction (so you don't misread it)
- The first training iteration is slow — graph capture and MoE warmup can take on the order of ~15 minutes with no log output and the GPUs at 100%. This is normal; do not cancel the job. Later iterations are fast.
- A multi-node GPU job that, in the rare case, sits at "loading distributed checkpoint" with zero progress for far longer than the warmup window can be cancelled and resubmitted; a fresh allocation usually clears it.
- Benign noise in the convert log (framework stack-trace fragments, bare NCCL version lines) is not a crash — judge success by the job state and the sanity check, not by log chatter.