name: snakemake-workflow-architecture
description: "Snakemake workflow design for KINTSUGI SLURM processing: lambda resource routing, GPU-only cycle pre-assignment, multi-account scheduling, registration + QC aggregate rules. Trigger: Snakemake rules, workflow config, ruleorder, cycle assignment, DAG design, sentinel files, workflow run/check, registration, VALIS, QC reports."
author: KINTSUGI Team
date: 2026-02-14
Snakemake Workflow Architecture for Multi-Account SLURM
Experiment Overview
| Item | Details |
|---|---|
| Date | 2026-02-12 |
| Goal | Replace submit.sh orchestration with Snakemake while supporting multi-account GPU+CPU scheduling |
| Environment | HiPerGator HPC, Snakemake >= 8.0, snakemake-executor-plugin-slurm, multiple SLURM accounts |
| Status | Implemented |
Context
The original submit.sh (~1500 lines) handled job orchestration, dependency wiring, skip-existing logic, and dual-pool slot calculation. Snakemake can handle all of this declaratively, but the multi-account GPU+CPU architecture requires careful design since Snakemake has no built-in concept of routing jobs to different accounts based on resource type.
Verified Workflow
3 Per-Cycle Rules with Lambda Resources + 1 Aggregate Rule (NOT 6 rules + ruleorder)
Each per-cycle processing step (stitch, deconvolve, edf) is a single rule. Account, partition, and resource allocation are set per-wildcard via lambda functions in the resources: block. Registration is an aggregate rule with static resources (see below).
rule stitch:
    input: ...
    output: "{project}/.snakemake_complete/stitch_cyc{cycle}"
    resources:
        slurm_partition=lambda wc: _assign(wc)["partition"],
        slurm_account=lambda wc: _assign(wc)["account"],
        # gres, not gpus=1: the gpus resource conflicts with SLURM >= 24.11 (see Failed Attempts)
        gres=lambda wc: "gpu:1" if _is_gpu(wc) else "",
        cpus_per_task=lambda wc: 4 if _is_gpu(wc) else CPU_CPUS,
        runtime=lambda wc: RES.get("time_stitch", 240) * (1 if _is_gpu(wc) else CPU_TIME_MULT),
    envmodules: ...
    shell: "KINTSUGI_DEVICE_MODE={params.device_mode} python workflow/scripts/stitch.py ..."
Cycle Pre-Assignment (_build_cycle_assignment()) — GPU-Only
Cycles are assigned to accounts at DAG creation time (not runtime). All cycles route through GPU — CPU fallback was removed because GPU is 15-20x faster (see gpu-only-scheduling skill).
- GPU queue: Each account contributes gpu_slots entries
- Overflow: Round-robin across GPU accounts (still GPU mode, cycles queue in SLURM)
No CPU queue — overflow cycles wait for GPU slots to free up, which is dramatically faster than running on CPU.
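The pre-assignment described above can be sketched as follows. This is an illustration under stated assumptions, not the body of `_build_cycle_assignment()`: the function name `build_cycle_assignment`, the block-wise ordering of the GPU queue, and the shape of the returned records are all hypothetical.

```python
from itertools import cycle as round_robin


def build_cycle_assignment(cycles, accounts):
    """Map each cycle to a GPU account: fill slots first, then round-robin."""
    gpu_accounts = [a for a in accounts if a.get("gpu_slots", 0) > 0]
    if not gpu_accounts:
        raise ValueError("no account with gpu_slots > 0")
    # GPU queue: each account contributes gpu_slots entries.
    queue = [a for a in gpu_accounts for _ in range(a["gpu_slots"])]
    # Overflow: cycles beyond the slot count rotate across the GPU accounts.
    overflow = round_robin(gpu_accounts)
    assignment = {}
    for i, cyc in enumerate(cycles):
        acct = queue[i] if i < len(queue) else next(overflow)
        assignment[cyc] = {
            "account": acct["name"],
            "partition": acct["partition_gpu"],
            "mode": "gpu",  # GPU-only: overflow cycles still run on GPU
        }
    return assignment


accounts = [
    {"name": "clive", "partition_gpu": "hpg-b200,hpg-turin", "gpu_slots": 3},
    {"name": "maigan", "partition_gpu": "hpg-b200,hpg-turin", "gpu_slots": 2},
]
cycles = [f"{n:02d}" for n in range(1, 8)]
result = build_cycle_assignment(cycles, accounts)
```

With 5 slots and 7 cycles, cycles 06 and 07 are overflow: they are still submitted in GPU mode and simply wait in the SLURM queue.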
Sentinel Files for Outputs
Rules use .snakemake_complete/stitch_cyc{cycle} sentinel files because:
- Stitching/deconvolution produce hundreds of files (channels x z-planes)
- EDF produces marker-named files that vary per cycle
- Declaring every output would create an enormous, fragile DAG
A separate validate rule checks that all expected files exist.
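A minimal sketch of the two halves of this approach: a wrapper script writing its sentinel, and a validation step listing expected files that are absent. The helper names `write_sentinel` and `missing_outputs`, and the sample file names, are assumptions for illustration; the real validate rule may check different things.

```python
import tempfile
from pathlib import Path


def write_sentinel(path, **fields):
    """Create the sentinel (and its parent directory) with key=value lines."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(f"{key}={value}\n" for key, value in fields.items()))


def missing_outputs(out_dir, expected_names):
    """Return the expected file names that are absent from out_dir."""
    out_dir = Path(out_dir)
    return [name for name in expected_names if not (out_dir / name).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "stitched"
    out.mkdir()
    (out / "CH1_z001.tif").touch()
    missing = missing_outputs(out, ["CH1_z001.tif", "CH1_z002.tif"])
    sentinel = Path(tmp) / ".snakemake_complete" / "stitch_cyc01"
    write_sentinel(sentinel, skipped=0)
    sentinel_text = sentinel.read_text()
```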
Per-Cycle Pipeline Dependencies
Dependencies flow per-cycle, enabling pipelining:
stitch cyc01 → decon cyc01 → edf cyc01 ─┐
stitch cyc02 → decon cyc02 → edf cyc02 ─┼─→ registration (all cycles)
stitch cyc03 → decon cyc03 → edf cyc03 ─┘
Cycle 1's deconvolution starts the moment cycle 1's stitching finishes.
Rule 4: Registration (Aggregate, No Cycle Wildcard)
Registration (multi-cycle alignment via VALIS) processes ALL cycles in a single SLURM job. Unlike rules 1-3, it has no {cycle} wildcard — resources are static, not lambda-driven.
Rule 5: Signal Isolation (Aggregate, CPU-Only)
Added 2026-02-24. Signal isolation (autofluorescence subtraction) runs after registration, processing all registered channels. CPU-only (numpy/scipy), uses _QC_CPU_ASSIGN for partition routing (same as QC rules). See snakemake-signal-isolation skill for details.
Full pipeline: stitch → decon → edf (per-cycle) → registration → signal_isolation (both aggregate) + 5 QC rules.
# Rule 4: registration (aggregate, static resources)
_REG_ASSIGN = _registration_assignment()  # Computed once at DAG creation
rule registration:
    input: all_edf_sentinels(),  # Depends on ALL EDF cycles completing
    # f-string, not a wildcard pattern: this rule has no wildcards
    output: sentinel=f"{PROJECT}/data/processed/registered/.snakemake_complete",
    resources:
        slurm_partition=_REG_ASSIGN["partition"],  # Static, not lambda
        slurm_account=_REG_ASSIGN["account"],
        gres="gpu:1" if _REG_ASSIGN["mode"] == "gpu" else "",
    script: "scripts/registration.py"
Key design decisions (registration):
- _registration_assignment(): Prefers first account with gpu_slots > 0 (for VggFD feature detector). Falls back to CPU with OrbFD.
- Static resources: No lambda wildcards because there's no {cycle} wildcard to dispatch on.
- Single-cycle handling: If len(CYCLES) < 2, copies EDF → registered without VALIS (no alignment needed).
- Error fallback: On VALIS failure, copies EDF images unchanged (graceful degradation).
- f-string output: Uses f"{PROJECT}/..." since there's no wildcard (same pattern as QC rules).
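The registration assignment policy (first account with GPU slots, otherwise CPU with OrbFD) can be sketched as below. This is a hedged illustration, not the actual `_registration_assignment()`: the function takes the account list as an argument, and the `feature_detector` key in the returned record is an assumption.

```python
def registration_assignment(accounts):
    """Pick the first GPU-capable account; otherwise fall back to CPU."""
    for acct in accounts:
        if acct.get("gpu_slots", 0) > 0:
            return {
                "account": acct["name"],
                "partition": acct["partition_gpu"],
                "mode": "gpu",
                "feature_detector": "VggFD",  # GPU path
            }
    if not accounts:
        raise ValueError("no SLURM accounts configured")
    first = accounts[0]
    return {
        "account": first["name"],
        "partition": first["partition_cpu"],
        "mode": "cpu",
        "feature_detector": "OrbFD",  # CPU fallback
    }


gpu_choice = registration_assignment([
    {"name": "cpuonly", "partition_cpu": "hpg-default", "gpu_slots": 0},
    {"name": "clive", "partition_gpu": "hpg-b200,hpg-turin",
     "partition_cpu": "hpg-default", "gpu_slots": 3},
])
cpu_choice = registration_assignment([
    {"name": "cpuonly", "partition_cpu": "hpg-default", "gpu_slots": 0},
])
```

Since the result is computed once at DAG creation, the rule can reference it as plain values in its resources block.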
Cycle Directory Resolution
_resolve_raw_cycle_dir() handles multiple naming conventions at DAG creation time:
- Long-form: cyc001_reg001_200210_170925
- Short 3-digit: cyc001
- Short 2-digit: cyc01
- Capitalized: Cyc01
Config Format (workflow/config.yaml)
resources:
  accounts:
    - name: clive
      partition_gpu: "hpg-b200,hpg-turin"
      partition_cpu: hpg-default
      gpu_slots: 3
      cpu_slots: 11
    - name: maigan
      partition_gpu: "hpg-b200,hpg-turin"
      partition_cpu: hpg-default
      gpu_slots: 2
      cpu_slots: 8
  total_gpu_slots: 5
  total_cpu_slots: 19
  total_slots: 24
  cpu_utilization_cap: 0.85
  cpu_time_multiplier: 5
  cpu_cpus_per_task: 8
Legacy fallback: If accounts list is missing, reads old account_gpu/account_cpu scalars.
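A sketch of how the legacy fallback might be handled when the config is loaded. Only the key names `accounts`, `account_gpu`, and `account_cpu` come from this document; the function name `load_accounts`, the shape of the synthesized records, and the reuse of `total_gpu_slots`/`total_cpu_slots` for legacy slot counts are assumptions.

```python
def load_accounts(resources):
    """Return the accounts list, synthesizing it from legacy scalars if needed."""
    accounts = resources.get("accounts")
    if accounts:
        return list(accounts)
    legacy = []
    # Legacy configs name one GPU account and one CPU account as scalars.
    for key, slot_key in (("account_gpu", "gpu_slots"), ("account_cpu", "cpu_slots")):
        name = resources.get(key)
        if not name:
            continue
        entry = next((a for a in legacy if a["name"] == name), None)
        if entry is None:
            entry = {"name": name, "gpu_slots": 0, "cpu_slots": 0}
            legacy.append(entry)
        # Assumption: legacy slot counts live in the total_* keys.
        entry[slot_key] = resources.get(f"total_{slot_key}", 0)
    return legacy


modern = load_accounts({"accounts": [{"name": "clive", "gpu_slots": 3}]})
old = load_accounts({
    "account_gpu": "clive",
    "account_cpu": "clive",
    "total_gpu_slots": 3,
    "total_cpu_slots": 11,
})
```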
Registration parameters are also in config.yaml:
registration:
  reference_cycle: 1
  max_image_dim: 2048
  rigid_only: false
  feature_detector: "VggFD"
resources:
  mem_registration: 64000
  time_registration: 120
  cpu_mem_registration: 64000
CLI Commands
| Command | What it does |
|---|---|
| kintsugi workflow config . | Discovers accounts via sacctmgr, generates config + copies Snakefile |
| kintsugi workflow check . | Shows live per-account availability (allocation, in-use, available) |
| kintsugi workflow run . | Submits via Snakemake with auto-calculated -j from live availability |
workflow config always overwrites the Snakefile (so pipeline updates propagate) but only copies scripts/profiles if they don't already exist.
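The copy policy (Snakefile always replaced, scripts copied only when absent) can be sketched as below. The function name `install_workflow` and the template directory layout are assumptions; the actual logic lives in the CLI and may differ.

```python
import shutil
import tempfile
from pathlib import Path


def install_workflow(template_dir, project_dir):
    """Overwrite the Snakefile; copy each script only if the project lacks it."""
    template_dir, project_dir = Path(template_dir), Path(project_dir)
    workflow = project_dir / "workflow"
    workflow.mkdir(parents=True, exist_ok=True)
    # Pipeline logic must propagate, so the Snakefile is always replaced.
    shutil.copy2(template_dir / "Snakefile", workflow / "Snakefile")
    copied = []
    for src in sorted((template_dir / "scripts").glob("*.py")):
        dst = workflow / "scripts" / src.name
        if not dst.exists():  # keep user-customized scripts untouched
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(src.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    template, project = Path(tmp) / "template", Path(tmp) / "project"
    (template / "scripts").mkdir(parents=True)
    (template / "Snakefile").write_text("v2")
    (template / "scripts" / "stitch.py").write_text("packaged")
    (template / "scripts" / "edf.py").write_text("packaged")
    (project / "workflow" / "scripts").mkdir(parents=True)
    (project / "workflow" / "Snakefile").write_text("v1")
    (project / "workflow" / "scripts" / "stitch.py").write_text("customized")
    copied = install_workflow(template, project)
    snakefile_text = (project / "workflow" / "Snakefile").read_text()
    stitch_text = (project / "workflow" / "scripts" / "stitch.py").read_text()
```

Checking existence per file, rather than per directory, is what lets newly added scripts reach existing projects.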
SLURM Profile (workflow/profiles/slurm/config.yaml)
Account and partition are set per-rule in the Snakefile via lambda resources, NOT in the profile. The profile provides:
- executor: slurm
- default-resources (mem_mb, runtime, cpus_per_task)
- latency-wait: 120 (NFS propagation tolerance)
- retries: 2 (automatic retry)
- keep-going: true
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| 6 rules + ruleorder (stitch_gpu > stitch_cpu) | ruleorder always picks the preferred rule; CPU variants NEVER execute | Use lambda resources in a single rule to route per-wildcard |
| --resources gpus=N to limit GPU jobs | Doesn't work with multi-account routing; Snakemake can't track per-account budgets | Bake GPU budget into cycle pre-assignment |
| Runtime cycle assignment | Race conditions, no reproducibility | Pre-assign at DAG creation time for deterministic scheduling |
| Declaring all output files per rule | Hundreds of files per stitch/decon (channels x z-planes), fragile DAG | Use sentinel files + separate validation rule |
| sacctmgr show user USERNAME format=account | Returns empty pipe on HiPerGator | Use sacctmgr show associations user=USERNAME format=account -n -P |
| Including burst accounts (-b suffix) | Burst QOS has unreliable memory; OOM kills | Filter out accounts ending with -b |
| Setting account/partition in profile config.yaml | Same settings for all rules, so GPU vs CPU can't be routed | Must be per-rule via lambda in Snakefile |
| gpus=1 or gpu=1 resource | Both trigger SLURM_TRES_PER_TASK conflict on SLURM >= 24.11 | Use gres="gpu:1" which maps to --gres=gpu:1 |
| slurm_extra="'--gres=gpu:1'" | Plugin blocks --gres in slurm_extra validation | Use gres resource, not slurm_extra |
| No precommand in SLURM profile | srun on compute nodes inherits bare shell: no conda env, cupy not importable | Add precommand: "module load conda && conda activate KINTSUGI" to profile |
| No SLURM_TRES_PER_TASK cleanup | SLURM >= 24.11 sets this env var in GPU jobs; jobstep plugin's srun inherits it and crashes | Patch jobstep plugin __post_init__() to os.environ.pop("SLURM_TRES_PER_TASK", None) |
| CPU fallback for overflow cycles | 15-20x slower: BaSiC fit() bottleneck (500 DCT iterations/z-plane) | Route ALL cycles through GPU; queue in SLURM (see gpu-only-scheduling skill) |
| Hardcoded max_workers=4 in wrapper scripts | Wasted half of 8 allocated CPU cores | Read from snakemake.resources.cpus_per_task dynamically |
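The account-discovery lessons in the table (query associations, drop burst accounts) can be sketched as a small parser. The `-n -P` flags yield headerless, pipe-delimited output, one association per line. The function name `parse_accounts` and the sample output are illustrative assumptions; the real implementation is in hpc.py.

```python
def parse_accounts(sacctmgr_output):
    """Parse `sacctmgr ... format=account -n -P` output into account names.

    Burst accounts (names ending in "-b") are dropped, and duplicates from
    multiple associations are collapsed while preserving first-seen order.
    """
    accounts = []
    for line in sacctmgr_output.splitlines():
        name = line.split("|")[0].strip()
        if name and not name.endswith("-b") and name not in accounts:
            accounts.append(name)
    return accounts


# Hypothetical output; on a cluster it would come from something like
# subprocess.run(["sacctmgr", "show", "associations", f"user={user}",
#                 "format=account", "-n", "-P"], capture_output=True, text=True)
sample = "clive\nclive-b\nmaigan\nmaigan-b\nclive\n"
parsed = parse_accounts(sample)
```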
Per-Channel Skip-Existing (Inside Wrapper Scripts)
Snakemake controls the DAG at the cycle level via sentinel files. If a sentinel is missing, Snakemake reruns the entire cycle. But if a job was interrupted after completing 3 of 4 channels, wrapper scripts now skip completed channels internally:
| Script | Completeness Check | What Counts as "Complete" |
|---|---|---|
| stitch.py | channel_complete(ch) | All z-plane TIFs exist + result_df.pkl for CH1 |
| deconvolve.py | channel_decon_complete(ch) | Decon TIF count >= stitched input TIF count |
| edf.py | channel_edf_complete(ch) | Marker-named output file exists |
Key behaviors:
- Partially-complete channels are fully reprocessed (only skip when ALL expected files exist)
- If ALL channels were already complete, sentinel is still written and job exits 0
- Sentinel files include skipped=N for log visibility
- When the sentinel already exists, Snakemake skips the entire rule (coarser, still works)
# Pattern used in all 3 wrapper scripts:
channels_to_process = []
skipped_channels = []
for ch in CHANNELS:
    if channel_X_complete(ch):
        print(f"  Channel {ch} SKIPPED (...)")
        skipped_channels.append(ch)
    else:
        channels_to_process.append(ch)
# Process only incomplete channels
successful = sum(1 for _, ok in results if ok) + len(skipped_channels)
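As an example of one completeness check, the deconvolution criterion (decon TIF count >= stitched input TIF count) might look like the sketch below. Note the signature differs from the `channel_decon_complete(ch)` named in the table: directories are passed explicitly here so the example is self-contained, and the directory layout and file names are assumptions.

```python
import tempfile
from pathlib import Path


def count_tifs(directory):
    """Count .tif/.tiff files directly inside directory (0 if it is missing)."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for p in directory.iterdir() if p.suffix.lower() in {".tif", ".tiff"})


def channel_decon_complete(stitched_dir, decon_dir):
    """Complete only when every stitched input plane has a decon output."""
    n_inputs = count_tifs(stitched_dir)
    return n_inputs > 0 and count_tifs(decon_dir) >= n_inputs


with tempfile.TemporaryDirectory() as tmp:
    stitched, decon = Path(tmp) / "stitched", Path(tmp) / "decon"
    stitched.mkdir()
    decon.mkdir()
    for z in range(3):
        (stitched / f"z{z:03d}.tif").touch()
    for z in range(2):
        (decon / f"z{z:03d}.tif").touch()
    partial = channel_decon_complete(stitched, decon)   # 2 of 3 planes written
    (decon / "z002.tif").touch()
    complete = channel_decon_complete(stitched, decon)  # all 3 planes written
```

Requiring at least one input plane keeps an empty stitched directory from being reported as complete.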
QC Report Rules (Aggregate)
Added 2026-02-13. Three aggregate rules produce rich QC reports after each processing stage completes across all cycles:
rule qc_stitch:
    input:
        stitch_done=_all_stitch_sentinels(),
    output:
        sentinel=f"{PROJECT}/qc_plots/.snakemake_complete_stitch",
    params:
        stage="stitch", ...
    resources:
        mem_mb=RES.get("mem_qc", 64000),
        slurm_partition=_REG_ASSIGN["partition"],
        slurm_account=_REG_ASSIGN["account"],
        gres="gpu:1" if _REG_ASSIGN["mode"] == "gpu" else "",
    script: "scripts/qc_report.py"
Key Design Decisions
| Decision | Rationale |
|---|---|
| Aggregate (all cycles), not per-cycle | QC heatmaps need cross-cycle comparison — per-cycle would miss the point |
| Single dispatch script (qc_report.py) | Avoids duplicating GPU init, path setup, and caching logic across 3 scripts |
| Reuses _REG_ASSIGN from registration | QC rules have no cycle wildcard, can't use _assign(wc) (same pattern as registration) |
| Cross-stage cache pickles (cache/stitch_stats.pkl) | Each stage loads prior stage's stats for comparison plots |
| Included in rule all via all_qc_sentinels() | QC runs automatically after processing; rule qc is standalone shortcut |
| matplotlib.use("Agg") before Kprocess import | Compute nodes have no display; must set headless backend first |
| f-string outputs (not wildcard) | QC sentinels use f"{PROJECT}/..." since there's no cycle wildcard |
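The cross-stage cache in the table can be illustrated with a small pickle round trip. Only the cache/stitch_stats.pkl path comes from this document; the helper names, the `{stage}_stats.pkl` naming for other stages, and the structure of the stats dict are assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def save_stage_stats(cache_dir, stage, stats):
    """Write one stage's summary statistics to <cache_dir>/<stage>_stats.pkl."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{stage}_stats.pkl"
    with path.open("wb") as fh:
        pickle.dump(stats, fh)
    return path


def load_stage_stats(cache_dir, stage):
    """Return a prior stage's statistics, or None if that stage has not run."""
    path = Path(cache_dir) / f"{stage}_stats.pkl"
    if not path.is_file():
        return None
    with path.open("rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    save_stage_stats(cache, "stitch", {"cyc01": {"mean": 1.5}})
    prior = load_stage_stats(cache, "stitch")  # what a later stage would load
    absent = load_stage_stats(cache, "decon")  # stage without a cache yet
```

Returning None for a missing cache lets a later stage skip its comparison plots instead of failing.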
Config Resources
resources:
  mem_qc: 64000   # 64 GB
  time_qc: 120    # 2 hours
What Snakemake Replaces vs Keeps
| Replaced by Snakemake | Kept from submit.sh |
|---|---|
| Orchestration & dependency wiring | Python processing scripts (workflow/scripts/*.py) |
| .complete marker polling | KINTSUGI_DEVICE_MODE env var pattern |
| --array limit calculation | Device-adaptive backends (CuPy/NumPy) |
| Dual-pool slot calculation (shared via hpc.py) | Per-channel skip-existing (inside wrapper scripts) |
Both systems produce files in the same data/processed/ tree. They can coexist but should NOT run simultaneously on the same project.
Registration is only available via Snakemake (not in submit.sh). It runs as a single aggregate job after all per-cycle EDF processing completes.
Key Insights
- Lambda resources are the key mechanism: Snakemake has no built-in multi-account concept, but lambda functions in resources: give per-job control
- Pre-assignment beats runtime scheduling: Deterministic, reproducible, and avoids race conditions
- Sentinel files are a pragmatic compromise: Trade strict output tracking for a manageable DAG
- Per-channel skip-existing complements sentinel-level control: Snakemake handles cross-rule dependencies; wrapper scripts prevent re-doing completed work within a single interrupted job
- sacctmgr show associations is the correct SLURM query: show user format doesn't work on all clusters
- GPU-only scheduling: CPU is 15-20x slower for BaSiC; overflow cycles queue for GPU instead (see gpu-only-scheduling skill)
- Dynamic worker counts: Wrapper scripts read cpus_per_task from Snakemake resources, not hardcoded 4
- Always overwrite the Snakefile: Pipeline logic changes must propagate; user customization goes in config.yaml, not the Snakefile
- precommand activates the existing conda env: Compute nodes need module load conda && conda activate KINTSUGI to access cupy, torch, etc.
- SLURM_TRES_PER_TASK must be unset for jobstep srun: SLURM >= 24.11 bug; patch survives until upstream fix in snakemake-executor-plugin-slurm-jobstep
- GPU resource: use gres="gpu:1": Maps to --gres=gpu:1; both gpus and gpu resources trigger SLURM 24.11 TRES conflicts
- Aggregate rules use static assignment, not lambdas: Registration and QC rules have no {cycle} wildcard, so use _REG_ASSIGN (from _registration_assignment()) computed once at DAG creation
- Per-file script copy in workflow config: New scripts auto-propagate to projects without overwriting user-customized scripts
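The dynamic worker count insight can be sketched as follows. Scripts run through Snakemake's script: directive receive a global `snakemake` object; the helper name `resolve_max_workers`, the fallback to the SLURM_CPUS_PER_TASK environment variable, and the default of 4 are assumptions for illustration.

```python
import os
from types import SimpleNamespace


def resolve_max_workers(snakemake_obj=None, default=4):
    """Prefer Snakemake's cpus_per_task, then SLURM's env var, then a default."""
    if snakemake_obj is not None:
        cpus = getattr(snakemake_obj.resources, "cpus_per_task", None)
        if cpus:
            return int(cpus)
    env_value = os.environ.get("SLURM_CPUS_PER_TASK")
    return int(env_value) if env_value else default


# Stand-in for the object Snakemake injects into script: jobs.
fake_snakemake = SimpleNamespace(resources=SimpleNamespace(cpus_per_task=8))
workers = resolve_max_workers(fake_snakemake)
```

Inside a wrapper script, the call would pass the injected `snakemake` object so the worker pool always matches the job's allocation.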
References
- KINTSUGI CLAUDE.md - "Snakemake Workflow" and "Multi-Account Architecture" sections
- gpu-only-scheduling skill - Why CPU fallback was removed (15-20x performance data)
- slurm-concurrent-processing skill - Dual-pool resource calculation (submit.sh version, historical)
- slurm-workflow-integration skill - Original SLURM CLI integration
- workflow/Snakefile - Implementation
- src/kintsugi/hpc.py - detect_multi_account_resources(), detect_live_multi_account()
- src/kintsugi/cli.py - workflow config/check/run commands