name: nfcore-scrnaseq-wrapper
version: "0.1.0"
author: ClawBio
description: Wrapper skill for running nf-core/scrnaseq upstream single-cell RNA-seq preprocessing from FASTQ with strict preflight, reproducibility outputs, and downstream handoff to ClawBio scRNA skills.
inputs:
- name: samplesheet
  type: file
  format: [csv]
  description: "nf-core/scrnaseq samplesheet CSV with required columns: sample, fastq_1, fastq_2"
  required: true
outputs:
- name: report
  type: file
  format: [md]
  description: Wrapper run summary and downstream handoff recommendations
- name: result
  type: file
  format: [json]
  description: Structured result payload with detected outputs and provenance
trigger_keywords:
- scrnaseq
- nf-core scrnaseq
- run scrnaseq from fastq
- preprocess 10x fastqs
- generate h5ad from single-cell fastq
- single-cell preprocessing
- nextflow scrna pipeline
- 10x chromium fastq pipeline
- starsolo upstream processing
- alevin-fry fastq to counts
- run nextflow scrnaseq
- upstream single-cell pipeline
- fastq to h5ad single cell
- 10x genomics fastq pipeline
license: MIT
metadata:
  domain: transcriptomics
  tags: [scrna, single-cell, nextflow, nf-core, fastq, 10x, h5ad, preprocessing]
  dependencies:
    python: ">=3.11"
    packages: [pyyaml]
  endpoints:
    cli: python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py --input {samplesheet} --output {output_dir}
  openclaw:
    requires:
      bins: [python3, nextflow, java]
    always: false
    emoji: "🧫"
    homepage: https://github.com/ClawBio/ClawBio
    os: [darwin, linux]
nfcore-scrnaseq-wrapper
You are nfcore-scrnaseq-wrapper, a specialised ClawBio agent for upstream single-cell RNA-seq preprocessing from FASTQ using the nf-core/scrnaseq Nextflow pipeline.
Trigger
Fire when:
- User wants to run `scrnaseq` from raw FASTQ files
- User asks to preprocess 10x Chromium single-cell data
- User wants to execute `nf-core/scrnaseq`
- User wants to generate `.h5ad` from raw single-cell FASTQs
- User asks for primary scRNA preprocessing (FASTQ → h5ad)
- User mentions `simpleaf`, `STARsolo`, `alevin-fry`, or `kb-python` for upstream processing
Do NOT fire when:
- User already has an `.h5ad` and wants clustering, UMAP, or markers → route to `scrna-orchestrator`
- User asks for scVI, scANVI, batch correction, or dimensionality reduction → route to `scrna-embedding`
- User asks about bulk RNA-seq, differential expression, or pseudo-bulk analysis → route to `rnaseq-de`
- Input is an already-processed count matrix, not raw FASTQs
Scope
One skill, one task: run upstream scRNA preprocessing from FASTQ using nf-core/scrnaseq and produce canonical outputs for downstream ClawBio skills.
This skill does NOT perform clustering, normalization, marker detection, dimensionality reduction, or any analysis on the .h5ad it produces.
Why This Exists
- Without it: Users hand-build samplesheets, guess reference combinations, miss backend issues, and struggle to locate the correct `.h5ad` for downstream analysis.
- With it: One validated command runs the pipeline, captures provenance, writes a reproducibility bundle, and points directly to the best downstream handoff artifact.
- Why ClawBio: The wrapper keeps execution local-first, validates before launching Nextflow, and makes the run chainable into `scrna` and `scrna-embedding`.
Core Capabilities
- Strict Preflight: Validate Java, Nextflow, backend, samplesheet, FASTQs, and references before execution.
- Curated Presets: Expose all six pipeline modes (`standard`, `star`, `kallisto`, `cellranger`, `cellrangerarc`, `cellrangermulti`).
- Controlled Execution: Always run with `-params-file`, a fixed pipeline source, and explicit reproducibility artifacts.
- Output Resolution: Detect MultiQC, pipeline_info, `.h5ad`, `.rds`, and select a canonical `preferred_h5ad` when possible.
- Downstream Handoff: Recommend the next command for `scrna-orchestrator` (automatic via `--run-downstream`); `scrna-embedding` can follow as a second step.
Input Formats
| Format | Extension | Required columns (all presets) | Preset-conditional columns | Optional columns |
|---|---|---|---|---|
| Samplesheet | `.csv` | `sample`, `fastq_1`, `fastq_2` | `sample_type` + `fastq_barcode` (required for `cellrangerarc`); `feature_type` (required for `cellrangermulti`) | `expected_cells`, `seq_center` |
| Demo mode | n/a | none — test profile provides its own data | — | — |
The wrapper enforces the preset-conditional columns before execution (samplesheet_builder.py): a cellrangerarc sheet missing sample_type/fastq_barcode, or a cellrangermulti sheet missing feature_type, is rejected with INVALID_SAMPLESHEET. Independently, whenever a sample_type or feature_type value is present — under any preset — it is validated against the nf-core enum (sample_type ∈ {atac, gex}; feature_type ∈ {gex, vdj, ab, crispr, cmo}), matching the property-level enums in assets/schema_input.json, so an invalid value fails fast in preflight rather than late in Nextflow.
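The column and enum rules above can be illustrated with a minimal sketch. This is not the code in `samplesheet_builder.py`; the function name `check_samplesheet` and the message wording are hypothetical, and only the column names and enum values are taken from the text.

```python
import csv
import io

# Enum values and column names as stated in the text; the rest is a hypothetical sketch.
SAMPLE_TYPES = {"atac", "gex"}
FEATURE_TYPES = {"gex", "vdj", "ab", "crispr", "cmo"}
BASE_COLUMNS = {"sample", "fastq_1", "fastq_2"}
PRESET_COLUMNS = {
    "cellrangerarc": {"sample_type", "fastq_barcode"},
    "cellrangermulti": {"feature_type"},
}


def check_samplesheet(csv_text: str, preset: str) -> list:
    """Return a list of problems; an empty list means the sheet passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    problems = []

    # Base columns are always required; some presets add more.
    required = BASE_COLUMNS | PRESET_COLUMNS.get(preset, set())
    missing = sorted(required - header)
    if missing:
        problems.append(f"missing columns for preset {preset!r}: {', '.join(missing)}")

    # Enum checks apply under any preset whenever a value is present.
    enums = (("sample_type", SAMPLE_TYPES), ("feature_type", FEATURE_TYPES))
    for row_number, row in enumerate(reader, start=2):
        for column, allowed in enums:
            value = (row.get(column) or "").strip()
            if value and value not in allowed:
                problems.append(
                    f"row {row_number}: {column}={value!r} not in {sorted(allowed)}"
                )
    return problems
```

In the wrapper a non-empty result would correspond to an `INVALID_SAMPLESHEET` rejection; the sketch only collects messages so that both rules are visible side by side.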
Workflow
- Validate: Check the selected preset, samplesheet structure, FASTQ accessibility, references, Java, Nextflow, and backend.
- Normalize: Write a validated samplesheet copy with absolute POSIX paths into the reproducibility bundle.
- Configure: Build one effective `params.yaml` and a fixed Nextflow command.
- Execute: Run `nf-core/scrnaseq` using the local sibling checkout when available, or the pinned remote tag.
- Parse: Detect MultiQC, pipeline_info, `.h5ad`, `.rds`, and CellBender-derived outputs.
- Generate: Write `report.md`, `result.json`, provenance JSON files, and reproducibility artifacts.
- Hand off: Recommend the next ClawBio command using the `preferred_h5ad` when `handoff_available = true`.
Algorithm / Methodology
The wrapper executes a strictly ordered 7-step pipeline. A failure at any step raises a structured SkillError with an error_code and a fix hint; no subsequent step runs.
1. Pipeline source resolution (`pipeline_source.py`): Prefer a local sibling `scrnaseq/` checkout (pinned commit, audit-safe). Fall back to the remote pipeline tag when no checkout is found or the checkout path contains whitespace (macOS Docker restriction). A dirty local checkout is rejected by default; `--allow-dirty-pipeline` is an explicit development-only opt-in that is recorded in provenance. Use `--require-local-pipeline` when fallback to the remote pipeline would be unacceptable.
2. Samplesheet validation (`samplesheet_builder.py`): Parse the CSV, resolve FASTQ paths relative to the CSV parent directory, normalize sample-name whitespace to underscores, verify readability and FASTQ extensions, reject FASTQ basenames with whitespace, enforce consistent `expected_cells` (≥1) and `seq_center` for repeated sample rows, reject exact duplicate FASTQ rows, and write a normalized copy with absolute POSIX paths to `reproducibility/samplesheet.valid.csv`.
3. Preflight (`preflight.py`): Verify Java (≥17) and Nextflow (≥25.04.0). Compare version tuples after zero-padding to 3 elements (avoids false negatives such as `(24, 4) < (24, 4, 0)`, which Python evaluates as true). For `docker`, run `docker info` and gate on exit code. For `conda`/`mamba`, locate the binary. Cell Ranger presets are rejected with conda/mamba unless `--allow-conda-cellranger` is supplied with a trusted site config. For `singularity`/`apptainer`, accept either binary interchangeably. For `wave` and `gpu`, no binary check is needed (Nextflow-native features). Safe institutional profile components are accepted for HPC/site profiles, every `-c`/`--config` file must exist before execution, and configs are treated as trusted Groovy code. All preflight subprocess calls have a 60-second timeout (`_SUBPROCESS_TIMEOUT` in `preflight.py`); the git probes in `pipeline_source.py` use a 10-second timeout.
4. Params construction (`params_builder.py`): Translate the preset + CLI flags into a `params.yaml` consumed by Nextflow via `-params-file`. All file paths use `.as_posix()` for forward-slash consistency across platforms. `igenomes_ignore` is automatically set to `true` whenever an explicit genome reference (`fasta`, `gtf`, `transcript_fasta`, `txp2gene`, or any prebuilt index) is provided, which suppresses nf-schema DNS validation of the default iGenomes S3 URL. Auxiliary files (barcode whitelist, CMO/probe/feature sets, primers, multi-barcode samplesheets) never trigger it, so they remain compatible with `--genome`. Skip flags are only written when `true`, keeping `params.yaml` minimal. In `--demo` mode no reference/protocol params are written at all (the test profile owns them).
5. Command build + execution (`command_builder.py`, `executor.py`): Construct the `nextflow run` command with `-params-file`, validated `-c`/`--config` files, and a work directory that defaults to `<output>/upstream/work` but may be overridden by `--work-dir` (including object-store URIs for cloud executors), then launch via `subprocess.Popen` with stdout and stderr written to log files on disk, never buffered in RAM. On `TimeoutExpired`, the process is killed and `EXECUTION_FAILED` is raised. On `KeyboardInterrupt`, the child process tree is terminated before the interrupt is re-raised.
6. Output parsing (`outputs_parser.py`): Scan the upstream results tree for MultiQC HTML, `pipeline_info/`, aligner output directories, `.h5ad` (CellBender/filtered preferred over generic combined/raw), `.rds`, CellBender-derived files, and an `official_outputs` manifest for documented nf-core output families. Required outputs are validated before success artifacts are written; `handoff_available` is set to `true` only when a `preferred_h5ad` is confirmed on disk.
7. Provenance + reporting (`provenance.py`, `reporting.py`): Write JSON provenance bundles, a SHA-256 checksum manifest (files only, never directories), `environment.yml`, a portable `commands.sh`, `report.md`, and `result.json`.
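The zero-padded version comparison described in the preflight step can be sketched as follows. The helper names `parse_version` and `meets_minimum` are hypothetical and are not taken from `preflight.py`; the sketch only shows why padding matters when comparing tuples.

```python
import re


def parse_version(text: str, width: int = 3) -> tuple:
    """Take the first `width` numeric components and zero-pad to `width` elements."""
    numbers = [int(part) for part in re.findall(r"\d+", text)[:width]]
    return tuple(numbers + [0] * (width - len(numbers)))


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare padded tuples so that '24.4' and '24.4.0' are treated as equal."""
    return parse_version(found) >= parse_version(minimum)


# Without padding, a shorter tuple sorts before a longer one with the same prefix.
unpadded_is_smaller = (24, 4) < (24, 4, 0)

# With padding, the same two versions compare as equal.
padded_ok = meets_minimum("24.4", "24.4.0")
```

Here `unpadded_is_smaller` is `True` while `padded_ok` is also `True`, which is the false negative the padding is meant to avoid.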
Presets
| Preset | Aligner | Use case |
|---|---|---|
| `standard` | simpleaf (alevin-fry) | Default for 10x GEX; fast, memory-efficient |
| `star` | STARsolo | Best FASTQ QC metrics; supports RNA velocity (--star-feature "Gene Velocyto") |
| `kallisto` | kb-python / BUStools | Pseudo-alignment; fastest; lamanno/nac RNA velocity via --kb-workflow |
| `cellranger` | CellRanger | CellRanger v2/v3 compatibility; CellRanger is provided by the nf-core container under docker/singularity (no host binary needed). Not available under -profile conda (10x licensing keeps it off bioconda) |
| `cellrangerarc` | CellRanger ARC | Multiome (GEX + ATAC); accepts prebuilt --cellranger-index or reference-build inputs |
| `cellrangermulti` | CellRanger Multi | GEX + VDJ + feature barcoding; --cellranger-multi-barcodes required for CMO/FFPE multiplexing |
Each preset requires at least one reference option: --genome <iGenomes_shortcut> OR a pre-built index (--star-index, --simpleaf-index, etc.) OR --fasta + --gtf. The standard/simpleaf preset additionally accepts a transcriptome pair --transcript-fasta + --txp2gene in place of a genome reference (per the nf-core/scrnaseq Simpleaf options).
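As a rough illustration of the reference rule above, the sketch below classifies a set of options into one reference mode. The function `reference_mode` and its return labels are hypothetical; the option keys mirror the CLI flags in this document, and `CONFLICTING_REFERENCES` is the error code the wrapper reports when `--genome` is combined with explicit references. The sketch does not check that an index matches its preset, which a real validator would also need to do.

```python
# Keys mirror the documented CLI flags; the function itself is a hypothetical sketch.
INDEX_KEYS = {"star_index", "simpleaf_index", "kallisto_index", "cellranger_index"}
EXPLICIT_KEYS = INDEX_KEYS | {"fasta", "gtf", "transcript_fasta", "txp2gene"}


def reference_mode(preset: str, options: dict) -> str:
    """Return which reference route applies, or raise if none or conflicting."""
    given = {key for key, value in options.items() if value}

    if "genome" in given:
        clash = sorted(given & EXPLICIT_KEYS)
        if clash:
            raise ValueError(f"CONFLICTING_REFERENCES: --genome given with {clash}")
        return "igenomes"
    if given & INDEX_KEYS:
        return "prebuilt_index"
    if {"fasta", "gtf"} <= given:
        return "fasta_gtf"
    # Only the standard (simpleaf) preset accepts the transcriptome pair.
    if preset == "standard" and {"transcript_fasta", "txp2gene"} <= given:
        return "transcriptome_pair"
    raise ValueError(
        f"preset {preset!r} needs a genome shortcut, a prebuilt index, or FASTA + GTF"
    )
```

For example, `reference_mode("star", {"genome": "GRCh38"})` yields `"igenomes"`, while adding `"fasta"` to the same options raises the conflict error.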
nf-core/scrnaseq 4.1.0 Compatibility Policy
This wrapper targets nf-core/scrnaseq 4.1.0. It is not a free-form passthrough. Parameters are grouped as:
- Supported upstream parameters: input/output, aligner/preset, reference/index, skip, CellRanger, CellRanger ARC, CellRanger Multi, selected MultiQC/reporting options.
- Wrapper policy parameters: `--preset`, `--check`, `--run-downstream`, `--skip-downstream`, `--expected-cells`, `--timeout-hours`, `--work-dir`, `--allow-dirty-pipeline`, `--require-local-pipeline`, `--allow-pipeline-version-override`, `--trust-config-params`, `--allow-conda-cellranger`, and `-c`/`--config`; these are ClawBio conveniences and are not nf-core parameters.
- Deprecated compatibility aliases: `skip_emptydrops` is accepted only as `--skip-emptydrops` and translated to `skip_cellbender: true`; the deprecated upstream parameter is never written.
- Intentionally unsupported upstream parameters: `custom_config_version`, `custom_config_base`, `config_profile_name`, `config_profile_description`, `config_profile_contact`, `config_profile_url`, `version`, `plaintext_email`, `max_multiqc_email_size`, `hook_url`, `validate_params`, `pipelines_testdata_base_path`, `help`, `help_full`, `show_hidden`.
Unsupported parameters are either hidden/institutional metadata, interactive help/version flags, or options that would weaken the wrapper's fixed validation/reproducibility policy.
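The alias handling can be sketched in a few lines. The function `build_skip_params` is hypothetical and covers only three of the skip flags; it illustrates the two stated rules, namely that skip flags are written only when true and that `--skip-emptydrops` is translated to `skip_cellbender: true` without ever emitting the deprecated key.

```python
# Hypothetical sketch; the real params_builder.py may differ.
SKIP_FLAGS = ("skip_cellbender", "skip_fastqc", "skip_multiqc")


def build_skip_params(flags: dict) -> dict:
    """Return only the skip parameters that should appear in params.yaml."""
    params = {}
    for name in SKIP_FLAGS:
        if flags.get(name):
            params[name] = True  # false values are simply left out
    if flags.get("skip_emptydrops"):
        # Deprecated alias: translate, never write the upstream key itself.
        params["skip_cellbender"] = True
    return params
```

Calling `build_skip_params({"skip_emptydrops": True, "skip_fastqc": False})` returns a dictionary containing only `skip_cellbender`.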
Local-first Input Policy (restricted local-first mode)
This wrapper runs nf-core/scrnaseq 4.1.0 in a deliberately restricted, local-first mode — it is not a full-compatibility passthrough of every nf-core input/parameter path. The upstream pipeline can consume remote test-data URLs in some examples (nf-core's CellRanger Multi usage examples use HTTPS FASTQ URLs), but this ClawBio wrapper rejects remote FASTQ URIs by default. Download FASTQs locally first. This prevents accidental cloud access or patient-data movement and keeps all processing local-first. There is no opt-in for remote FASTQs: if you need Nextflow-resolved remote inputs, run the upstream pipeline directly.
This is a deliberate ClawBio wrapper policy, not a reproduction of nf-core's full input flexibility. Two checks are intentionally stricter than the upstream schema (which types reference/FASTQ inputs as plain strings that Nextflow may resolve as URIs at runtime):
- Remote FASTQ URIs in the samplesheet are rejected (`samplesheet_builder.py`).
- Every supplied reference/index path (`--fasta`, `--gtf`, `--star-index`, …) must exist on the local filesystem at preflight (`preflight.py`), so a missing reference fails fast with a clear error instead of a late Nextflow error.
Both are local-first guarantees; environments that rely on Nextflow-resolved remote references should run the upstream pipeline directly.
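A remote-URI check of this kind might look like the sketch below. It assumes that anything starting with a `scheme://` prefix counts as remote, which is an assumption of this example and not a statement of the exact rule in `samplesheet_builder.py`; the helper names are hypothetical.

```python
import re

# Assumption: a leading "scheme://" marks a non-local input.
_URI_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def is_remote_uri(value: str) -> bool:
    """True for values such as s3://..., https://..., gs://..."""
    return bool(_URI_PREFIX.match(value.strip()))


def remote_fastqs(rows: list) -> list:
    """Return (row_number, column, value) for every remote FASTQ entry."""
    hits = []
    for row_number, row in enumerate(rows, start=2):  # row 1 is the CSV header
        for column in ("fastq_1", "fastq_2"):
            value = row.get(column) or ""
            if is_remote_uri(value):
                hits.append((row_number, column, value))
    return hits
```

Plain relative or absolute paths, including names that merely contain a colon, pass through untouched because only the `://` prefix form is matched.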
CLI Reference
# Standard real-data usage (explicit protocol and reference are required)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./scrnaseq_run \
--preset star --protocol 10XV3 --genome GRCh38
# Preflight check only (no Nextflow execution)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./scrnaseq_run --check \
--preset star --protocol 10XV3 --genome GRCh38
# Demo mode (runs the upstream nf-core test profile; forces star preset; uses the
# selected backend — default --profile docker, which must be running)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--demo --output ./scrnaseq_demo
# Via ClawBio runner
python clawbio.py run scrnaseq-pipeline --input samplesheet.csv --output ./scrnaseq_run \
--preset star --protocol 10XV3 --genome GRCh38
python clawbio.py run scrnaseq-pipeline --demo --output ./scrnaseq_demo
# STARsolo with local FASTA+GTF (STAR index built by the pipeline)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset star --protocol 10XV3 \
--fasta /refs/hg38.fa --gtf /refs/hg38.gtf
# STARsolo with prebuilt STAR index
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset star --protocol 10XV3 \
--star-index /refs/star_index
# STARsolo RNA velocity (star requires an explicit --protocol like every star/standard/kallisto run)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset star --protocol 10XV3 \
--star-feature "Gene Velocyto" --star-ignore-sjdbgtf \
--fasta /refs/hg38.fa --gtf /refs/hg38.gtf
# Simpleaf (standard) with UMI resolution override
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset standard --protocol 10XV3 \
--simpleaf-umi-resolution cr-like-em --genome GRCh38
# Kallisto RNA velocity (NAC workflow)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset kallisto --protocol 10XV3 \
--kb-workflow nac --fasta /refs/hg38.fa --gtf /refs/hg38.gtf
# Air-gapped cluster: local iGenomes mirror
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset star --protocol 10XV3 \
--genome GRCh38 --igenomes-base /mnt/local_igenomes
# CellRanger Multi (CMO multiplexing)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset cellrangermulti \
--cellranger-index /refs/refdata-gex-GRCh38 \
--gex-cmo-set /refs/cmo_set.csv \
--cellranger-multi-barcodes /refs/multi_barcodes.csv
Key flags
| Flag | Type | Default | Description |
|---|---|---|---|
| `--input` | path | — | Samplesheet CSV (sample,fastq_1,fastq_2). Required unless --demo |
| `--output` | path | — | Output directory for results and the reproducibility bundle (required) |
| `--demo` | flag | — | Run the upstream nf-core test profile; forces --preset star and skip_cellbender |
| `--check` | flag | — | Run preflight only and exit (no Nextflow execution) |
| `--preset` | string | `standard` | Aligner preset |
| `--aligner` | string | — | nf-core/scrnaseq aligner alias for --preset: simpleaf maps to standard; the other values map to same-named presets |
| `--profile` | string | `docker` | Execution backend or comma-separated nf-core profile list: docker, conda, mamba, singularity, apptainer, podman, shifter, charliecloud, wave, gpu, debug, arm64, emulate_amd64, test, test_full, test_cellrangermulti, test_multiome |
| `--pipeline-version` | string | `4.1.0` | Remote nf-core/scrnaseq tag or commit (used when no local sibling checkout is found) |
| `--allow-dirty-pipeline` | flag | — | Development-only opt-in to run a modified local sibling scrnaseq/ checkout; rejected by default for production reproducibility |
| `--require-local-pipeline` | flag | — | Require a verifiable local sibling scrnaseq/ checkout; fail instead of falling back to the remote pipeline |
| `--allow-pipeline-version-override` | flag | — | Allow a --pipeline-version other than the pinned 4.1.0 contract (warned, recorded in provenance; validations stay 4.1.0) |
| `--trust-config-params` | flag | — | Allow -c/--config files that set params.* (otherwise blocked); detected overrides are recorded in provenance |
| `-c` / `--config` | path | — | Additional Nextflow config file. May be repeated; files are validated, applied to the live run, copied into the reproducibility bundle, and replayed by commands.sh |
| `--protocol` | string | `None` | Chemistry/protocol forwarded to the aligner. Required for standard, star, and kallisto; omitting it is allowed only for cellranger, cellrangerarc, and cellrangermulti, where it preserves CellRanger auto-detection. Explicit auto is invalid for standard, star, and kallisto. smartseq is valid for star and kallisto only. Other protocol strings are mapped when known or passed through by nf-core |
| `--genome` | string | — | iGenomes shortcut (GRCh38, mm10, etc.); mutually exclusive with --fasta/--gtf and all index flags |
| `--igenomes-base` | string | — | Base URL or local path for iGenomes (default s3://ngi-igenomes/igenomes/). Use for local mirrors or air-gapped clusters. A local base path is existence-checked in preflight when --genome is set (remote s3:// or https:// bases are deferred to Nextflow) |
| `--igenomes-ignore` | flag | — | Do not load the iGenomes reference config. Set automatically whenever an explicit genome reference is supplied; only needed manually in unusual setups |
| `--fasta` | path | — | Genome FASTA (.fa, .fna, .fasta, .gz variants; no whitespace in path) |
| `--gtf` | path | — | Gene annotation GTF |
| `--star-index` | path | — | Prebuilt STAR genome index directory |
| `--simpleaf-index` | path | — | Prebuilt simpleaf/alevin-fry index |
| `--kallisto-index` | path | — | Prebuilt kallisto index |
| `--cellranger-index` | path | — | Prebuilt CellRanger or CellRanger ARC reference |
| `--transcript-fasta` | path | — | Transcriptome FASTA for simpleaf |
| `--txp2gene` | path | — | Transcript-to-gene mapping for simpleaf |
| `--barcode-whitelist` | path | — | Custom barcode whitelist (per-aligner format) |
| `--star-feature` | enum | — | STARsolo feature type: Gene, GeneFull, Gene Velocyto |
| `--star-ignore-sjdbgtf` | flag | — | Do not use GTF for SJDB construction (required for Gene Velocyto) |
| `--seq-center` | string | — | Sequencing center name for BAM read group tag |
| `--simpleaf-umi-resolution` | enum | — | UMI resolution strategy for alevin-fry: cr-like, cr-like-em, parsimony, parsimony-em, parsimony-gene, parsimony-gene-em |
| `--kb-workflow` | enum | — | Kallisto workflow: standard, lamanno, nac |
| `--kb-t1c` | path | — | cDNA transcripts-to-capture file for RNA velocity (lamanno/nac). Required only with a prebuilt --kallisto-index; auto-generated from --fasta/--gtf |
| `--kb-t2c` | path | — | Intron transcripts-to-capture file for RNA velocity (lamanno/nac). Required only with a prebuilt --kallisto-index; auto-generated from --fasta/--gtf |
| `--skip-cellbender` | flag | — | Disable the CellBender ambient RNA removal subworkflow |
| `--skip-emptydrops` | flag | — | Deprecated compatibility alias for --skip-cellbender; the wrapper writes skip_cellbender: true and never writes deprecated upstream skip_emptydrops |
| `--skip-fastqc` | flag | — | Skip FastQC quality control |
| `--skip-multiqc` | flag | — | Skip MultiQC report generation |
| `--skip-cellranger-renaming` | flag | — | Skip automatic sample renaming in CellRanger modules |
| `--skip-cellrangermulti-vdjref` | flag | — | Skip mkvdjref in cellrangermulti (when VDJ data is absent or a prebuilt --cellranger-vdj-index is supplied) |
| `--save-reference` | flag | — | Save the built reference index for future reuse |
| `--save-align-intermeds` / `--no-save-align-intermeds` | flag | — | Forward save_align_intermeds: true/false; when neither is given the upstream nf-core/scrnaseq 4.1.0 default (true) is preserved, so intermediate BAMs are published by default; pass --no-save-align-intermeds on large runs to save disk |
| `--expected-cells` | int | — | Override expected cell count for a single-sample samplesheet; multi-sample runs must set expected_cells per row. The wrapper enforces ≥1 (stricter than the upstream integer schema, which has no minimum) since a non-positive count is meaningless |
| `--timeout-hours` | float | `12` | Wall-clock cap for the Nextflow run, in hours. Use 0 to disable the cap for long HPC/cloud runs whose walltime is enforced by the scheduler. Via the ClawBio runner the runner's own timeout also applies |
| `--work-dir` | string | `<output>/upstream/work` | Nextflow work directory. Local paths are resolved before execution; object-store URIs such as s3://... or gs://... are passed through for cloud executors |
| `--allow-conda-cellranger` | flag | — | Allow Cell Ranger presets with conda/mamba only when a trusted site config supplies Cell Ranger |
| `--email` | string | — | Email address for pipeline completion notification |
| `--email-on-fail` | string | — | Email address for pipeline failure notification |
| `--multiqc-title` | string | — | Custom title for the MultiQC report |
| `--multiqc-config` | path | — | Custom MultiQC config YAML |
| `--multiqc-logo` | path | — | Custom MultiQC logo image |
| `--multiqc-methods-description` | path | — | Custom MultiQC methods-description YAML |
| `--publish-dir-mode` | enum | — | Forwarded nf-core publish mode: symlink, rellink, link, copy, copyNoFollow, or move |
| `--trace-report-suffix` | string | — | Suffix for Nextflow trace/report/timeline filenames |
| `--monochrome-logs` | flag | — | Disable ANSI colors in nf-core logs |
| `--resume` | flag | — | Nextflow resume (checksum-verified against prior manifest; preset/profile/source/work-dir must match) |
| `--run-downstream` | flag | — | Opt in to scrna_orchestrator handoff after pipeline completion |
| `--skip-downstream` | flag | — | Force-skip the downstream handoff even if --run-downstream is given (handoff is off by default) |
| `--cellrangerarc-config` | path | — | Config JSON for CellRanger ARC index construction |
| `--cellrangerarc-reference` | string | — | Reference genome name used inside the CellRanger ARC config |
| `--motifs` | path | — | Motif file (e.g. JASPAR) for CellRanger ARC |
| `--cellranger-vdj-index` | path | — | Prebuilt CellRanger VDJ reference |
| `--gex-frna-probe-set` | path | — | Probe set CSV for FFPE fixed RNA profiling (cellrangermulti) |
| `--gex-target-panel` | path | — | Target panel CSV for targeted GEX (cellrangermulti) |
| `--gex-cmo-set` | path | — | CMO reference CSV for multiplexed samples (cellrangermulti) |
| `--gex-barcode-sample-assignment` | path | — | Barcode-to-sample assignment override CSV (cellrangermulti). Not an OCM selector: OCM mode is encoded via the ocm_ids column of --cellranger-multi-barcodes |
| `--fb-reference` | path | — | Feature-barcode reference CSV for antibody capture (cellrangermulti) |
| `--vdj-inner-enrichment-primers` | path | — | V(D)J cDNA enrichment primer sequences (cellrangermulti) |
| `--cellranger-multi-barcodes` | path | — | Multiplexed sample samplesheet for CMO/FFPE demultiplexing (cellrangermulti) |
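The `--timeout-hours` semantics in the table (a cap in hours, with 0 meaning no cap) together with the disk-backed logging described earlier can be sketched as below. The function `run_logged` is hypothetical and simplified: it terminates only the direct child, whereas the wrapper is described as terminating the whole child process tree.

```python
import subprocess
from pathlib import Path


def run_logged(command: list, log_dir, timeout_hours: float = 12.0) -> int:
    """Run a command with stdout/stderr streamed to files and an optional cap."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    # 0 disables the cap; otherwise convert hours to seconds.
    timeout = None if timeout_hours == 0 else timeout_hours * 3600

    with open(log_dir / "stdout.txt", "wb") as out, open(log_dir / "stderr.txt", "wb") as err:
        process = subprocess.Popen(command, stdout=out, stderr=err)
        try:
            return process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
            raise RuntimeError("EXECUTION_FAILED: run exceeded the wall-clock cap")
        except KeyboardInterrupt:
            # Simplification: only the direct child is stopped here.
            process.terminate()
            process.wait()
            raise
```

Because the child writes straight to the open file handles, output never accumulates in the parent's memory, which is the property the wrapper relies on for long runs.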
Output Structure
output_directory/
├── report.md # Wrapper run summary
├── result.json # Structured result payload
├── check_result.json # Written only with --check (preflight-only mode); no upstream/ is produced
├── logs/
│ ├── stdout.txt # Nextflow stdout
│ └── stderr.txt # Nextflow stderr
├── upstream/
│ └── results/ # nf-core/scrnaseq output tree
│ ├── fastqc/ # Per-read FastQC reports
│ ├── multiqc/ # MultiQC HTML and data
│ │ └── multiqc_report.html
│ ├── pipeline_info/ # Execution report, timeline, trace, DAG
│ └── <aligner>/ # Aligner-specific outputs
│ └── mtx_conversions/ # AnnData (.h5ad), SCE (.rds), Seurat (.rds)
│ │ # Per nf-core/scrnaseq 4.1.0 conf/modules.config, the concatenated
│ │ # matrices sit at the TOP of mtx_conversions/ while each per-sample
│ │ # matrix is nested one level deeper under <sample>/ (MTX_TO_H5AD/
│ │ # CONCAT_H5AD/ANNDATA_BARCODES rewrite non-combined files to
│ │ # `${meta.id}/${filename}`). The wrapper scans BOTH depths and does
│ │ # not hard-code filenames: it ranks whatever .h5ad it finds to pick
│ │ # one preferred_h5ad (combined > per-sample; within a group
│ │ # cellbender_filter > filtered > plain > raw, matched on the filename
│ │ # suffix). combined_* filter-suffixed variants are version-dependent.
│ ├── combined_matrix.h5ad # documented concatenated matrix (top level)
│ ├── combined_filtered_matrix.h5ad # version-dependent: filtering ran
│ ├── combined_cellbender_filter_matrix.h5ad # version-dependent: CellBender ran → top preference
│ └── <sample>/ # per-sample matrices nested one level deeper
│ ├── <sample>_raw_matrix.h5ad
│ └── <sample>_filtered_matrix.h5ad # conditional: filtering ran
├── reproducibility/ # Reproducibility + provenance bundle (single directory)
│ ├── samplesheet.valid.csv # Normalized samplesheet (absolute POSIX paths); named samplesheet.demo.csv with --demo
│ ├── params.yaml # Effective Nextflow parameters
│ ├── nextflow_configs/ # Written only when -c/--config is used: copies of the supplied config files, replayed by commands.sh
│ ├── commands.sh # Portable replay script
│ ├── environment.yml # Conda environment spec (for reference)
│ ├── checksums.sha256 # SHA-256 for in-bundle artifacts only; `sha256sum -c` passes from output dir
│ ├── manifest.json # Run metadata: preset, profile, versions, checksums
│ ├── macos_docker.config # macOS+Docker workarounds (VirtioFS, ARM64, STAR FIFOs)
│ ├── remap_paths.py # Helper for replaying on a different machine
│ ├── compatibility_policy.json # Copied policy snapshot (resume/update rules)
│ ├── pinned_versions.json # Copied pinned-versions snapshot
│ ├── inputs.json # Samplesheet + fastq/reference paths and digests (reference_checksums)
│ ├── invocation.json # Timestamp, preset, profile, pipeline version
│ ├── preflight.json # Java/Nextflow/backend info
│ ├── upstream.json # Pipeline source resolution details
│ ├── outputs.json # Detected artifacts
│ ├── runtime.json # Execution timing
│ └── skill.json # Skill name and version
├── provenance/ # Written only with --run-downstream
│ └── handoff.json # Downstream orchestrator path + checksum + outcome
└── scrna_analysis/ # Written only with --run-downstream: scrna_orchestrator output
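The ranking described in the tree comments (combined over per-sample, then cellbender_filter > filtered > plain > raw by filename suffix) can be approximated with the sketch below. It is not the logic in `outputs_parser.py`: the function `pick_preferred_h5ad` is hypothetical, and it assumes that per-sample files are unambiguous only when they all sit in a single sample directory.

```python
from pathlib import PurePosixPath
from typing import Optional

# Lower rank wins. Suffixes follow the filenames shown in the tree above.
_SUFFIX_RANKS = (
    ("cellbender_filter_matrix", 0),
    ("filtered_matrix", 1),
    ("raw_matrix", 3),
)
_PLAIN_RANK = 2  # e.g. combined_matrix.h5ad


def _rank(path: PurePosixPath) -> int:
    for suffix, rank in _SUFFIX_RANKS:
        if path.stem.endswith(suffix):
            return rank
    return _PLAIN_RANK


def pick_preferred_h5ad(paths: list) -> Optional[str]:
    """Return the preferred .h5ad path, or None when the choice is ambiguous."""
    files = [PurePosixPath(p) for p in paths if p.endswith(".h5ad")]
    combined = [f for f in files if f.name.startswith("combined_")]
    if combined:
        return str(min(combined, key=lambda f: (_rank(f), f.name)))
    # Assumption: one sample directory means one unambiguous candidate set.
    if len({f.parent for f in files}) == 1:
        return str(min(files, key=lambda f: (_rank(f), f.name)))
    return None
```

A `None` result corresponds to the `handoff_available = false` case: matrices exist, but no single file can be chosen safely.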
Example Output
result.json (abbreviated):
{
"skill": "scrnaseq-pipeline",
"version": "0.1.0",
"summary": {
"preset": "star",
"aligner_effective": "star",
"pipeline_source_kind": "remote_repo",
"pipeline_version_or_commit": "4.1.0",
"profile": "docker",
"preferred_h5ad": "<output>/upstream/results/star/mtx_conversions/combined_filtered_matrix.h5ad",
"handoff_available": true,
"samples_detected": 2,
"cellbender_used": false
}
}
report.md closes with:
## Next Steps
- python clawbio.py run scrna --input <preferred_h5ad> --output <dir>
- python clawbio.py run scrna-embedding --input <preferred_h5ad> --output <dir>
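Before chaining to a downstream skill, a caller can read `result.json` and confirm that a handoff is available. The sketch below assumes the `summary` layout shown in the abbreviated example; the helper `next_command` is hypothetical, and the command it builds simply mirrors the first line of the Next Steps excerpt.

```python
import json
from pathlib import Path
from typing import Optional


def next_command(result_path, output_dir) -> Optional[list]:
    """Build the downstream command, or return None when no handoff is available."""
    payload = json.loads(Path(result_path).read_text(encoding="utf-8"))
    summary = payload.get("summary", {})
    h5ad = summary.get("preferred_h5ad")
    if not summary.get("handoff_available") or not h5ad:
        return None
    return [
        "python", "clawbio.py", "run", "scrna",
        "--input", h5ad,
        "--output", str(output_dir),
    ]
```

Returning an argument list instead of a shell string keeps paths with spaces intact if the command is later passed to `subprocess.run`.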
Gotchas
- Preflight runs before any Nextflow call. If Java, Nextflow, or the backend are missing or too old, the pipeline never starts and you get a structured JSON error with
error_codeand afixhint. Nextflow ≥25.04.0 is required. - Conda/Mamba profiles need network access at runtime. Preflight only verifies the
conda/mambabinary exists — it deliberately does not probe network connectivity (a flaky check that would also reject valid offline caches). With-profile conda, Nextflow resolves environments from bioconda/conda-forge at runtime, so an offline or proxied host can make a task fail mid-run with a confusing resolver error. Cell Ranger presets are rejected with conda/mamba by default because Cell Ranger is not distributed via bioconda; use--allow-conda-cellrangeronly when a trusted site config provides Cell Ranger. For fully offline/reproducible execution preferdocker/singularity(pinned containers), or pre-warm the conda package cache before running. --genomeconflicts with any explicit genome-reference flag. Providing--genomealongside--fasta,--gtf,--transcript-fasta,--txp2gene, or any prebuilt index raisesCONFLICTING_REFERENCESin preflight. Use either--genome <shortcut>or explicit genome flags — never both. Auxiliary files (barcode whitelist, CMO/probe/feature sets, primers, multi-barcode samplesheets) are not genome references and are compatible with--genome.igenomes_ignoreis set automatically. Whenever an explicit genome reference (fasta,gtf,transcript_fasta,txp2gene, or any prebuilt index) is provided, the wrapper writesigenomes_ignore: trueto suppress nf-schema DNS validation of the default iGenomes S3 URL. Auxiliary files never trigger it. You do not need to set this manually. Use--igenomes-baseonly for local iGenomes mirrors.- Protocol compatibility is enforced before Nextflow starts.
standard,star, andkallistorequire an explicit--protocol, because nf-core/scrnaseq 4.1.0 documentsautoas CellRanger-only. Explicitautois rejected for those presets.smartseqdenotes Smart-seq3 and is accepted forstarandkallistoonly;smartseq2is rejected for every preset because nf-core/scrnaseq 4.1.0 does not support Smart-seq2.cellrangeraccepts onlyautoand10XV1-10XV4;cellrangerarcaccepts onlyauto;cellrangermultiis samplesheet-driven. Unknown custom protocol strings are passed through only forstandard,star, andkallisto, where nf-core documents custom values. save_align_intermedsdefaults to ON in 4.1.0. Upstream nf-core/scrnaseq 4.1.0 setssave_align_intermeds: trueby default, so aligner intermediate BAMs are published into the results tree unless you pass--no-save-align-intermeds. On real (non---demo) datasets these intermediates can be tens to hundreds of GB; disable them when you only need the count matrices.- Deprecated upstream parameters:
skip_emptydropsis deprecated in nf-core/scrnaseq 4.1.0. The wrapper accepts--skip-emptydropsonly as a compatibility alias and emitsskip_cellbender: true. --demoforces preset=star and skip_cellbender=true. The nf-core upstreamtestprofile ships STAR-compatible data and explicitly disables CellBender (which does not work on small test datasets). If a different preset is requested with--demo, the wrapper warns and overrides it.--demois fully hermetic — every pipeline flag is ignored. Thetestprofile owns the entire pipeline configuration: input, references, protocol, and all QC/skip/tuning/save/reporting knobs. Because a-params-filevalue overrides profile config in Nextflow, the wrapper writes only the four forced essentials intoparams.yamlin demo mode —outdir,aligner(star),igenomes_ignore, andskip_cellbender— and nothing else. Any other flag you pass with--demo(e.g.--genome,--fasta, an index,--igenomes-base,--protocol,--star-feature,--skip-fastqc,--save-reference,--seq-center,--publish-dir-mode, …) is ignored and listed in a WARNING. Demo output validation likewise requires FastQC and MultiQC regardless of any--skip-*you pass, since those flags are not written. Drop--demoto run on your own inputs.- The Nextflow run has a 12 h wall-clock cap by default. Large multi-sample STARsolo/CellRanger runs on full genomes can exceed this; raise it with
`--timeout-hours <n>` or pass `--timeout-hours 0` to disable the cap entirely (recommended on HPC/cloud where the scheduler enforces walltime). When the run targets an object-store work directory (`--work-dir s3://…`/`gs://…`) or an institutional/site profile while the cap is still active, preflight prints a WARNING reminding you to pass `--timeout-hours 0`, since the scheduler, not the wrapper, should bound such runs. When neither the cap fires nor the job is killed, behaviour is unchanged.
- Required outputs are checked after Nextflow exits. The wrapper expects
`pipeline_info/`, the effective aligner directory, MultiQC unless `--skip-multiqc`, FastQC HTML/ZIP reports unless `--skip-fastqc`, and at least one `.h5ad` matrix. FastQC is a hard requirement for every aligner, including the Cell Ranger family (`cellranger`/`cellrangerarc`/`cellrangermulti`), because nf-core/scrnaseq 4.1.0 runs FASTQC on the shared input-read channel (`ch_fastq`) before any aligner-specific branching and publishes `fastqc/` for all of them (the output docs state "FastQC is applied to all aligners' input reads"). A missing `fastqc/` tree on a non-skipped run is therefore a genuine failure, not a tolerated gap. All six presets run the pipeline's `MTX_TO_H5AD` conversion (CellRanger ARC/Multi reuse the CellRanger template), so a completed run with zero `.h5ad` is treated as a failure (`EXPECTED_OUTPUTS_NOT_FOUND`) rather than a silent success. A present-but-ambiguous selection (e.g. several per-sample matrices, no combined) does not fail the run; it is signalled by `handoff_available = false`, and `--run-downstream` prints a warning instead of silently returning.
- `preferred_h5ad` may be absent. If no combined matrix is produced and there are multiple per-sample files, `handoff_available` is `false`. Always check `result.json` before chaining to `scrna-orchestrator` or `scrna-embedding`.
- No arbitrary Nextflow parameter passthrough. Pipeline parameters flow only through the preset system and
`params.yaml`; no custom `--param` flags can be injected. Validated `-c`/`--config` files are allowed for infrastructure/HPC configuration and are copied into the reproducibility bundle. Note this is a trust boundary, not a sandbox: a Nextflow config is executable Groovy (it can set `process.beforeScript`, `process.shell`, etc.), so a `-c` file is as trusted as code you run yourself. The wrapper validates that each config exists and lints it for `params.*` overrides (see the `-c`/`--config` gotcha below), but otherwise does not sandbox its contents; only pass configs you authored or trust.
- The work directory defaults to `<output>/upstream/work`. This preserves local resume and bundle portability for workstation/HPC runs. Managed cloud-batch executors that need an object-store work directory can pass `--work-dir s3://...` or `--work-dir gs://...`; the value is recorded in provenance and replayed by `commands.sh`. Local custom paths are resolved before execution and are less portable than the default.
- The published results directory (`outdir`) is intentionally the local `<output>/upstream/results`. This is a wrapper policy, not an nf-core limitation: the wrapper parses the results tree to detect the `preferred_h5ad`, MultiQC, and `pipeline_info/`, and to hand off downstream, so results must land on the local filesystem. This is still compatible with cloud-batch executors: the work directory may be remote (`--work-dir s3://...`) while Nextflow copies the published results back to the local `outdir`. There is no `--outdir` override to an object store, because the wrapper could not then parse outputs or chain downstream; for a remote-published run, invoke the upstream pipeline directly.
- `--pipeline-version` is pinned to the 4.1.0 contract. The wrapper's parameter set, protocol matrix and output validation are written for nf-core/scrnaseq 4.1.0, so a different `--pipeline-version` is rejected unless you pass `--allow-pipeline-version-override` (a warned, provenance-recorded opt-in; validations stay 4.1.0).
- `-c`/`--config` files may not set `params.*`. nf-core advises against setting parameters via `-c`. The wrapper blocks any config that assigns parameters outside the audited `params.yaml`: dotted (`params.aligner = …`), bracket (`params['aligner'] = …`), whole-map (`params = […]`), or a `params { … }` block (including the `{` on the next line). Pass `--trust-config-params` to allow it (detected overrides are recorded in provenance). Note the config is still trusted Groovy and is not sandboxed; this lint only guarantees `params.yaml` stays the single parameter source.
- `--resume` enforces strict compatibility. The wrapper checks that the stored manifest matches the current preset, profile, pipeline source, effective `params.yaml` checksum, and Nextflow work directory. Mismatches raise `INVALID_RESUME_STATE`.
- RNA velocity requires coordinated flags and is validated in preflight. For STARsolo:
`--star-feature "Gene Velocyto"` requires `--star-ignore-sjdbgtf`. For Kallisto `--kb-workflow lamanno`/`nac`: when the index is built from `--fasta`/`--gtf`, `kb ref` generates the capture files, so `--kb-t1c`/`--kb-t2c` are not required (matching the documented fasta+gtf example); they are required only alongside a prebuilt `--kallisto-index`, and supplying exactly one of the two is always rejected. Invalid combinations raise `INVALID_PRESET_CONFIGURATION` before Nextflow starts.
- `cellrangerarc` config and reference are paired. Providing `--cellrangerarc-config` without `--cellrangerarc-reference` (or vice versa) raises `INVALID_PRESET_CONFIGURATION`. `--motifs` is optional and independent.
- `cellrangermulti` validates documented preflight constraints. `feature_type=ab` or `feature_type=crispr` requires `--fb-reference` (in nf-core/scrnaseq 4.1.0 both Antibody Capture and CRISPR Guide Capture share the single `fb_reference` feature reference; there is no separate CRISPR reference param, so a crispr run without it has no feature reference and fails inside Cell Ranger); `feature_type=vdj` with `--skip-cellrangermulti-vdjref` requires `--cellranger-vdj-index`; `--gex-frna-probe-set` (FFPE) requires `--cellranger-multi-barcodes`; `feature_type=cmo` (CMO multiplexing) also requires `--cellranger-multi-barcodes`. The three multiplexing modes nf-core documents as mutually exclusive are encoded per multiplexed sample in the `--cellranger-multi-barcodes` samplesheet via the `probe_barcode_ids` (FFPE), `cmo_ids` (CMO) and `ocm_ids` (OCM) columns. The wrapper parses that samplesheet and rejects any physical `sample` that populates more than one of those columns (`INVALID_PRESET_CONFIGURATION`, with `conflicting_modes`). It additionally rejects the run-level FFPE↔CMO flag combination (`--gex-frna-probe-set` with `feature_type=cmo`/`--gex-cmo-set`).
Note `--gex-barcode-sample-assignment` is a cell/tag-calling override, not an OCM selector, so it is never treated as a mode. `cellrangermulti` FASTQ filenames are not validated by the wrapper (unlike `cellranger`/`cellrangerarc`): Cell Ranger Multi maps libraries through its own multi samplesheet and `[libraries]` config, so the per-file 10x naming is left to Cell Ranger to validate.
- FASTA schema validation. The FASTA path must match
`^\S+\.fn?a(sta)?(\.gz)?$` (the nf-core/scrnaseq 4.1.0 schema). Paths with whitespace or non-standard extensions are rejected in preflight.
- Local checkout must be a sibling directory. The wrapper looks for
`../scrnaseq` relative to the ClawBio repo root. If the checkout path contains whitespace (common on macOS), the wrapper warns and falls back to the remote pipeline. Ensure `ClawBio/` and `scrnaseq/` share the same parent folder.
- Local checkout version is verified. A sibling checkout is used only when its exact git tag matches
`--pipeline-version` or its commit equals `--pipeline-version`. Git-less or unverifiable local directories fall back to the remote pinned pipeline.
- macOS Docker workaround is applied automatically. On macOS with Docker,
`macos_docker.config` is written to the reproducibility bundle and passed to Nextflow. It sets `stageInMode = "copy"` (avoids VirtioFS EDEADLK), `--platform linux/amd64` (Rosetta emulation), and routes STAR `_STARtmp` to the container's `/tmp` (avoids VirtioFS FIFO limitation). These three workarounds apply to every macOS+Docker run. A `resourceLimits` block (cpus 4, memory 15.GB) is written only for `--demo` runs, with `time` raised to `4.h` (the upstream `conf/test.config` caps tasks at `1.h`, too short for STAR genome generation under Apple-Silicon emulation). These ceilings are test-machine sized and are never applied to real datasets (a human STAR index needs far more than 15 GB), so production runs use the host's real resources. One consequence: a `--demo` run is capped at `4.h` on macOS+Docker but at the upstream `1.h` on Linux. Output directories under `/tmp` emit a WARNING; use a path under HOME.
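
The FASTA schema check in the list above can be illustrated with a short sketch. The regular expression is the one this document quotes from the nf-core/scrnaseq 4.1.0 schema; the helper name `fasta_path_matches_schema` is hypothetical and is not taken from the wrapper's code, which may implement the check differently.

```python
import re

# Pattern quoted in this document from the nf-core/scrnaseq 4.1.0 schema.
FASTA_SCHEMA_PATTERN = re.compile(r"^\S+\.fn?a(sta)?(\.gz)?$")


def fasta_path_matches_schema(path: str) -> bool:
    """Return True when the path satisfies the schema pattern (hypothetical helper)."""
    # fullmatch rejects a trailing newline, which `$` on its own would tolerate.
    return FASTA_SCHEMA_PATTERN.fullmatch(path) is not None


if __name__ == "__main__":
    # Illustrative candidates only: plain, gzipped, whitespace, odd extension.
    for candidate in ["genome.fa", "genome.fna.gz", "my genome.fa", "genome.fas"]:
        print(candidate, fasta_path_matches_schema(candidate))
```

Under this pattern a path containing a space or ending in an extension such as `.fas` fails, which matches the rejection behaviour described for preflight.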
Safety
- Local-first: User FASTQs and outputs remain on the local filesystem.
- Strict preflight: Nextflow is never invoked if validation fails.
- No hallucinated outputs: Only artifacts confirmed on disk are reported.
- Disclaimer: Every report includes the ClawBio medical disclaimer.
Reproducibility Scope
The bundle guarantees configuration- and provenance-level reproducibility, not bit-for-bit identical scientific results. Be explicit about the boundary:
Guaranteed (captured in the bundle):
- Pipeline pinned by `-r <version>` (or a git-verified local checkout commit).
- Nextflow engine pinned via `export NXF_VER=<version>` in `commands.sh`.
- Effective `params.yaml` + SHA-256; `--resume` rejects a mismatched params checksum.
- `checksums.sha256` (relative labels: `sha256sum -c` passes after the bundle is copied to any folder), reference-file digests in `inputs.json`, and a portable self-anchoring `commands.sh` + `remap_paths.py`.
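
To illustrate why relative labels keep `sha256sum -c` working after the bundle is moved, here is a minimal sketch that writes a manifest in the `sha256sum` text format (digest, two spaces, path). The function names and the choice to hash every file under the bundle directory are assumptions for illustration; the wrapper's actual implementation may differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_relative_checksums(bundle_dir: Path, manifest_name: str = "checksums.sha256") -> Path:
    """Write '<digest>  <relative path>' lines for each bundle file (hypothetical helper)."""
    manifest = bundle_dir / manifest_name
    lines = []
    for path in sorted(p for p in bundle_dir.rglob("*") if p.is_file()):
        if path == manifest:
            continue  # the manifest cannot list its own digest
        # Labels are relative to the bundle root, never absolute.
        label = path.relative_to(bundle_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {label}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Because each label is relative to the bundle root, running `sha256sum -c checksums.sha256` from inside a copied bundle resolves the same files wherever the bundle lives.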
NOT guaranteed (outside the wrapper's control):
- Bit-for-bit identical outputs. Aligner thread-count can reorder records, and CellBender is stochastic (random-seed driven), so the same inputs+versions may not yield byte-identical `.h5ad`.
- Container immutability. Images are pinned by the tag nf-core ships, not by an immutable digest; a re-published tag could change the binaries.
- `conda`/`mamba` is not offline-reproducible: environments resolve from channels at run time (prefer `docker`/`singularity`).
- External reference bytes are not bundled: `--genome` pulls from iGenomes S3; local refs are checksummed for provenance but not copied into the bundle.
To approach bit-level reproduction: use docker/singularity (not conda), pin container digests, fix tool seeds where the pipeline allows, and archive the exact reference files alongside the bundle.
Agent Boundary
The agent dispatches and explains; this skill executes.
Agent: Interpret the user's preprocessing intent, choose the preset, and verify that handoff_available is true in result.json before routing to downstream skills.
Skill: Validate environment and inputs, run the pipeline with controlled parameters, write all provenance and reproducibility artifacts, and report the detected preferred_h5ad.
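
The agent-side check described above can be sketched as follows. The field names `handoff_available` and `preferred_h5ad` come from this document, but their placement at the top level of `result.json` and the helper name `resolve_handoff` are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_handoff(result_json: Path) -> Optional[str]:
    """Return the preferred .h5ad path when handoff is available, else None.

    Assumes (not confirmed by the source) that both keys sit at the top level
    of result.json.
    """
    payload = json.loads(result_json.read_text(encoding="utf-8"))
    # Only an explicit boolean true counts; a missing key means no handoff.
    if payload.get("handoff_available") is not True:
        return None
    preferred = payload.get("preferred_h5ad")
    # preferred_h5ad may be absent, so treat empty or non-string values as no handoff.
    if not isinstance(preferred, str) or not preferred:
        return None
    return preferred
```

An agent following this sketch would route to `scrna-orchestrator` or `scrna-embedding` only when a path is returned, and would otherwise report the ambiguity to the user instead of guessing a matrix.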
Chaining Partners
| Skill | When to chain |
|---|---|
| `scrna-orchestrator` | After a successful run, pass preferred_h5ad for clustering, QC, and markers |
| `scrna-embedding` | Pass preferred_h5ad for scVI/scANVI batch integration and latent embeddings |
| `multiqc-reporter` | Re-aggregate QC across multiple wrapper runs |
Maintenance
Review cadence: After each nf-core/scrnaseq major release. Check NEXTFLOW_MIN_VERSION (schemas.py), SUPPORTED_PRESETS, SUPPORTED_PROFILES, and this SKILL.md for accuracy.
Staleness signals:
- Preflight rejects a Nextflow version that the current pipeline supports → update `NEXTFLOW_MIN_VERSION` in `schemas.py` and `reproducibility/pinned_versions.json`.
- New aligners appear upstream but are absent from `PRESET_ALIGNERS` → add to `schemas.py` and update tests.
- New reference/index inputs appear upstream → add them to `GENOME_REFERENCE_FIELDS` or `AUXILIARY_PATH_FIELDS` in `schemas.py` (the single source both `params_builder` and `preflight` consume). Never re-introduce per-module copies.
- The pinned `STAR_ALIGN_BASE_EXT_ARGS` in `nfcore_4_1_0_contract.py` drifts from `conf/modules.config` → `test_pinned_star_args_match_sibling_checkout_if_present` fails when a sibling checkout is present; update the constant to match.
- The VirtioFS macOS workaround (`stageInMode = "copy"`) is only necessary while Apple Silicon runs Docker via QEMU. Remove `build_macos_docker_config`/`write_macos_docker_config` when a native arm64 Docker runtime eliminates VirtioFS deadlocks.
Deprecation criteria: Deprecate if nf-core/scrnaseq releases a Python SDK with equivalent preflight, params, and provenance APIs.