name: nf-pipeline-to-galaxy-workflow
description: Convert a complete Nextflow pipeline to Galaxy
Nextflow Pipeline to Galaxy Workflow
When to Use
Use this skill when:
- Converting a complete Nextflow pipeline (e.g., nf-core pipeline)
- Creating a full Galaxy solution with multiple workflows
- Planning a large-scale conversion project
Don't use this skill if:
- Converting a single process (use nf-process-to-galaxy-tool instead)
- Converting a single subworkflow (use nf-subworkflow-to-galaxy-workflow instead)
Step-by-Step Process
Step 1: Clarify Workflow Scope with User
REQUIRED before starting conversion - ask the user:
What parts of the Nextflow pipeline to convert?
- Full pipeline end-to-end?
- Specific subworkflow only?
- Exclude certain optional steps?
- Example: "Do you want preprocessing + alignment + variant calling + annotation, or just variant calling?"
Quality control and reporting requirements:
- Include QC steps (FastQC, etc.)?
- Include aggregated reporting (MultiQC)?
- Include intermediate QC checks?
Best-practice scientific / bioinformatics sanity check (REQUIRED):
- If the requested scope has conceptual issues (e.g., missing required preprocessing steps like sort/index, omitting essential QC, incompatible inputs/reference/annotation, or a too-small subset that would violate common analysis best practices), flag the issue.
- Ask the user whether they want to:
- Expand/adjust the scope to address the best-practice issue(s), or
- Proceed with the user’s requested scope as-is (explicitly confirmed).
Workflow metadata:
- Workflow name (descriptive, user-facing)
- Author/Creator name(s)
- License (e.g., MIT, Apache-2.0, GPL-3.0)
- Description/Annotation
- Tags for categorization
NEVER use placeholder values - always get real information from the user. NEVER assume scope - explicitly confirm what to include/exclude.
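The "no placeholders" rule above can be checked mechanically before any workflow is drafted. The following is a minimal sketch, assuming the answers from the user are collected into a plain dictionary; the field names in REQUIRED_FIELDS and the strings in PLACEHOLDER_VALUES are illustrative choices, not part of any Galaxy format.

```python
# Hypothetical field names for the answers gathered in Step 1.
REQUIRED_FIELDS = ("name", "creators", "license", "description", "tags")
# Illustrative strings that usually signal an unfilled value.
PLACEHOLDER_VALUES = {"", "todo", "tbd", "unknown", "placeholder", "your name", "n/a"}


def metadata_problems(metadata: dict) -> list:
    """Return one message per required field that is missing or looks like a placeholder."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        # Treat scalar answers and list answers (creators, tags) uniformly.
        values = value if isinstance(value, (list, tuple)) else [value]
        if not values:
            problems.append(f"{field}: missing")
            continue
        for item in values:
            if item is None or str(item).strip().lower() in PLACEHOLDER_VALUES:
                problems.append(f"{field}: missing or placeholder value {item!r}")
    return problems
```

An empty result means every field has a real-looking value; anything else is a prompt to go back to the user rather than a value to guess.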
Step 2: Analyze Pipeline Structure
Read the main Nextflow file and all imported workflows/subworkflows.
Document:
- All processes used
- Data flow between processes
- Conditionals and branches
- Input/output patterns
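A first pass over the pipeline sources can be automated before reading them in detail. The sketch below uses two regular expressions to list locally defined processes and include statements in a Nextflow file. It assumes conventional formatting (process NAME { at the start of a line, include { ... } from '...') and will miss unusual layouts, so treat the result as a starting inventory rather than a parse.

```python
import re

# Matches "process NAME {" at the start of a line.
PROCESS_RE = re.compile(r"^\s*process\s+(\w+)\s*\{", re.MULTILINE)
# Matches "include { A; B as C } from './path'" with single or double quotes.
INCLUDE_RE = re.compile(r"include\s*\{([^}]*)\}\s*from\s*['\"]([^'\"]+)['\"]")


def scan_nextflow(source: str) -> dict:
    """List process definitions and included names found in Nextflow source text."""
    processes = PROCESS_RE.findall(source)
    includes = []
    for names, module_path in INCLUDE_RE.findall(source):
        for entry in names.split(";"):
            entry = entry.strip()
            if entry:
                # For "FOO as BAR", keep the original name FOO.
                includes.append((entry.split()[0], module_path))
    return {"processes": processes, "includes": includes}
```

Running this over the main file and each included module gives the list of processes to carry into the tool availability check. Data flow, conditionals, and branches still need to be read by hand.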
Step 3: Check Tool Availability and Requirements (CRITICAL)
For each process, verify a corresponding Galaxy tool exists:
- Search tools-iuc repository and verify tool XML exists
- Read the tool XML to confirm:
- Exact tool ID and current version
- Actual input/output parameter names (for connections)
- Conditional parameter structures
- Whether tool requires pre-built databases/indices
- NEVER assume a tool exists without verification
- NEVER use placeholder or non-existent tools
- If a tool doesn't exist, inform user and discuss alternatives
CRITICAL: Verify tool owner/repository (COMMON ERROR):
- Tools can exist under multiple owners (e.g., devteam, iuc, bgruening)
- Check the actual ToolShed at https://toolshed.g2.bx.psu.edu/ to confirm correct owner
- Example: freebayes is under devteam, NOT iuc
- Example: samtools_sort is under iuc, NOT devteam
- Tool ID format: toolshed.g2.bx.psu.edu/repos/{owner}/{repo}/{tool}/{version}
- Wrong owner = "tool not available" error even if tool exists
- Always verify owner before using tool in workflow
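Because the owner is embedded in the tool ID, it can be extracted and compared against what the ToolShed shows. This is a minimal sketch that splits an ID following the format above; the helper names are hypothetical, and the expected owner must still come from a manual ToolShed lookup.

```python
from typing import NamedTuple, Optional

TOOLSHED_PREFIX = "toolshed.g2.bx.psu.edu/repos/"


class ToolShedId(NamedTuple):
    owner: str
    repo: str
    tool: str
    version: str


def parse_toolshed_id(tool_id: str) -> Optional[ToolShedId]:
    """Split {owner}/{repo}/{tool}/{version}; return None if the ID does not fit the format."""
    if not tool_id.startswith(TOOLSHED_PREFIX):
        return None
    parts = tool_id[len(TOOLSHED_PREFIX):].split("/")
    if len(parts) != 4 or not all(parts):
        return None
    return ToolShedId(*parts)


def owner_matches(tool_id: str, expected_owner: str) -> bool:
    """True only when the ID parses and its owner equals the verified owner."""
    parsed = parse_toolshed_id(tool_id)
    return parsed is not None and parsed.owner == expected_owner
```

For example, owner_matches applied to a freebayes ID under repos/iuc/ with expected owner devteam returns False, which flags the ID for correction before it reaches a workflow.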
Check for tool dependencies:
- Does the tool need a database built first? (e.g., SnpEff needs database, not just GTF)
- Does the tool need sorted/indexed input? (e.g., variant callers need sorted BAM + index)
- Are there helper tools needed? (e.g., samtools sort/index after alignment)
Use the check-tool-availability.md reference.
Prefer a splitter approach when the Nextflow workflow:
- Has multiple “modes” controlled by flags (--mode, --skip_*, etc.)
- Has large optional branches triggered by optional inputs
- Exposes a very large parameter surface (many knobs that overwhelm a Galaxy form)
- Produces materially different outputs depending on configuration
Splitter pattern:
- Create 2–N intent-focused Galaxy workflows with simpler inputs
- Keep an “advanced” variant only when needed
- Minimize workflow conditionals; prefer separate workflows when branches are substantial
Deliverable: your plan should explicitly state whether you are producing:
- One Galaxy workflow
- Multiple related Galaxy workflows (recommended for complex pipelines)
- Optional “meta-workflow” that chains subworkflows (only if it improves usability)
Example (CAPHEINE):
CAPHEINE Pipeline:
├── PIPELINE_INITIALISATION (setup)
├── CAPHEINE (main analysis)
│ ├── PROCESS_VIRAL_NONRECOMBINANT (preprocessing subworkflow)
│ ├── HYPHY_ANALYSES (analyses subworkflow)
│ └── DRHIP (aggregation)
└── PIPELINE_COMPLETION (reporting)
CAPHEINE splitter example (foreground branch): CAPHEINE has an optional branch controlled by providing a foreground regexp/list. In Galaxy, it may be clearer to publish two workflows:
Workflow A: CAPHEINE (no foreground)
- Inputs: reference genes, unaligned sequences
- Runs the baseline HyPhy analyses
- Outputs: baseline result set
Workflow B: CAPHEINE (with foreground)
- Inputs: reference genes, unaligned sequences, foreground regexp/list
- Runs additional foreground-dependent HyPhy analyses
- Outputs: baseline + additional foreground-dependent results
This keeps each workflow form smaller and avoids large conditional sections.
Step 2: Check Tool Availability for All Processes
Use: ../check-tool-availability.md and ../scripts/check_tool.sh
Create comprehensive tool inventory:
cd ../
# List all unique tools used in the pipeline (replace tool1 tool2 tool3 with the real tool names)
for tool in tool1 tool2 tool3; do
  ./scripts/check_tool.sh "$tool"
done
Document all findings in a table:
| Process | Tool | Status | Action |
|---------|------|--------|--------|
| PROCESS_A | tool_a | ✅ Exists | Use existing |
| PROCESS_B | tool_b | ❌ Missing | Create tool |
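The table above can be generated from the check results instead of being typed by hand. A minimal sketch, assuming the findings are held in a dictionary that maps each process name to a (tool name, exists) pair; that data shape is an illustrative choice.

```python
def inventory_table(inventory: dict) -> str:
    """Render {process: (tool, exists)} as the inventory table used in this guide."""
    lines = [
        "| Process | Tool | Status | Action |",
        "|---------|------|--------|--------|",
    ]
    for process, (tool, exists) in inventory.items():
        status = "✅ Exists" if exists else "❌ Missing"
        action = "Use existing" if exists else "Create tool"
        lines.append(f"| {process} | {tool} | {status} | {action} |")
    return "\n".join(lines)
```

Rows appear in the order the processes were added, so inserting them in pipeline order keeps the table easy to compare against the Nextflow source.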
Step 3: Create Conversion Plan
Plan must include:
Tool inventory (from Step 2)
- Existing tools and their locations
- Missing tools that need creation
Tool creation strategy (for missing tools)
- Which should go in tools-iuc?
- Which should be custom?
- Ask user about tools-iuc access
Workflow structure
- Single workflow or multiple related workflows?
- If multiple: how do they differ by intent/inputs/outputs?
- Nested subworkflows?
- Where do you intentionally split to reduce conditionals/knobs?
Implementation order
- Which tools to create first
- Which workflows to build first
- Testing strategy
Present plan to user and wait for approval before implementing.
See: ../tool-sources.md for tool placement decisions
Step 4: Create Missing Tools
For each missing tool, use the nf-process-to-galaxy-tool skill.
Wait for all tools to be created and validated before continuing.
Step 5: Build Workflows
For each subworkflow, use the nf-subworkflow-to-galaxy-workflow skill.
Caveats (important for most pipelines):
UUIDs must be in proper UUID4 format:
- Galaxy validates that all uuid fields are valid UUID4 strings (e.g., 550e8400-e29b-41d4-a716-446655440000).
UUIDs must be unique across the entire workflow JSON:
- Galaxy rejects workflows if any uuid is duplicated (common failure: copying the workflow uuid into a step uuid, or reusing the same step UUID across multiple steps).
- Check uniqueness for:
  - Workflow-level uuid
  - Every steps.<n>.uuid
  - Every steps.<n>.workflow_outputs[*].uuid
- Validate before import (e.g., parse JSON and ensure the set of UUIDs has no duplicates).
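The UUID checks described above are easy to run before import. The sketch below works on a workflow already parsed from JSON (for example with json.load) and looks only at the three locations listed: the workflow-level uuid, each step uuid, and each workflow_outputs uuid. It does not descend into embedded subworkflows, which is a simplification.

```python
import uuid
from collections import Counter


def collect_uuids(workflow: dict) -> list:
    """Gather the workflow uuid, every step uuid, and every workflow_outputs uuid."""
    found = []
    if workflow.get("uuid"):
        found.append(workflow["uuid"])
    for step in workflow.get("steps", {}).values():
        if step.get("uuid"):
            found.append(step["uuid"])
        for output in step.get("workflow_outputs", []):
            if output.get("uuid"):
                found.append(output["uuid"])
    return found


def is_uuid4(value) -> bool:
    """True for a canonical, hyphenated, version-4 UUID string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return parsed.version == 4 and str(parsed) == value.lower()


def check_uuids(workflow: dict) -> list:
    """Return messages for malformed and duplicated UUIDs; an empty list means clean."""
    uuids = collect_uuids(workflow)
    problems = [f"not a valid UUID4: {value}" for value in uuids if not is_uuid4(value)]
    for value, count in Counter(uuids).items():
        if count > 1:
            problems.append(f"duplicated {count} times: {value}")
    return problems
```

A non-empty result should block the import attempt. Regenerating the offending values with uuid.uuid4() is usually the quickest fix.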
Interpret Galaxy warnings correctly: The user may provide a list of warnings produced by Galaxy for a workflow they try to import.
- Benign (often OK):
  - Warnings for optional advanced parameters (e.g. “Disable grouping… Using default False”).
  - These usually just indicate the workflow JSON omitted optional parameters and Galaxy filled defaults.
- Usually indicates a real workflow bug:
  - Warnings where the missing value is a dataset input (e.g. “BAM file … default ''”, “VCF Data … default ''”, “GFF dataset … default ''”).
  - This typically means:
    - input_connections uses the wrong parameter name, or
    - a conditional selector in tool_state was not set, so Galaxy expects a different input branch.
- Environment mismatch (action required):
  - “Tool is not installed” → the target Galaxy instance doesn’t have that exact tool_id.
  - “Using version X instead of version Y specified” → Galaxy substituted an installed version.
    - This is often OK, but you must re-check parameter names/outputs against the installed tool.
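The three warning categories above can be triaged with a simple text heuristic. This sketch keys on the phrases quoted in this section: "not installed" and "instead of version" for environment mismatches, and an empty-string default for dataset inputs. The exact wording of Galaxy's messages is an assumption here and may differ between releases, so unmatched warnings still deserve a manual look.

```python
import re

# An empty-string default (default '' or default "") suggests a missing dataset input.
EMPTY_DEFAULT = re.compile(r"default\s*(''|\"\")", re.IGNORECASE)


def classify_warning(message: str) -> str:
    """Sort one import warning into environment, likely-bug, or probably-benign."""
    text = message.lower()
    if "not installed" in text or "instead of version" in text:
        return "environment"
    if EMPTY_DEFAULT.search(message):
        return "likely-bug"
    return "probably-benign"


def triage(messages: list) -> dict:
    """Group warnings by category so the likely bugs can be handled first."""
    grouped = {"environment": [], "likely-bug": [], "probably-benign": []}
    for message in messages:
        grouped[classify_warning(message)].append(message)
    return grouped
```

Anything in the likely-bug group points back at input_connections keys or conditional selectors in tool_state.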
These two mistakes are extremely common; preempt them:
- Tool ID / owner / version mismatch:
  - The workflow may reference a real tool, but under the wrong owner or an uninstalled version.
  - Symptoms: “Tool is not installed”, “tool not available”, or silent version substitution.
  - Mitigation: verify ToolShed owner + inspect the tool XML for the exact tool_id + version.
- Conditional branch / input key mismatch:
  - The tool may have the correct upstream dataset available, but Galaxy still warns that the dataset input is empty.
  - Symptoms: “No value found for 'BAM file'… default ''” / “No value found for 'GFF dataset'… default ''”.
  - Mitigation: set the correct selector in tool_state and use the exact conditional parameter path in input_connections.
Tool forms often require setting conditional selectors in tool_state:
- Many tools expose inputs only after a selector is set (e.g. “use cached genome vs history dataset”, “single vs collection”, “region mode on/off”).
- Even if you wire input_connections correctly, Galaxy may still treat the input as missing unless the selector branch is chosen in tool_state.
Validation checkpoint (recommended):
- After generating the .ga, import into the target Galaxy and scan for:
  - Any “Tool is not installed” messages (fix tool IDs/owners/versions)
  - Any dataset-input warnings defaulting to empty (fix input_connections key names and/or conditional selectors)
  - Any version substitutions that might change parameter/output names
- If any dataset-input warnings remain, treat the workflow as not runnable until resolved.
Iterate using Galaxy’s import warnings report:
- Do not assume the user will proactively provide these warnings.
- If your first draft doesn’t import cleanly, ask the user to:
  - Import the .ga into their Galaxy instance
  - Copy/paste the resulting import warnings report
- Use that report to quickly identify which failures are:
  - Missing tools (instance mismatch)
  - Wrong tool IDs/owners/versions
  - Wrong input_connections keys
  - Missing conditional selectors in tool_state
Tool existence and repository owner must be verified:
- Tools can exist under multiple owners; wrong owner produces “tool not available” errors.
- Confirm tool IDs via ToolShed URLs and/or tool XML in the authoritative repository.
Tool existence must be verified (CRITICAL):
- NEVER reference a tool without verifying it exists in tools-iuc or target repository.
- Check the actual tool XML file to confirm tool ID and versions.
- NEVER use placeholder or assumed tool names.
- If a tool doesn't exist, inform user and discuss alternatives
Tool semantics must be validated (tool may exist but still be wrong) (CRITICAL):
- Do not stop at “the tool exists” — confirm it performs the intended transformation.
- If semantics are unclear, ask the user what behavior is expected before drafting the step.
- Example (CAPHEINE): seqkit_split2 exists, but it splits into parts/chunks; for “one FASTA record → one dataset in a collection”, use an appropriate splitter tool (e.g. ToolShed rnateam/splitfasta/rbc_splitfasta).
Tool input/output connections are REQUIRED (CRITICAL):
- Every tool step must have input_connections that wire it to upstream steps (except workflow inputs).
- Tools without connections will not receive data and will fail at runtime.
- Each connection must specify: upstream step id + exact output_name from that step.
- Read the actual tool XML to get exact input parameter names - do not guess.
- Parameter names often use conditional paths (e.g., reference_cond|reference_history not reference).
- Input names vary by tool (e.g., CAwlign uses fasta not query).
- Output names must match tool definitions (e.g., labeled_tree not output).
- Incorrect connections cause execution failures even if import succeeds.
- Explicitly tell user which connections need verification.
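Part of the connection rules above can be checked against the workflow JSON itself. The sketch below flags steps with no input_connections, connections that point at a step id that does not exist, and connections whose output_name is not declared by the upstream step. It assumes that input steps have type data_input, data_collection_input, or parameter_input and expose a single output named output. It cannot confirm that an input parameter name is correct, because that requires the tool XML.

```python
INPUT_STEP_TYPES = {"data_input", "data_collection_input", "parameter_input"}


def check_connections(workflow: dict) -> list:
    """Return human-readable problems with input_connections in a parsed .ga workflow."""
    steps = list(workflow.get("steps", {}).values())
    by_id = {step.get("id"): step for step in steps}
    problems = []
    for step in steps:
        if step.get("type") in INPUT_STEP_TYPES:
            continue  # workflow inputs have nothing upstream
        label = f"step {step.get('id')} ({step.get('name', 'unnamed')})"
        connections = step.get("input_connections") or {}
        if not connections:
            problems.append(f"{label}: no input_connections")
            continue
        for param, links in connections.items():
            # A parameter may be wired to one upstream output or to a list of them.
            for link in links if isinstance(links, list) else [links]:
                upstream = by_id.get(link.get("id"))
                if upstream is None:
                    problems.append(f"{label}: '{param}' points at unknown step {link.get('id')}")
                    continue
                if upstream.get("type") in INPUT_STEP_TYPES:
                    declared = {"output"}  # assumption: input steps expose one output named "output"
                else:
                    declared = {out.get("name") for out in upstream.get("outputs", [])}
                if link.get("output_name") not in declared:
                    names = sorted(name for name in declared if name)
                    problems.append(
                        f"{label}: '{param}' uses output '{link.get('output_name')}' "
                        f"but step {upstream.get('id')} declares {names}"
                    )
    return problems
```

The resulting list doubles as the set of connections to report to the user as needing verification. A clean result only means the wiring is internally consistent.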
Dataset collections vs individual datasets:
- Paired FASTQ inputs should use data_collection_input type with collection_type: "paired".
- Tools that process collections need collection-aware parameter names (check tool XML).
- Some tools iterate over collections, others process them as a unit.
- Collection outputs have type: "input" to preserve collection structure.
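To make the paired-input rule concrete, here is a sketch of a helper that builds a data_collection_input step for a paired collection. The surrounding field set (name, inputs, outputs, and so on) and the JSON-encoded tool_state string follow what exported .ga files commonly contain, but that layout is an assumption here; compare against a workflow exported from the target Galaxy before relying on it.

```python
import json
import uuid


def paired_collection_input(step_id: int, label: str) -> dict:
    """Build a workflow input step for a paired dataset collection (field layout assumed)."""
    return {
        "id": step_id,
        "type": "data_collection_input",
        "name": "Input dataset collection",
        "label": label,
        "annotation": "",
        "tool_id": None,
        # tool_state is stored as a JSON-encoded string, not a nested object.
        "tool_state": json.dumps({"optional": False, "collection_type": "paired"}),
        "input_connections": {},
        "inputs": [{"name": label, "description": ""}],
        "outputs": [],
        "workflow_outputs": [],
        "uuid": str(uuid.uuid4()),  # a fresh UUID4 for each step
    }
```

Generating the uuid inside the helper avoids the duplicate-UUID failure that comes from copying step definitions.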
Tool dependencies and helper steps:
- Some tools require pre-built databases (e.g., SnpEff needs database built from GTF, not raw GTF).
- Some tools require sorted/indexed input (e.g., variant callers need sorted BAM + BAI index).
- Add helper tools as needed (e.g., samtools sort/index after alignment).
- Check Nextflow workflow for these preprocessing steps.
Tool versions in drafted .ga files are often placeholders:
- If you have access to the target Galaxy instance (UI or API), resolve each step's tool_id/tool_version to what is actually installed (tool revisions and +galaxyN suffixes differ).
- If you cannot check the instance, use the most recent version of each tool, treat any tool_id/tool_version you emit as a placeholder, and explicitly tell the user they must verify/adjust versions against their Galaxy.
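When a list of installed tool IDs is available, resolving versions becomes a lookup. The sketch below assumes that list has already been obtained from the target instance by whatever means is available and is passed in as a set of full tool IDs; how to fetch it is deliberately left out. It reports, per tool step, whether the exact ID is installed, whether only another version is installed, or whether the tool is absent.

```python
def version_report(workflow: dict, installed_tool_ids: set) -> list:
    """Return (step id, tool_id, status) for every tool step in a parsed .ga workflow."""
    # Index installed IDs by everything before the final "/", i.e. without the version.
    installed_versions = {}
    for tool_id in installed_tool_ids:
        base, _, version = tool_id.rpartition("/")
        installed_versions.setdefault(base, set()).add(version)

    report = []
    for step in workflow.get("steps", {}).values():
        if step.get("type") != "tool":
            continue
        tool_id = step.get("tool_id") or ""
        base, _, _ = tool_id.rpartition("/")
        if tool_id in installed_tool_ids:
            status = "installed"
        elif base and base in installed_versions:
            status = "other version installed: " + ", ".join(sorted(installed_versions[base]))
        else:
            status = "not installed"
        report.append((step.get("id"), tool_id, status))
    return report
```

Steps reported as "other version installed" are the ones where parameter and output names need to be re-checked against the installed tool.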
Recommended structure:
- Break pipeline into logical workflow units
- Create subworkflows first
- Compose into main workflow(s)
Example (CAPHEINE):
Workflow 1: CAPHEINE_Preprocessing
- Input: Raw sequences
- Output: Alignment + Tree
Workflow 2: CAPHEINE_Analyses
- Input: Alignment + Tree
- Output: Analysis results + DRHIP report
Step 6: Integration Testing
Use the canonical testing docs:
- Tool testing (Planemo): ../../tool-dev/references/testing.md
- Workflow testing/validation (Galaxy instance): ../../galaxy-integration/galaxy-integration.md
../testing-and-validation.md is a short routing page that links to these.
At minimum:
- Upload pipeline test data
- Run workflows in sequence
- Compare outputs to Nextflow results (structural/semantic match)
- Document differences
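A first structural comparison of the two result sets can be scripted. The sketch below assumes the Nextflow outputs and the downloaded Galaxy outputs have been placed in two directories under matching relative file names, which usually requires a manual renaming step. It reports files present on only one side and files whose bytes differ. A byte difference is not necessarily a semantic difference (timestamps, header lines, and ordering often vary), so differing files still need a format-aware look.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict:
    """Map each file's path relative to root onto its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_outputs(nextflow_dir, galaxy_dir) -> dict:
    """Summarise which files match, differ, or exist on only one side."""
    nf = file_digests(Path(nextflow_dir))
    gx = file_digests(Path(galaxy_dir))
    shared = nf.keys() & gx.keys()
    return {
        "only_in_nextflow": sorted(nf.keys() - gx.keys()),
        "only_in_galaxy": sorted(gx.keys() - nf.keys()),
        "identical": sorted(name for name in shared if nf[name] == gx[name]),
        "different": sorted(name for name in shared if nf[name] != gx[name]),
    }
```

The returned dictionary is a convenient starting point for the "document differences" item: each entry under different or only_in_* needs a short explanation.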
Step 7: Documentation
Create documentation for users:
- Which workflows to run in what order
- Input data requirements
- Expected outputs
- Tool versions used
- Differences from Nextflow version (if any)
Quick Reference
Nextflow pipeline = Multiple Galaxy workflows + tools
Decomposition strategy:
- Identify logical workflow boundaries
- Check all tool availability
- Create missing tools (use nf-process-to-galaxy-tool)
- Build subworkflows (use nf-subworkflow-to-galaxy-workflow)
- Compose and test
Resources
All detailed guides are in parent directory (../):
- check-tool-availability.md - Tool availability checking
- tool-sources.md - Where to create tools
- workflow-to-ga.md - Workflow creation guide
- nextflow-galaxy-terminology.md - Conceptual mappings
- testing-and-validation.md - Routing page to canonical testing docs
Related skills:
- nf-process-to-galaxy-tool - For creating missing tools
- nf-subworkflow-to-galaxy-workflow - For building workflow components
Complete Example: CAPHEINE Pipeline
See ../examples/capheine-mapping.md for full CAPHEINE conversion.
Key findings:
- ✅ 15/15 tools exist in tools-iuc (100% coverage)
- ✅ No tool creation needed
- ✅ Conversion is purely workflow assembly
- Recommended: 2 workflows (preprocessing + analyses)
This demonstrates the ideal case. Many established pipelines rely on common tools that already exist in Galaxy, but coverage should still be verified tool by tool.