nextflow-pipeline-debugging

name: nextflow-pipeline-debugging
description: Guide for analyzing pipeline output and debugging Nextflow workflows. Use this when you need to inspect channel contents, trace process execution, or analyze intermediate files.

Analyzing and Debugging Pipeline Output

This guide covers techniques for analyzing pipeline output and debugging Nextflow workflows effectively.

Prerequisites

A Nextflow pipeline run (completed or still in progress)
Access to the pipeline working directory and results

Table of Contents

Inspecting Channel Contents
Using Workflow Trace Files
Analyzing the Results Folder
Working with the Work Directory
Common Debugging Strategies

Inspecting Channel Contents

Using `.view()` to Debug Channels

The .view() operator is the simplest way to inspect what's flowing through your channels during pipeline execution.

Basic usage:

// View all channel contents
ch_data.view()

// View with a custom label
ch_data.view { "Processing: $it" }

// View with structured output
ch_data.view { meta, file -> 
    "Sample: ${meta.id}, File: ${file.name}" 
}

When to use .view():

Debugging data structure issues (meta maps, file paths)
Verifying channel emissions after operators
Checking data flow between processes
Confirming multiplicity (how many items are emitted)

Example debugging scenario:

// Problem: Not sure what structure the channel has
ch_input
    .view { "Before map: $it" }  // Debug original structure
    .map { meta, bam, bai -> [meta, bam] }
    .view { "After map: $it" }   // Debug transformed structure
    .set { ch_processed }

Using Workflow Trace Files

Understanding Execution Traces

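When tracing is enabled (with the -with-trace command-line option or trace.enabled = true in the configuration), Nextflow writes a trace file with detailed information about each task execution.

Typical location in nf-core-style pipelines (configurable via trace.file):

results/pipeline_info/execution_trace_YYYY-MM-DD_HH-MM-SS.txt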

Reading Trace Files

The trace file is a tab-delimited file with columns including:

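task_id: Unique task identifier
hash: Abbreviated work directory hash (maps to work/XX/YYYYYY...)
native_id: Job or process ID assigned by the executor
name: Task name (process name plus tag, e.g. PROCESS_NAME (sample_id))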
status: COMPLETED, FAILED, CACHED, etc.
exit: Exit code (0 = success)
submit, start, complete: Timestamps
duration, realtime: Execution times
%cpu, %mem: Resource usage
rss, vmem, peak_rss, peak_vmem: Memory metrics
rchar, wchar: I/O metrics
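
Because the set and order of columns depends on the trace configuration, it helps to check column positions before writing awk filters. A minimal sketch, assuming a tab-delimited trace file with a header row; the function name trace_columns is a hypothetical helper, not part of Nextflow:

```shell
# Print each column name of a trace file next to its 1-based position,
# so awk field numbers ($2, $6, ...) can be checked against the header.
# Usage: trace_columns results/pipeline_info/execution_trace_<timestamp>.txt
trace_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}
```

The number printed next to exit or status is the field number to use in awk for that particular trace file.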

Finding Failed Tasks

Nextflow's terminal output reports any failed tasks together with their work directory hash. The trace file provides more detail about these tasks.

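# Find tasks that did not finish successfully (the header line is printed as well)
grep -v -e "COMPLETED" -e "CACHED" results/pipeline_info/execution_trace_*.txt

# Find tasks with non-zero exit codes
# (assumes the default column order: hash=$2, name=$4, status=$5, exit=$6;
#  FNR skips the header of every file matched by the glob)
awk -F'\t' 'FNR > 1 && $6 != 0 {print $2, $4, $5, $6}' results/pipeline_info/execution_trace_*.txt

# Find the work directory hash for a specific process
grep "PROCESS_NAME" results/pipeline_info/execution_trace_*.txt | awk -F'\t' '{print $2}'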
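
Hard-coded field numbers break when the trace is configured with a different set of columns. The following sketch looks up positions from the header row instead. It assumes the trace contains columns named hash, name, status and exit; failed_tasks is a hypothetical helper name:

```shell
# List tasks whose status is neither COMPLETED nor CACHED.
# Column positions are read from the header, not hard-coded.
# Usage: failed_tasks results/pipeline_info/execution_trace_<timestamp>.txt
failed_tasks() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 {
            # Header row: remember the position of every column name
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["status"]) != "COMPLETED" && $(col["status"]) != "CACHED" {
            print $(col["hash"]), $(col["name"]), $(col["status"]), $(col["exit"])
        }
    ' "$1"
}
```

The output is tab-separated (hash, name, status, exit), so it can be piped into cut or another awk call.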

Analyzing the Results Folder

Published Outputs

The results folder contains only the outputs that processes explicitly publish (for example with the publishDir directive). Use it to verify that a tool produces the expected outputs, check for anomalies, and compare results across samples.

Typical structure:

results/
├── pipeline_info/          # Trace, timeline, DAG, reports
├── [process_name]/         # Process-specific outputs
│   ├── sample1_output.txt
│   └── sample2_output.txt
└── multiqc/               # Quality control reports (if applicable)
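
Two quick checks for anomalies in published outputs are looking for empty files and comparing line counts across samples. This is a rough sketch that assumes text outputs laid out as in the tree above; empty_outputs and output_line_counts are hypothetical helper names:

```shell
# List published files that are empty (often a sign of a silent failure).
# Usage: empty_outputs results
empty_outputs() {
    find "$1" -type f -size 0 | sort
}

# Print "<line count><TAB><file name>" for each file in one output folder,
# which makes outlier samples easy to spot.
# Usage: output_line_counts results/<process_name>
output_line_counts() (
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            printf '%s\t%s\n' "$(wc -l < "$f" | tr -d ' ')" "$(basename "$f")"
        fi
    done
)
```

Line counts are only meaningful for text formats; for binary or compressed outputs, comparing file sizes with ls -l is a closer equivalent.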

Working with the Work Directory

Understanding the Work Directory

Each process execution creates a unique subdirectory in work/ containing:

Staging area: Input files (symlinks or copies)
Output files: All files generated by the process
.command.* files: Execution metadata and logs

Work directory structure:

work/
└── XX/
    └── YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY/
        ├── input_file.txt -> /path/to/actual/file
        ├── output_file.txt
        ├── .command.sh        # The actual script executed
        ├── .command.run       # Wrapper script (with container/env)
        ├── .command.out       # stdout
        ├── .command.err       # stderr
        ├── .command.log       # Combined log
        ├── .command.begin     # Sentinel file created when the task starts
        └── .exitcode          # Exit code

Finding the Work Directory for a Process

Method 1: Using execution trace

# Get the hash for a specific process/sample
grep "PROCESS_NAME.*sample_id" results/pipeline_info/execution_trace_*.txt | \
    awk -F'\t' '{print $2}'

# Navigate to the work directory. The trace shows an abbreviated hash (XX/YYYYYY),
# so append * to let the shell expand it to the full directory name.
cd work/[hash]*
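
The hash lookup can be wrapped in a small function. This sketch assumes the abbreviated XX/YYYYYY hash format from the trace and a work directory named work unless another path is passed; resolve_workdir is a hypothetical helper name:

```shell
# Expand an abbreviated task hash (e.g. "ab/123456") to the full work
# directory path. Prints nothing if no directory matches.
# Usage: resolve_workdir ab/123456 [path/to/work]
resolve_workdir() (
    hash=$1
    workroot=${2:-work}
    for d in "$workroot"/"$hash"*/; do
        if [ -d "$d" ]; then
            printf '%s\n' "${d%/}"    # strip the trailing slash from the glob
        fi
    done
)
```

For example, cd "$(resolve_workdir ab/123456)" changes into the matching task directory.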

Method 2: Using Nextflow CLI

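Nextflow prints the abbreviated work directory hash (for example [ab/123456]) next to each task in its terminal output, and the error report for a failed task includes the full work directory path.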

Debugging with Work Directory Files

Inspect what command was run:

cat .command.sh              # The actual command
cat .command.run             # Full execution wrapper (with container)

Check outputs and errors:

cat .command.out             # Standard output
cat .command.err             # Standard error
cat .command.log             # Combined log
cat .exitcode                # Exit code (0 = success)
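
The individual checks above can be combined into a single report for a task. This is a minimal sketch that assumes the .command.* files described earlier; task_report is a hypothetical helper name, and the 20-line tail is an arbitrary choice:

```shell
# Print a short report for one task work directory: exit code, the script
# that was run, and the last lines of stderr.
# Usage: task_report work/XX/YYYYYY...
task_report() (
    dir=$1
    if [ -f "$dir/.exitcode" ]; then
        printf 'exit code: %s\n' "$(cat "$dir/.exitcode")"
    else
        printf '%s\n' 'exit code: (no .exitcode file; the task may not have finished)'
    fi
    printf '%s\n' '--- .command.sh ---'
    if [ -f "$dir/.command.sh" ]; then cat "$dir/.command.sh"; fi
    printf '%s\n' '--- last 20 lines of .command.err ---'
    if [ -f "$dir/.command.err" ]; then tail -n 20 "$dir/.command.err"; fi
)
```

A missing .exitcode file is reported explicitly because it usually means the task was killed or is still running.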

Common Debugging Strategies

Start with the CLI output: Look for any error messages or failed tasks indicated in the terminal output during execution.
Use .view() to inspect channels: Add .view() operators at key points in your workflow to check the structure and contents of channels.
Check the execution trace: Use the trace files to find failed tasks, their work directory hashes, and resource usage patterns.
Inspect the work directory: For failed tasks, navigate to the corresponding work directory and check the command scripts, outputs, and logs for clues about what went wrong.
Compare outputs: If some samples succeed and others fail, compare the outputs and logs between them to identify differences that may indicate the issue.
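
The comparison in the last strategy can be sketched with diff on the work directories of one succeeding and one failing task. It assumes both directories contain .command.sh and .command.err; compare_tasks is a hypothetical helper name:

```shell
# Show how the executed script and stderr differ between two task
# work directories (typically one successful and one failed task).
# Usage: compare_tasks work/XX/<ok_hash> work/XX/<failed_hash>
compare_tasks() (
    ok_dir=$1
    bad_dir=$2
    for f in .command.sh .command.err; do
        printf '%s\n' "=== diff of $f ==="
        # diff exits with status 1 when the files differ, which is expected here
        diff "$ok_dir/$f" "$bad_dir/$f" || true
    done
)
```

Lines prefixed with < come from the first directory and lines prefixed with > from the second, so differences in input paths or parameters stand out directly.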