reads-mapping - SKILL.md Agent Skill

name: reads-mapping
description: Align ChIP-seq or ATAC-seq FASTQ files to a reference genome using Bowtie2, with strict input validation, library layout detection, output organization, and logging. Use it when raw sequencing reads must be converted into sorted, indexed BAM files before downstream QC, peak calling, or footprinting.

ChIP-seq / ATAC-seq Sequence Alignment (Bowtie2)

Overview

This skill performs core sequence alignment for ChIP-seq and ATAC-seq data starting from FASTQ files using Bowtie2. It is designed for autonomous execution with explicit user confirmation for biologically important parameters that must not be guessed.

Main steps include:

Refer to the Inputs & Outputs section and create the output architecture in Step 0.
Always ask the user for the reference genome/index to use. Never infer genome build by filename alone.
Always ask the user for assay type if it cannot be confidently determined from file names or project context.
Detect whether input data are paired-end or single-end.
Group FASTQ files into samples using file naming conventions.
Validate that the Bowtie2 index exists before running.
Run alignment with Bowtie2.
Convert SAM to BAM, then sort, index, and generate flagstat and idxstats reports.
For ATAC-seq, preserve metadata and record in the log that duplicate handling and Tn5 shifting are deferred to later skills unless explicitly requested here.
Write a per-sample parameter log file documenting all chosen options and their justification.

Inputs & Outputs

Inputs

Accepted FASTQ naming patterns include:

${sample}.fastq.gz
${sample}.fq.gz
${sample}_R1.fastq.gz
${sample}_R2.fastq.gz
${sample}_1.fastq.gz
${sample}_2.fastq.gz

Reference resources must be supplied by the user as one of the following:

/path/to/reference.fa
/path/to/bowtie2_index_prefix

Outputs

all_alignment/
  aligned_bam/
    ${sample}.sorted.bam
    ${sample}.sorted.bam.bai
  qc/
  logs/
    ${sample}_alignment.log
    ${sample}_used_parameters.txt
  temp/

All outputs must be placed under ${proj_dir} returned in Step 0.

Required User Inputs

The agent must ask for the following when missing:

Reference genome build (for example hg38, mm10, dm6)
Reference asset path:
- FASTA path for Bowtie2 index construction, or
- existing Bowtie2 index prefix
Assay type: ChIP-seq or ATAC-seq, if not obvious from file names or prior context
Number of threads, if the user has a compute preference; otherwise use the default of 8 (see Threads under Decision Logic)
Whether to keep intermediate SAM files; default is no

The agent must not guess:

genome build
reference annotation/version
assay type when ambiguous
single-end vs paired-end if file pairing is incomplete or inconsistent

Decision Logic

Assay Type

Use file and sample names only for tentative classification:

Names containing ATAC, OmniATAC, scATAC → likely ATAC-seq
Names containing TF or histone mark identifiers such as CTCF, MYC, H3K27ac, H3K4me3, H3K27me3 → likely ChIP-seq

If naming is ambiguous, ask the user.
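The tentative classification above can be sketched as a small name-matching helper. This is only an illustration: the function name guess_assay_type and the two regular expressions are assumptions built from the example identifiers listed above, not part of any tool named in this skill, and a None result means the agent must ask the user.

```python
import re
from typing import Optional

# Hypothetical keyword patterns derived from the examples in this section;
# extend them to match local naming conventions.
ATAC_PATTERN = re.compile(r"atac", re.IGNORECASE)
CHIP_PATTERN = re.compile(r"(ctcf|myc|h3k\d+(ac|me\d))", re.IGNORECASE)


def guess_assay_type(sample_name: str) -> Optional[str]:
    """Return 'atacseq', 'chipseq', or None when the name is ambiguous."""
    looks_atac = bool(ATAC_PATTERN.search(sample_name))
    looks_chip = bool(CHIP_PATTERN.search(sample_name))
    if looks_atac and not looks_chip:
        return "atacseq"
    if looks_chip and not looks_atac:
        return "chipseq"
    # No hint, or conflicting hints: the caller must ask the user.
    return None
```

The result is a guess to be confirmed, never a final decision; short patterns such as "myc" can match unrelated names.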

Library Layout

Use FASTQ grouping rules:

If files appear as ${sample}_R1 and ${sample}_R2 or ${sample}_1 and ${sample}_2 → paired-end
If exactly one FASTQ exists for a sample and its name carries no mate suffix → single-end
If an R1 (or _1) file exists without its R2 (or _2) mate, or the reverse → stop and ask the user to resolve missing mates
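The grouping rules above can be sketched as follows. The function name group_fastqs, the regular expression, and the "unresolved" label are illustrative assumptions; the actual behavior of mcp__bowtie2-tools__detect_fastq_samples is not shown in this document and may differ. The pattern also accepts .fq.gz for mate-suffixed files, which is a small extension of the listed names.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches sample[_R1|_R2|_1|_2].(fastq|fq).gz
FASTQ_RE = re.compile(
    r"^(?P<sample>.+?)(?:_R?(?P<mate>[12]))?\.(?:fastq|fq)\.gz$"
)


def group_fastqs(filenames):
    """Group FASTQ file names by sample and classify each sample's layout."""
    mates = defaultdict(dict)
    for name in filenames:
        match = FASTQ_RE.match(Path(name).name)
        if match is None:
            continue  # not a supported FASTQ name
        mate = match.group("mate") or "single"
        mates[match.group("sample")][mate] = name

    summary = {}
    for sample, files in mates.items():
        if set(files) == {"1", "2"}:
            layout = "paired-end"
        elif set(files) == {"single"}:
            layout = "single-end"
        else:
            # Lone mate, or a mix of suffixed and unsuffixed files.
            layout = "unresolved"
        summary[sample] = {"layout": layout, "files": files}
    return summary
```

Any sample reported as "unresolved" corresponds to the stop-and-ask case. A sample name that itself ends in _1 or _2 is inherently ambiguous under these patterns and also needs user confirmation.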

Threads

Default to 8 threads unless the user specifies otherwise

Step-by-Step Workflow

Step 0: Initialize Project

Create a task directory for alignment outputs.

Suggested call:

mcp__project-init-tools__project_init

with:

sample: all
task: alignment
genome: provided by user

The tool will return ${proj_dir}. Use it for all output placement.

If a project-init MCP tool is not available in the runtime, create this directory structure manually:

all_alignment/
  aligned_bam/
  qc/
  logs/
  temp/

Set ${proj_dir} to all_alignment.
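When the project-init tool is unavailable, the manual fallback could look like the sketch below. The helper name init_alignment_dirs and its base_dir parameter are hypothetical; only the directory names come from this document.

```python
from pathlib import Path

# Subdirectories listed in the manual fallback structure above.
SUBDIRS = ("aligned_bam", "qc", "logs", "temp")


def init_alignment_dirs(base_dir="."):
    """Create all_alignment/ and its subdirectories; return the project dir."""
    proj_dir = Path(base_dir) / "all_alignment"
    for sub in SUBDIRS:
        # exist_ok makes the call safe to repeat on an existing project.
        (proj_dir / sub).mkdir(parents=True, exist_ok=True)
    return proj_dir
```

The returned path plays the role of ${proj_dir} in the later steps.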

Step 1: Detect and Group FASTQ Files

Call:

mcp__bowtie2-tools__detect_fastq_samples

with:

input_dir: directory containing FASTQ files

The tool will:

find supported FASTQ files
group them into samples
detect single-end vs paired-end layout
report any missing mate files
return a machine-readable sample summary

Rules:

Prefer compressed FASTQ (*.fastq.gz, *.fq.gz)
Use consistent sample grouping
Fail if file naming is inconsistent or ambiguous

Step 2: Validate or Build Bowtie2 Reference Assets

If the user supplied a FASTA, call:

mcp__bowtie2-tools__build_bowtie2_index

with:

reference_fasta: user-provided FASTA path
index_prefix: desired Bowtie2 index prefix

If the user supplied an existing Bowtie2 prefix, validate it before alignment by calling:

mcp__bowtie2-tools__validate_bowtie2_index

with:

index_prefix: user-provided index prefix

Stop and ask the user to correct the path if validation fails.
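As a rough idea of what validating an index prefix involves, the sketch below checks for the six files that bowtie2-build writes for a prefix (suffix .bt2 for small indexes, .bt2l for large ones). The function name bowtie2_index_exists is hypothetical, and the check is an assumption about a reasonable validation, not a description of mcp__bowtie2-tools__validate_bowtie2_index.

```python
from pathlib import Path

# bowtie2-build writes six files per index prefix.
INDEX_PARTS = ("1", "2", "3", "4", "rev.1", "rev.2")


def bowtie2_index_exists(index_prefix: str) -> bool:
    """Return True if a complete small (.bt2) or large (.bt2l) index exists."""
    for ext in ("bt2", "bt2l"):
        if all(
            Path(f"{index_prefix}.{part}.{ext}").is_file()
            for part in INDEX_PARTS
        ):
            return True
    return False
```

A file-presence check does not confirm that the index matches the genome build the user named, so the build still has to be confirmed with the user.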

Step 3: Align Each Sample

For each detected sample, call:

mcp__bowtie2-tools__run_bowtie2_alignment

with:

sample_name: sample identifier
fastq_r1: path to R1 FASTQ or single-end FASTQ
fastq_r2: path to R2 FASTQ for paired-end data, otherwise omit
assay_type: chipseq or atacseq
index_prefix: validated Bowtie2 index prefix
out_dir: ${proj_dir}/aligned_bam
log_dir: ${proj_dir}/logs
threads: user-specified or default 8
keep_sam: false by default

Tool behavior:

run Bowtie2
write an alignment log
convert to BAM
sort BAM
index BAM
remove intermediate SAM unless keep_sam=true

Expected output:

${proj_dir}/aligned_bam/${sample}.sorted.bam
${proj_dir}/aligned_bam/${sample}.sorted.bam.bai
${proj_dir}/logs/${sample}_alignment.log

Notes:

For ATAC-seq, this skill performs alignment only. Duplicate marking/removal, mitochondrial filtering, proper-pair filtering, and Tn5 shifting belong to downstream preprocessing or peak-calling skills unless explicitly requested in another skill.
Do not remove duplicates here unless the user specifically requests an alternate alignment workflow.
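For orientation, the alignment, sorting, and indexing steps listed under Tool behavior map roughly onto the command lines built below. This is a minimal sketch under stated assumptions: build_alignment_commands is a hypothetical helper, the internals of mcp__bowtie2-tools__run_bowtie2_alignment are not shown in this document, and no assay-specific Bowtie2 options are included because none are specified here.

```python
from pathlib import Path


def build_alignment_commands(sample, index_prefix, fastq_r1, fastq_r2=None,
                             out_dir="all_alignment/aligned_bam", threads=8):
    """Return argument lists for bowtie2, samtools sort, and samtools index."""
    out = Path(out_dir)
    sam = out / f"{sample}.sam"
    bam = out / f"{sample}.sorted.bam"

    bowtie2 = ["bowtie2", "-p", str(threads), "-x", index_prefix]
    if fastq_r2:
        bowtie2 += ["-1", fastq_r1, "-2", fastq_r2]  # paired-end
    else:
        bowtie2 += ["-U", fastq_r1]                  # single-end
    bowtie2 += ["-S", str(sam)]

    return [
        bowtie2,
        # samtools sort reads the SAM and writes sorted BAM, so conversion
        # and sorting happen in one command in this sketch.
        ["samtools", "sort", "-@", str(threads), "-O", "bam",
         "-o", str(bam), str(sam)],
        ["samtools", "index", str(bam)],
    ]
```

Each list can be passed to subprocess.run(cmd, check=True). Bowtie2 prints its alignment summary to stderr, so redirecting stderr of the first command to ${proj_dir}/logs/${sample}_alignment.log would capture it.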

Step 4: Write Parameter Log

For each sample, the agent must write:

${proj_dir}/logs/${sample}_used_parameters.txt

Example content:

Sample: ATAC_rep1
Assay type: ATAC-seq
Library layout: paired-end
Aligner: bowtie2
Reference genome build: hg38
Reference index: /refs/hg38/bowtie2/hg38
Threads: 8
Intermediate SAM kept: no

Reasoning:
- Sample name contains ATAC, so assay classified as ATAC-seq
- Paired FASTQ mates were detected automatically
- User provided hg38 Bowtie2 index
- Alignment-only workflow selected; duplicate handling and Tn5 shifting deferred to downstream preprocessing/peak-calling
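A parameter log in the layout shown above could be produced with a helper like this one. The name format_parameter_log and its arguments are illustrative assumptions; the keys and reasoning lines are whatever the agent actually chose for the sample.

```python
def format_parameter_log(params: dict, reasoning: list) -> str:
    """Render 'Key: value' lines followed by a bulleted Reasoning section."""
    lines = [f"{key}: {value}" for key, value in params.items()]
    lines.append("")
    lines.append("Reasoning:")
    lines.extend(f"- {item}" for item in reasoning)
    return "\n".join(lines) + "\n"
```

The returned text can then be written to ${proj_dir}/logs/${sample}_used_parameters.txt, for example with pathlib.Path.write_text.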

Failure Handling

Stop execution and ask the user for correction if any of the following occurs:

no FASTQ files found
inconsistent file naming prevents sample grouping
paired-end mate missing
reference FASTA or index prefix missing
bowtie2 or bowtie2-build executable not found in PATH
samtools not found in PATH
alignment command returns non-zero exit status
sorted BAM or BAM index is not created

Do not continue to downstream QC if alignment fails.
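The executable checks in the list above can be run before any alignment starts. The sketch below uses the standard-library shutil.which; the helper name missing_executables and its injectable which parameter are assumptions added for illustration and testing.

```python
import shutil

# Executables named in the failure conditions above.
REQUIRED_TOOLS = ("bowtie2", "bowtie2-build", "samtools")


def missing_executables(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the required tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

A non-empty result is a reason to stop and report the missing tools to the user rather than attempt alignment.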

Exact MCP Tool Calls Required

mcp__bowtie2-tools__detect_fastq_samples
mcp__bowtie2-tools__build_bowtie2_index (only when user provides FASTA)
mcp__bowtie2-tools__validate_bowtie2_index
mcp__bowtie2-tools__run_bowtie2_alignment

When the Agent Must Ask the User

The agent must ask before execution when any of the following are missing or ambiguous:

genome build
reference FASTA or Bowtie2 index prefix
assay type when not inferable from context
whether incomplete FASTQ pairs should be excluded or fixed
thread count if a project-specific compute policy exists

The agent must not invent these values.