name: bio-read-alignment-bwa-alignment description: Aligns DNA short reads (paired- or single-end) to a reference genome with bwa-mem2, the maintained successor to BWA-MEM, for WGS/WES and germline/somatic variant-calling pipelines; covers index build, read-group injection, the collate/fixmate/sort/markdup ordering, soft-clipping for SV split reads, ALT/decoy-aware mapping on GRCh38, -K determinism, and streaming straight to a sorted BAM. Use when mapping DNA short reads to a reference for variant calling, coverage, ChIP/ATAC (alongside bowtie2-alignment), or SV detection. RNA-seq spliced alignment is star-alignment/hisat2-alignment; BAM sort/dedup/stats, the QC gate, and the cross-tool MAPQ scale are alignment-files; read trimming is read-qc; counting reads over features is rna-quantification. tool_type: cli primary_tool: bwa-mem2
Version Compatibility
Reference examples tested with: bwa-mem2 2.2.1+, bwa 0.7.17+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI:
<tool> --versionthen<tool> --helpto confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
BWA-MEM2 Alignment -- Read Groups, the Reference, and the Output Contract Decide Downstream Truth
"Align my DNA reads" -> Map reads to a reference with seed-and-extend, inject read groups, and stream to a sorted BAM -- because the aligner is the easy part: the reference analysis set, the @RG metadata, the -M/-Y output flags, and the dedup ordering are the decisions that silently determine whether variant calling works.
- CLI:
bwa-mem2 mem -t 8 -R '@RG\tID:s1\tSM:s1\tPL:ILLUMINA\tLB:lib1' ref.fa R1.fq.gz R2.fq.gz | samtools sort -@4 -o aligned.bam -
Scope: DNA short-read mapping with bwa-mem2 (and original BWA-MEM / bwa-aln), the GRCh38 analysis-set choice (below), and the output contract it must satisfy for GATK/DeepVariant/SV callers. BAM sort/dedup/index/stats, the QC gate (flagstat/idxstats interpretation), and contig-naming reconciliation -> alignment-files. The cross-tool MAPQ scale -> alignment-files/sam-bam-basics. Read trimming/QC -> read-qc. Variant calling -> variant-calling. Structural-variant calling consumes the split reads produced here -> variant-calling/structural-variant-calling. OUT OF SCOPE: RNA (use star-alignment/hisat2-alignment), long reads (long-read-sequencing/long-read-alignment), bisulfite (methylation-analysis/bismark-alignment).
The Single Most Important Modern Insight
- Read groups are a hard contract, not metadata decoration. GATK (HaplotypeCaller/Mutect2/BQSR) and Picard refuse to run or behave wrongly without
@RG. SM names the sample callers group by; ID is the BQSR error-model unit; PL sets the error model; LB is the unit MarkDuplicates dedups within (two reads at the same coordinate from different libraries are NOT duplicates). Inject them at mapping time with-R; adding them later (Picard AddOrReplaceReadGroups) is a full BAM rewrite. The catastrophic error is a clean-looking BAM that GATK rejects or mis-merges because SM/LB are missing. - On GRCh38, ALT/decoy handling is a correctness decision and a blind +ALT mapping is actively worse than no ALT. ALT contigs are alternate haplotypes of hyperpolymorphic loci (MHC/HLA); a read matches the primary copy AND the ALT copy, becomes a multimapper, and MAPQ collapses to 0 -- so variant callers drop it and HLA variants vanish. bwa-mem rescues this only if the
<idxbase>.altfile is present (it then scores non-ALT hits against non-ALT hits only); adding ALT contigs to the FASTA WITHOUT the.altimports the ambiguity and discards the fix. Use a decoy (hs38d1) always -- it absorbs reads from sequence missing from the primary assembly that would otherwise mismap and make recurrent false SNPs. The analysis-set layering is below. - The -M/-Y flags and the dedup ordering are downstream contracts that fail silently.
-Mmarks split (chimeric) pieces as secondary (0x100) for legacy Picard -- but modern tools read supplementary (0x800) natively, and-Mhides the split-read evidence SV callers need, so do NOT use it for SV work; use-Y(soft-clip supplementary) so every split piece keeps its full sequence. Duplicate marking has a strict order:collate (name) -> fixmate -m -> sort (coordinate) -> markdup; the-madds the mate-score/MC tags markdup requires, fixmate needs mates adjacent, markdup needs coordinate order. Any other order silently produces wrong duplicate flags.
What MAPQ Means Here (and what it does not)
bwa-mem MAPQ runs 0-60 and is an ordinal confidence rank derived from the gap between the best and second-best alignment score and the seed coverage -- it is NOT a calibrated -10*log10 P(wrong). It clusters bimodally at 0 (a competing locus scores as well -> multimapper) and 60 (no competitor, good seed support); the middle is sparse. It models only repeat ambiguity, not contamination or heuristic error, so MAPQ 60 means "no competing locus found," not "P(wrong)=1e-6". A -q 20 pre-filter drops most multimappers; rely on the caller's own MAPQ handling (GATK ignores MAPQ-0 and duplicates) rather than double-filtering aggressively. This scale is bwa-specific -- a -q 60 "unique" filter empties a Bowtie2 (cap 42/44) or STAR (255=unique) BAM; see alignment-files/sam-bam-basics for the cross-tool scale.
Tool Taxonomy
| Tool / subcommand | Citation | Mechanism / role | When |
|---|---|---|---|
bwa-mem2 mem |
Vasimuddin 2019 IEEE IPDPS 314-324 | architecture-aware reimplementation of BWA-MEM; near-identical output, ~1.5-3x faster, ~2x RAM, different index | the default DNA aligner today |
bwa mem |
Li 2013 arXiv:1303.3997 | SMEM seed + chain + banded affine SW; the reference implementation defining "correct" | when bwa-mem2 is unavailable or to match a legacy pipeline |
bwa aln + samse/sampe |
Li & Durbin 2009 Bioinformatics 25:1754 | bounded backtracking, no exact-seed requirement | ancient DNA / very short (<70 bp) damaged reads where the 19-mer seed fails |
bwa bwasw |
Li & Durbin 2010 Bioinformatics 26:589 | SW over BWT for long/divergent queries | legacy, superseded by mem; rarely used |
| DRAGEN / Parabricks fq2bam | -- | hardware-accelerated BWA-MEM (FPGA/GPU); concordant, not bit-identical | high-throughput production where a validated pipeline accepts the small delta |
| STAR / HISAT2 | -- | splice-aware (route OUT) | any RNA library -> star-alignment, hisat2-alignment |
Decision Tree by Scenario
| Scenario | Recommended | Why |
|---|---|---|
| Human WGS/WES germline or somatic | bwa-mem2 mem + GRCh38 + decoy (+ALT/postalt if HLA matters); add -Y |
seed-and-extend standard; GATK/DeepVariant expect its soft-clipped, ALT-aware output |
| SV / split-read detection | bwa-mem2 mem -Y (no -M) |
supplementary alignments with full soft-clipped sequence are the split-read signal |
| Reproducible / functional-equivalence pipeline | add -K 100000000 (and -Y) |
pins per-batch insert-size estimation so output is thread-count-invariant |
| ChIP-seq / CUT&RUN | bwa-mem2 or bowtie2 | both work; bowtie2 is the ENCODE peak-assay default -> bowtie2-alignment |
| ATAC-seq | bowtie2 (local/dovetail) | adapter read-through + short fragments favor local mode -> bowtie2-alignment |
| Ancient / very short damaged DNA | bwa aln -l 1024 -n ~0.01-0.03 -o 2 + samse/sampe |
mem's mandatory 19-mer seed fails on short damaged reads, biasing toward reference |
| Allele-specific (ASE / allelic binding) | bwa-mem2 then WASP-filter | reads carrying the alt allele align worse -> reference bias at het sites |
| RNA-seq | route OUT to star-alignment / hisat2-alignment | spliced reads need an N-CIGAR-capable aligner |
Default when uncertain: bwa-mem2 mem with read groups, streamed to a coordinate-sorted BAM, on a decoy-containing GRCh38 analysis set; add -Y for any pipeline that calls SVs.
Build Index
bwa-mem2 index reference.fa
# emits reference.fa.0123 .amb .ann .bwt.2bit.64 .pac (NOT interchangeable with `bwa index`'s .bwt/.sa)
For human DNA, obtain the decoy/ALT analysis set rather than a bare GRCh38: bwakit's run-gen-ref hs38DH downloads hs38DH.fa (primary + ALT + hs38d1 decoy + HLA) AND the hs38DH.fa.alt lift file, which must sit next to the index basename for ALT-aware mapping; the GATK/Broad resource bundle ships an equivalent decoy reference. Pick the layer by what the downstream caller can handle: no-alt + decoy (hs38) when the caller is not ALT-aware (the safe minimum), or full + decoy (hs38DH) with ALT-aware mapping plus bwa-postalt.js when MHC/HLA accuracy matters -- mismatching mapper-awareness to the analysis set silently degrades MHC/segdup accuracy. Use GRCh38 over GRCh37, and never mix builds across a cohort.
Align with Read Groups, Stream to a Sorted BAM
bwa-mem2 mem -t 8 \
-R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
reference.fa reads_1.fq.gz reads_2.fq.gz | \
samtools sort -@ 4 -o aligned.sorted.bam -
samtools index aligned.sorted.bam
# single-end: drop reads_2.fq.gz. The literal \t in -R must survive the shell (single quotes do this).
Mark Duplicates (the strict ordering)
# collate -> fixmate -m -> sort -> markdup. -m adds the ms/MC tags markdup requires.
bwa-mem2 mem -t 8 -R '@RG\tID:s1\tSM:s1\tPL:ILLUMINA\tLB:lib1' reference.fa R1.fq.gz R2.fq.gz | \
samtools collate -@ 4 -O -u - | \
samtools fixmate -m -@ 4 -u - - | \
samtools sort -@ 4 -u - | \
samtools markdup -@ 4 - aligned.markdup.bam
samtools index aligned.markdup.bam
# NEVER run this on amplicon/multiplex-PCR data: identical primer-defined ends are by design, so
# markdup deletes almost all real coverage. Use UMIs (fgbio/UMI-tools) for PCR-dup removal there.
SV / Split-Read Mapping
# -Y soft-clips supplementary alignments so every split piece keeps its full sequence (SA-tag chain).
# Do NOT add -M (it demotes split pieces to secondary and hides the SV evidence).
bwa-mem2 mem -t 8 -Y -R '@RG\tID:s1\tSM:s1\tPL:ILLUMINA' reference.fa R1.fq.gz R2.fq.gz | \
samtools sort -@ 4 -o aligned.sv.bam -
Reproducible Output Across Thread Counts
# Per-batch insert-size estimation makes default multithreaded output vary; -K pins the batch size.
bwa-mem2 mem -t 8 -K 100000000 -Y \
-R '@RG\tID:s1\tSM:s1\tPL:ILLUMINA\tLB:lib1' \
reference.fa R1.fq.gz R2.fq.gz | samtools sort -@ 4 -o aligned.bam -
Ancient / Very Short Damaged DNA
# mem's 19-mer seed fails on short deaminated reads; bwa aln backtracks with no exact-seed requirement.
bwa index reference.fa
bwa aln -l 1024 -n 0.02 -o 2 -t 8 reference.fa reads.fq.gz > reads.sai
bwa samse reference.fa reads.sai reads.fq.gz | samtools sort -o ancient.bam -
# -l 1024 disables the seed (seed longer than the read); -n relaxes edit distance; -o 2 allows gaps.
# The -n value is benchmark-dependent (commonly ~0.01-0.03); treat as a tunable starting point.
Key Parameters (verified against bwa/bwa-mem2 source defaults)
| Flag | Default | Effect |
|---|---|---|
| -t | 1 | threads |
| -k | 19 | min seed (SMEM) length; lower = more sensitive on short/divergent reads, slower; the reason aDNA defeats mem |
| -r | 1.5 | re-seed a MEM longer than k*1.5; lower = more sensitive, slower |
| -c | 500 | discard a seed occurring > N times (NOT 10000; the old sourceforge page is stale) |
| -A / -B / -O / -E | 1 / 4 / 6 / 1 | match / mismatch / gap-open / gap-extend; lower -B for divergent data, lower -O/-E to permit longer indels |
| -L | 5 | clip penalty; higher discourages soft-clipping (toward end-to-end) |
| -T | 30 | min alignment score to OUTPUT (below it the read is reported unmapped) |
| -M | off | mark split hits secondary (legacy Picard); harmful for SV -- prefer -Y |
| -Y | off | soft-clip supplementary alignments (keep full sequence for SV callers) |
| -K | auto | input bases per batch; fix it (e.g. 100000000) for thread-count-invariant output |
| -R | -- | the @RG header line (literal \t) |
Per-Method Failure Modes
Missing read groups
Trigger: mapping without -R '@RG...'. Mechanism: GATK groups reads by SM and models error by ID/LB. Symptom: GATK errors ("no read group"), or MarkDuplicates/BQSR misbehave; samples cannot be told apart. Fix: inject SM/ID/PL/LB at mapping time; Picard AddOrReplaceReadGroups is a rewrite if missed.
GRCh38 + ALT without the .alt file
Trigger: ALT contigs in the FASTA but no <idxbase>.alt. Mechanism: ALT copies become ordinary contigs, so every MHC/HLA read multimaps and MAPQ collapses to 0. Symptom: a MAPQ-0 spike and missing variants in MHC/HLA. Fix: supply the .alt (alt-aware mapping) + bwa-postalt.js, or use a no-alt + decoy analysis set if not ALT-aware (see Build Index above).
-M used in an SV pipeline
Trigger: bwa-mem2 mem -M ... then split-read SV calling. Mechanism: -M marks split pieces secondary (0x100), which SV callers and MarkDuplicates skip. Symptom: SV caller finds little split-read support. Fix: drop -M, add -Y, then dedup, then call -> variant-calling/structural-variant-calling.
fixmate without -m, or wrong dedup order
Trigger: samtools markdup on a name-sorted BAM, or fixmate without -m. Mechanism: markdup needs coordinate order plus the ms/MC tags -m writes. Symptom: silently wrong duplicate flags; over- or under-marking. Fix: collate -> fixmate -m -> sort -> markdup -> alignment-files/duplicate-handling.
Index built with the wrong tool
Trigger: pointing bwa-mem2 at a bwa index directory or vice versa. Mechanism: bwa-mem2 uses .bwt.2bit.64; bwa uses .bwt/.sa. Symptom: "fail to locate the index" / format error. Fix: rebuild with the matching tool (bwa-mem2 index).
Reference bias at heterozygous sites
Trigger: allele-specific analysis (ASE, allelic ChIP/ATAC, low-VAF somatic) off a raw BAM. Mechanism: the alt-allele read carries an extra mismatch, sometimes failing the 19-mer seed -> reference allele over-counted. Symptom: false allelic imbalance toward reference; depressed alt-allele VAF. Fix: WASP-filter (re-map allele-swapped reads, keep only if placement is unchanged), N-mask, a personalized reference, or a SNP-graph/pangenome.
Quantitative Thresholds
| Threshold | Source | Rationale |
|---|---|---|
| min seed -k 19 | bwamem.c::mem_opt_init | sets the sensitivity floor; the reason short damaged aDNA reads fail mem |
| seed cap -c 500 | bwamem.c::mem_opt_init | above this a seed is uninformative and explodes extension; verified default (not 10000) |
| scoring -A1 -B4 -O6 -E1 | bwamem.c::mem_opt_init | the substitution-rate the scheme tolerates is ~0.75*exp(-log(4)*B/A) |
| min output score -T 30 | bwamem.c::mem_opt_init | reads scoring below 30 are reported unmapped |
| -K 100000000 for reproducibility | CCDG / functional-equivalence pipelines | pins per-batch insert-size estimation independent of thread count |
| bwa-mem2 human index/runtime RAM ~10 GB | bwa-mem2 README (approximate) | ~2x original bwa; plan node memory accordingly |
| GRCh38 + hs38d1 decoy for human WGS | GATK/CCDG guidance | decoy absorbs missing-assembly reads that otherwise make recurrent false variants |
Common Errors
| Error / symptom | Cause | Solution |
|---|---|---|
| "fail to locate the index" | bwa vs bwa-mem2 index mismatch | rebuild with the matching tool (bwa-mem2 index) |
| GATK "no read group" | missing -R '@RG...' |
inject SM/ID/PL/LB at mapping time |
| Low mapping rate | wrong reference/species, un-trimmed adapter, contamination | confirm species with read-qc/contamination-screening; check the reference build; trim first -> read-qc; interpret stats -> alignment-files/bam-statistics |
| MAPQ-0 spike, missing HLA variants | GRCh38+ALT without .alt, or no decoy |
use decoy + ALT-aware mapping/postalt, or no-alt+decoy (see Build Index) |
| Wrong/unstable BAM across runs | default per-batch insert estimation with multithreading | add -K 100000000 and fix input order |
| Almost all reads marked duplicate | MarkDuplicates run on amplicon/PCR data | do not dedup amplicon; use UMIs -> read-qc/umi-processing, alignment-files/duplicate-handling |
| markdup mis-marks duplicates | name-sorted input or fixmate without -m |
collate -> fixmate -m -> sort -> markdup |
References
- Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997.
- Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754-1760.
- Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589-595.
- Vasimuddin M, Misra S, Li H, Aluru S. 2019. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. IEEE IPDPS 2019:314-324.
- Li H, Handsaker B, Wysoker A, et al. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078-2079.
- van de Geijn B, McVicker G, Gilad Y, Pritchard JK. 2015. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat Methods 12:1061-1063.
Related Skills
- bowtie2-alignment - ChIP/ATAC DNA mapping with end-to-end vs local modes
- star-alignment - RNA splice-aware alignment (when reads cross junctions)
- read-qc/fastp-workflow - Trim and QC reads before alignment
- read-qc/umi-processing - UMI extraction/dedup for amplicon and low-input libraries (do not coordinate-dedup amplicon)
- alignment-files/duplicate-handling - Mark/remove duplicates; UMI-aware dedup
- alignment-files/sam-bam-basics - SAM flags, CIGAR, the cross-tool MAPQ scale, SA tags
- alignment-files/bam-statistics - flagstat/idxstats/stats QC gate; what a high mapping rate hides
- variant-calling/variant-calling - Call variants from the aligned BAM
- variant-calling/structural-variant-calling - SV calling from split/discordant reads (needs -Y)