name: bio-read-alignment-star-alignment description: Aligns RNA-seq reads to a genome with STAR, the fast splice-aware aligner whose splice-junction database (built from a GTF at sjdbOverhang = readlength-1) and two-pass mode set junction sensitivity, whose 255-for-unique MAPQ breaks GATK, and whose GeneCounts output reveals library strandedness. Use when RNA reads must be placed on the genome for novel-isoform discovery, fusion detection, RNA variant calling, coverage tracks, splicing QC, or single-cell (STARsolo). Memory-constrained RNA alignment is hisat2-alignment; DE on known transcripts only should skip alignment for rna-quantification/alignment-free-quant; the QC gate and contig-naming reconciliation are alignment-files; counting is rna-quantification; DNA is bwa-alignment/bowtie2-alignment. tool_type: cli primary_tool: STAR
Version Compatibility
Reference examples tested with: STAR 2.7.11+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI:
<tool> --versionthen<tool> --helpto confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
STAR RNA-seq Alignment -- The Junction Database, the 255 MAPQ, and the Strand Column Decide the Result
"Align my RNA-seq reads" -> Map reads across exon-exon junctions to the genome, building or reusing a splice-aware index from a GTF -- because an RNA read spans introns the genome does not contain, so the junction database (and its sjdbOverhang), the two-pass choice, the 255-unique MAPQ, and the strand column STAR reports are what actually determine the downstream counts and variant calls.
- CLI:
STAR --runMode alignReads --genomeDir idx/ --readFilesIn R1.fq.gz R2.fq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate
Scope: RNA splice-aware mapping with STAR -- index generation, two-pass, GeneCounts, chimeric/fusion output, and the MAPQ/strandedness traps. Contig-naming reconciliation and the QC gate -> alignment-files. Low-memory RNA alignment -> hisat2-alignment. Counting reads over genes/transcripts -> rna-quantification. Single-cell droplet/plate counting (STARsolo) -> single-cell. DE on known transcripts without a BAM -> rna-quantification/alignment-free-quant. OUT OF SCOPE: DNA (bwa-alignment/bowtie2-alignment), long reads (long-read-sequencing/long-read-alignment).
The Single Most Important Modern Insight
- The splice-junction database is part of the index/run config, and a wrong sjdbOverhang or per-sample two-pass silently changes the answer. STAR builds short artificial junction-flank sequences from the GTF so reads with a short overhang on one exon can still be placed;
--sjdbOverhangsets the flank length and should equalmax(readlength) - 1(default 100 is fine near 100 bp reads but degrades junction sensitivity for very short reads). Two-pass (--twopassMode Basic) discovers novel junctions and re-aligns, raising novel-junction sensitivity -- but it is PER-SAMPLE: each sample is re-aligned against its own augmented index, so a junction found only in the deep/disease sample is rescued asymmetrically, a batch effect that confounds junction/splicing/sQTL comparisons. The cohort-correct recipe is to pool every sample's pass-1SJ.out.tab, filter, and feed one common--sjdbFileChrStartEndto a uniform second pass. Per-sample two-pass is fine for plain gene-level DE; it bites splicing analyses. - STAR's MAPQ is 255-for-unique and a multiplicity code, so it breaks GATK and a copied MAPQ filter deletes every multimapper. Unique reads get MAPQ 255 -- the SAM "mapping quality unavailable" value -- which GATK treats as missing and drops, the classic silently-empty RNA VCF; fix it at align time with
--outSAMmapqUnique 60. Multimappers get 3 / 1 / 0 for 2 / 3-4 / >=5 loci (pure locus count, no score information), so a genericsamtools view -q 10orfeatureCounts -Q 10copied from a DNA pipeline DELETES every multimapper while keeping every unique -- a directional "discard paralog/gene-family/rRNA/pseudogene reads" filter that biases against recently-duplicated gene families. A hard MAPQ filter is almost never what an RNA analysis wants. - STAR reports library strandedness for free in GeneCounts, and getting strand wrong roughly halves counts.
--quantMode GeneCountsemits a 4-columnReadsPerGene.out.tab: gene, unstranded, forward, reverse. Summing columns 3 and 4 reveals the protocol -- roughly equal is unstranded (use col 2), col 3 dominant is forward, col 4 dominant is reverse (the common dUTP/TruSeq case, use col 4). Never assume the strand: feeding the wrong column (or the wrong-sto a counter) sends sense reads to "no feature" and roughly halves the counts, distorting DE.
How STAR Places a Splice (the mechanism in brief)
STAR seeds with Maximal Mappable Prefix search on an uncompressed suffix array: it finds the longest exact prefix, and when that prefix ends (at a junction, mismatch, or read end) it restarts from the next base -- so a junction-spanning read naturally decomposes into seeds on two exons that STAR then stitches across the intron with an N (skipped-region) CIGAR. The stitch is scored with splice priors that penalize non-canonical motifs (--scoreGapNoncan -8), long introns (--scoreGenomicLengthLog2scale -0.25), and reward annotated junctions (--sjdbScore 2), plus a hard anchor floor (--alignSJoverhangMin): a novel junction on a tiny anchor carries almost no information, so it must clear both the soft score prior and the hard overhang minimum. This is why annotation (the sjdb) and two-pass matter for short-overhang and novel junctions.
Tool Taxonomy
| Mode / output | Citation | Mechanism / role | When |
|---|---|---|---|
STAR --runMode alignReads |
Dobin 2013 Bioinformatics 29:15 | MMP seed + stitch; spliced genomic BAM | the default RNA-to-genome alignment |
--twopassMode Basic |
Dobin 2013 (STAR manual) | discover novel junctions, re-align | novel-isoform / variant / splicing work (per-sample); pool for cohorts |
--quantMode GeneCounts |
STAR manual | per-gene counts + the strand-detection 3 columns | quick counts and strandedness inference |
--quantMode TranscriptomeSAM |
STAR manual | transcriptome-coord BAM for RSEM / Salmon-aln | isoform-level EM quantification -> rna-quantification |
--chimSegmentMin -> Chimeric.out.junction |
STAR manual | split-across-loci reads for fusion calling | STAR-Fusion / Arriba fusion detection |
STARsolo (--soloType) |
STAR manual | cell-barcode + UMI single-cell counting | scRNA-seq (route OUT) -> single-cell |
| HISAT2 | Kim 2019 Nat Biotechnol 37:907 | graph FM-index, ~1/4 the RAM | memory-constrained RNA (route OUT) -> hisat2-alignment |
| Salmon / kallisto | Patro 2017 Nat Methods 14:417 | alignment-free transcript quantification | DE on known transcripts only (route OUT) -> rna-quantification/alignment-free-quant |
Decision Tree by Scenario
| Scenario | Recommended | Why |
|---|---|---|
| RNA-seq, ample RAM (>=32 GB), need a genomic BAM | STAR | fastest splice-aware aligner; native counts, fusions, 2-pass |
| RNA variant calling | STAR 2-pass + --outSAMmapqUnique 60 then GATK SplitNCigarReads |
2-pass splices novel junctions; 60 avoids the 255 drop |
| Novel-isoform / splicing / sQTL across a cohort | cohort 2-pass (pool SJ.out.tab -> common --sjdbFileChrStartEnd) |
per-sample 2-pass is a junction batch effect |
| Fusion detection | --chimSegmentMin 12 --chimOutType ... -> STAR-Fusion / Arriba |
chimeric junctions are the fusion signal |
| Single-cell RNA | STARsolo (route OUT) | barcode+UMI counting -> single-cell |
| Memory-constrained (<32 GB) | route OUT to hisat2-alignment | STAR needs ~30 GB for human |
| DE on known transcripts only | route OUT to rna-quantification/alignment-free-quant | Salmon/kallisto are faster and model multimapping better |
| Small genome (bacterial/viral/plasmid) | STAR with reduced --genomeSAindexNbases |
the default 14 silently builds a bad index / segfaults |
Default when uncertain: STAR with --outSAMtype BAM SortedByCoordinate, --quantMode GeneCounts (to also read strandedness), and --twopassMode Basic for per-sample novel-junction work; set --outSAMmapqUnique 60 for any STAR -> GATK path.
Generate the Genome Index
# sjdbOverhang = max(readlength) - 1 (149 for 2x150 reads). Default 100 degrades junctions for short reads.
STAR --runMode genomeGenerate --runThreadN 8 \
--genomeDir star_index/ \
--genomeFastaFiles genome.fa \
--sjdbGTFfile annotation.gtf \
--sjdbOverhang 149
# Small genome (e.g. 5 Mb): add --genomeSAindexNbases <= min(14, log2(GenomeLength)/2 - 1), or STAR segfaults.
Basic Alignment
STAR --runThreadN 8 \
--genomeDir star_index/ \
--readFilesIn reads_1.fq.gz reads_2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate
samtools index sample_Aligned.sortedByCoord.out.bam
# STAR already coordinate-sorts -- a subsequent `samtools sort` is redundant. Single-end: one file in --readFilesIn.
Two-Pass + Gene Counts + Strandedness
STAR --runThreadN 8 --genomeDir star_index/ \
--readFilesIn r1.fq.gz r2.fq.gz --readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \
--quantMode GeneCounts \
--outSAMattrRGline ID:sample1 SM:sample1 PL:ILLUMINA LB:lib1 \
--outSAMmapqUnique 60 # so a downstream GATK RNA-variant step does not drop the 255 uniques
# STAR read groups use --outSAMattrRGline (SPACE-separated tags), NOT bwa's tab-delimited -R '@RG\t...'.
# GATK requires read groups; comma-with-spaces separates groups for multiple --readFilesIn files.
# Detect strandedness from ReadsPerGene.out.tab (skip the 4 N_* summary rows, sum cols 3 vs 4):
awk 'NR>4 {f+=$3; r+=$4} END {printf "fwd(col3)=%d rev(col4)=%d -> use col %s\n", f, r, (f>2*r?"3 fwd": r>2*f?"4 rev":"2 unstranded")}' sample_ReadsPerGene.out.tab
ENCODE Long-RNA-seq Parameter Set
STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn r1.fq.gz r2.fq.gz --readFilesCommand zcat \
--outFileNamePrefix sample_ --outSAMtype BAM SortedByCoordinate \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
--alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
--sjdbScore 1 --outSAMattributes NH HI AS NM MD
# BySJout keeps only reads whose junctions passed the dataset-wide collapse; multimapNmax 20 retains real
# multi-locus genes; the 0.04 mismatch ratio scales with read length; intron caps at ~1 Mb cover human genes.
Fusion Detection
# Chimeric junctions for STAR-Fusion (params per the STAR-Fusion wiki).
STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn r1.fq.gz r2.fq.gz --readFilesCommand zcat \
--outFileNamePrefix sample_ --outSAMtype BAM SortedByCoordinate \
--chimSegmentMin 12 --chimJunctionOverhangMin 12 --chimOutJunctionFormat 1 \
--chimOutType Junctions
# Arriba instead reads chimeric alignments from the BAM: --chimSegmentMin 10 --chimOutType WithinBAM SoftClip.
Key Parameters (STAR defaults unless noted)
| Parameter | Default | Description |
|---|---|---|
| --sjdbOverhang | 100 | junction-flank length at index build; set to readlength-1 |
| --twopassMode | None | Basic for per-sample novel-junction discovery |
| --outSAMmapqUnique | 255 | MAPQ for unique reads; set 60 for GATK |
| --outFilterMultimapNmax | 10 | max loci to report (ENCODE uses 20 for RNA) |
| --alignIntronMin / Max | 21 / 0 (auto) | gap < min is a deletion; cap Max ~1 Mb for human |
| --alignSJoverhangMin / SJDBoverhangMin | 5 / 3 | novel / annotated junction anchor floor (ENCODE 8 / 1) |
| --quantMode | -- | GeneCounts and/or TranscriptomeSAM |
| --chimSegmentMin | 0 (off) | turn on chimeric/fusion detection |
| --genomeSAindexNbases | 14 | reduce to min(14, log2(L)/2-1) for small genomes |
| --genomeLoad / --limitBAMsortRAM | NoSharedMemory / -- | shared-memory reuse; explicit sort RAM |
Per-Method Failure Modes
STAR 255 MAPQ into GATK
Trigger: STAR BAM (uniques at 255) fed to GATK RNA variant calling. Mechanism: GATK reads 255 as "MAPQ unavailable" and drops the read. Symptom: a silently empty or near-empty RNA VCF. Fix: --outSAMmapqUnique 60 at align time (then SplitNCigarReads in the GATK RNA workflow).
MAPQ filter deletes every multimapper
Trigger: a -q 10 / -Q 10 MAPQ filter (copied from DNA) on STAR output. Mechanism: STAR multimappers are MAPQ <= 3, uniques 255, so the filter keeps only uniques. **Symptom:** systematic under-counting of paralog/gene-family/rRNA/pseudogene loci; DE driven by multimapper fraction. **Fix:** do not MAPQ-filter RNA for counting; handle multimappers in the counter (NH-aware) or via EM -> rna-quantification/featurecounts-counting.
Per-sample two-pass as a batch effect
Trigger: --twopassMode Basic per sample for a splicing/junction comparison. Mechanism: each sample is aligned against its own novel-junction-augmented index. Symptom: junction recovery correlates with depth/condition, confounding sQTL/differential splicing. Fix: pool pass-1 SJ.out.tab across the cohort, filter, feed one --sjdbFileChrStartEnd to a uniform second pass.
sjdbOverhang mismatched to read length
Trigger: index built with default 100 for 36-50 bp reads, or a mismatch between index build and align values. Mechanism: junction flanks far longer than the reads degrade short-overhang sensitivity; a mismatch errors at align time. Symptom: reduced novel-junction-spanning sensitivity, or "present sjdbOverhang not equal to genome generation step." Fix: rebuild with sjdbOverhang = max(readlength)-1.
Small genome with default SAindexNbases
Trigger: indexing a bacterial/viral/plasmid genome with --genomeSAindexNbases 14. Mechanism: the SA pre-index string is too long for the genome. Symptom: STAR silently builds a bad index or segfaults at align time. Fix: set --genomeSAindexNbases min(14, log2(GenomeLength)/2 - 1) (1 Mb -> 9, 100 kb -> 7).
Index built with a different STAR version
Trigger: an index built months ago loaded by a bumped STAR module. Mechanism: STAR refuses an index whose versionGenome differs. Symptom: "Genome version is INCOMPATIBLE with running STAR version," or a subtly different version that loads and behaves differently across a cohort. Fix: rebuild the index with the exact aligning STAR version; pin the version for the whole cohort.
Quantitative Thresholds
| Threshold | Source | Rationale |
|---|---|---|
| sjdbOverhang = max(readlength) - 1 | STAR manual | sets the junction-flank length for short-overhang reads |
| --outSAMmapqUnique 60 for GATK | STAR manual / GATK RNA best practice | 255 is "unavailable" and is dropped by GATK |
| --outFilterMultimapNmax 20, --outFilterMismatchNoverReadLmax 0.04 | ENCODE long-RNA-seq pipeline | retains real multi-locus genes; mismatch budget scales with read length |
| --genomeSAindexNbases <= min(14, log2(L)/2 - 1) | STAR manual | the default 14 corrupts/segfaults small-genome indexes |
| STAR human-genome index RAM ~30 GB | STAR docs (approximate) | the reason to route memory-constrained jobs to HISAT2 |
| --chimSegmentMin 12 (STAR-Fusion) | STAR-Fusion wiki | minimum chimeric-segment length for fusion calling |
Common Errors
| Error / symptom | Cause | Solution |
|---|---|---|
| Empty / tiny RNA VCF after GATK | STAR 255 MAPQ dropped as "unavailable" | --outSAMmapqUnique 60 (and SplitNCigarReads) |
| GATK "no read group" on a STAR BAM | STAR omits @RG unless asked | add --outSAMattrRGline ID:.. SM:.. PL:ILLUMINA LB:.. (space-separated, not bwa's -R) |
| htseq-count silently miscounts STAR output | htseq wants name-sorted (or -r pos) input; STAR coordinate-sorts |
emit --outSAMtype BAM Unsorted for htseq, or samtools sort -n; featureCounts accepts either |
| Counts ~halved, antisense artifacts | wrong strandedness column / -s |
infer from GeneCounts cols 3 vs 4 (or RSeQC); TruSeq/dUTP = reverse |
| Paralog/gene-family genes under-counted | a -q 10 MAPQ filter deleted multimappers |
drop the MAPQ filter; handle NH>1 in the counter -> rna-quantification |
| "not enough memory for BAM sorting" | --limitBAMsortRAM too small |
set it explicitly (e.g. 10000000000) |
| Segfault / bad index on a small genome | --genomeSAindexNbases 14 |
reduce per min(14, log2(L)/2 - 1) |
| "Genome version INCOMPATIBLE" | index built with another STAR version | rebuild with the running version; pin it |
| 0 counts despite high mapping rate | genome/GTF contig-naming mismatch | reconcile chr1 vs 1, chrM vs MT (same source/release) -> alignment-files |
STAR reads .gz FASTQ as garbage / fails |
STAR does not auto-detect gzip | add --readFilesCommand zcat (bwa/bowtie2/HISAT2 auto-detect gzip) |
References
- Dobin A, Davis CA, Schlesinger F, et al. 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15-21.
- Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. 2019. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37:907-915.
- Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417-419.
- Burset M, Seledtsov IA, Solovyev VV. 2000. Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res 28:4364-4375.
Related Skills
- hisat2-alignment - Low-memory splice-aware alternative to STAR
- bwa-alignment - DNA short-read mapping (when reads do not cross junctions)
- read-qc/rnaseq-qc - RNA destination metrics: rRNA, gene-body coverage, strandedness
- read-qc/fastp-workflow - Trim adapters/poly-A before alignment
- alignment-files/bam-statistics - flagstat/idxstats QC gate; what a high mapping rate hides; contig naming
- rna-quantification/featurecounts-counting - Count aligned reads over genes (NH-aware multimapper handling)
- rna-quantification/alignment-free-quant - Salmon/kallisto when only known-transcript DE is needed
- differential-expression/deseq2-basics - Downstream DE from the count matrix
- single-cell/data-io - STARsolo single-cell counts into a single-cell workflow