name: bio-read-alignment-hisat2-alignment
description: Aligns RNA-seq reads to a genome with HISAT2, the splice-aware aligner whose hierarchical graph FM-index runs at roughly a quarter of STAR's memory (~7 GB for human), whose SNP/haplotype graph index reduces reference bias in the index itself, and whose MAPQ is GATK-friendly (60 for unique, no 255 problem). Use when RNA alignment must fit a memory-constrained machine, when feeding StringTie/Cufflinks transcript assembly via --dta, or when a SNP-aware graph index is wanted for allele-robust mapping. Feature-rich/high-RAM RNA alignment and fusion detection are star-alignment; DE on known transcripts only should skip alignment for rna-quantification/alignment-free-quant; the QC gate and contig-naming reconciliation are alignment-files; counting is rna-quantification.
tool_type: cli
primary_tool: HISAT2
Version Compatibility
Reference examples tested with: hisat2 2.2+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: <tool> --version then <tool> --help to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
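The version check can be scripted. This is a minimal sketch assuming a POSIX shell; report_version is a hypothetical helper name, not something shipped with HISAT2 or samtools. It prints the first line of each tool's --version output, or a notice when the tool is not on PATH.

```bash
# report_version is a hypothetical helper (not part of HISAT2 or samtools).
# It prints the first line of `<tool> --version`, or a notice when the tool is missing.
report_version() {
    tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        printf '%s: %s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
    else
        printf '%s: not found on PATH\n' "$tool"
    fi
}

report_version hisat2
report_version samtools
```

Compare the reported versions against the tested ones (hisat2 2.2+, samtools 1.19+) before copying flags from the examples.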
HISAT2 Alignment -- Graph-Indexed Spliced Mapping at a Quarter of STAR's Memory
"Align my RNA-seq reads with low memory" -> Map reads across exon-exon junctions with a hierarchical graph FM-index that fits a small machine -- because HISAT2 buys splice-aware alignment at ~7 GB instead of STAR's ~30 GB, its MAPQ is GATK-friendly, and its SNP-graph index can remove reference bias before a single read is mapped.
- CLI:
hisat2 -p 8 -x index -1 R1.fq.gz -2 R2.fq.gz | samtools sort -@4 -o aligned.bam -
Scope: low-memory RNA splice-aware mapping with HISAT2 -- index building (plain / annotation-aware / SNP-graph), strandedness, the --dta transcript-assembly mode, and manual two-pass. Contig naming and the QC gate -> alignment-files. Feature-rich/high-RAM RNA alignment, native gene counts, and fusion detection -> star-alignment. Counting reads over genes -> rna-quantification. DE without a BAM -> rna-quantification/alignment-free-quant. OUT OF SCOPE: DNA (bwa-alignment/bowtie2-alignment), long reads (long-read-sequencing/long-read-alignment), HLA typing (HISAT-genotype, a separate tool).
The Single Most Important Modern Insight
- The hierarchical graph FM-index is why HISAT2 exists: near-STAR spliced alignment at ~1/4 the RAM. HISAT2 uses one global FM-index to anchor a read plus ~55,000 small local graph FM-indexes (each ~56 kb), and extends a spliced read within the relevant local index rather than stitching genome-wide as STAR does. Most introns fit inside one local window, so spliced extension is a cheap local operation -- the resident human index is ~4-7 GB vs STAR's ~30 GB. That memory win is the reason to choose HISAT2; the cost is slightly lower novel-junction sensitivity than STAR two-pass and no native gene counts or fusion output.
- The SNP/haplotype graph index removes reference bias in the index, and the MAPQ is GATK-friendly. A hisat2-build --snp --haplotype (or the prebuilt grch38_snp index) encodes millions of known variants as alternate graph nodes, so a read carrying a known alt allele traverses the alt node with no mismatch penalty -- the bias that over-counts the reference allele is removed structurally, for all those sites at once, without a per-sample personalized reference. (Private/novel variants still cause bias, so rigorous ASE still needs WASP or a personalized reference.) HISAT2 also assigns unique reads MAPQ 60 (not STAR's 255), so its output goes into GATK without the reassignment STAR needs.
- --dta is for transcript assembly only, and using it for plain counting throws away reads. --dta raises the minimum anchor length required to report a de-novo spliced alignment, deliberately suppressing short-anchor junction reads -- because StringTie/Cufflinks cannot reliably assemble a transcript from a 3-5 bp anchor and such reads produce spurious isoforms. That trades junction sensitivity for assembly cleanliness, so --dta belongs only in a transcript-assembly pipeline; for plain gene counting it just discards usable junction reads. Strandedness (--rna-strandness RF for the common dUTP/TruSeq case) must also be set, or sense reads land in "no feature" and counts roughly halve.
How HISAT2 Splices (the mechanism in brief)
A read is seeded by the global FM-index, then the relevant ~56 kb local FM-index is selected and the read is extended across the junction within it: the unaligned remainder is anchored in the local index and extended by repeated FM-index extension. Because the spliced extension is a narrow, local operation rather than a genome-wide seed-cluster-stitch, HISAT2 needs far less RAM than STAR -- and evaluates a narrower set of candidate splice configurations, which is the source of both its speed/memory advantage and its slightly lower novel-junction sensitivity.
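The ~55,000 local-index figure is roughly what simple division gives. A back-of-envelope check, assuming a human genome of about 3.1 Gb (an assumed round number, not a value from the HISAT2 documentation) and treating the ~56 kb windows as non-overlapping:

```bash
genome_bp=3100000000   # assumed approximate human genome size
window_bp=56000        # approximate span of one local index, as quoted above
# POSIX shell integer arithmetic; the result is truncated, not rounded.
n_local=$(( genome_bp / window_bp ))
echo "approx. local indexes: $n_local"   # approx. local indexes: 55357
```

The exact count in a real index depends on the assembly and on how the windows are laid out, so this only confirms the order of magnitude.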
Tool Taxonomy
| Mode / index | Citation | Mechanism / role | When |
|---|---|---|---|
| hisat2-build (plain) | Kim 2019 Nat Biotechnol 37:907 | genome-only HGFM | quick index; junctions supplied at align time |
| hisat2-build --ss --exon | Kim 2019 | annotation-aware HGFM (better short-anchor placement) | when build RAM allows; or use prebuilt *_tran indexes |
| hisat2-build --snp --haplotype | Kim 2019 | SNP/haplotype graph (reference-bias reduction) | allele-robust mapping; the grch38_snp index |
| hisat2 (alignment run) | Kim 2019 | spliced alignment via local FM-index extension | the default RNA-to-genome mapping |
| --dta / --dta-cufflinks | HISAT2 manual | longer-anchor reporting for assemblers | StringTie / Cufflinks transcript assembly ONLY |
| manual two-pass (--novel-splicesite-*) | HISAT2 manual | discover then reuse novel junctions | novel-junction sensitivity (cohort: merge across samples) |
| STAR | Dobin 2013 Bioinformatics 29:15 | higher RAM, native counts, fusions, 2-pass | feature-rich RNA (route OUT) -> star-alignment |
| Salmon / kallisto | Patro 2017 Nat Methods 14:417 | alignment-free quantification | DE on known transcripts only (route OUT) -> rna-quantification/alignment-free-quant |
Decision Tree by Scenario
| Scenario | Recommended | Why |
|---|---|---|
| RNA-seq on a memory-constrained machine (<32 GB) | HISAT2 | ~7 GB graph index vs STAR's ~30 GB |
| StringTie/Cufflinks transcript assembly | HISAT2 --dta | longer-anchor reporting the assemblers need |
| Allele-robust mapping / known-variant-aware | HISAT2 SNP-graph index (grch38_snp) | alt-allele reads traverse graph nodes without penalty |
| RNA variant calling | HISAT2 (MAPQ 60) then GATK SplitNCigarReads | GATK-friendly MAPQ, no 255 reassignment |
| Need native gene counts, fusions, or top novel-junction sensitivity | route OUT to star-alignment | HISAT2 has no GeneCounts/chimeric output |
| DE on known transcripts only | route OUT to rna-quantification/alignment-free-quant | Salmon/kallisto are faster and model multimapping |
| Plain gene-level counting | HISAT2 without --dta | --dta discards short-anchor junction reads |
Default when uncertain: HISAT2 with --rna-strandness RF (verify the strand), streamed to a coordinate-sorted BAM; add --dta only for transcript assembly.
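The default rule can be written as a small flag composer. This is a sketch only: hisat2_flags is a hypothetical helper that prints the flags it would use and never runs hisat2, and the purpose labels "counting" and "assembly" are names chosen here for illustration.

```bash
# hisat2_flags is a hypothetical helper; it only prints flags, it does not run hisat2.
#   $1 = strandedness: RF, FR, R, F, or "unstranded"
#   $2 = downstream purpose: "counting" or "assembly"
hisat2_flags() {
    strand=$1
    purpose=$2
    flags=""
    case "$strand" in
        RF|FR|R|F) flags="--rna-strandness $strand" ;;
        unstranded) ;;                       # omit the flag for unstranded libraries
        *) echo "unknown strandedness: $strand" >&2; return 1 ;;
    esac
    case "$purpose" in
        assembly) flags="$flags --dta" ;;    # StringTie / Cufflinks only
        counting) ;;                         # plain counting: no --dta
        *) echo "unknown purpose: $purpose" >&2; return 1 ;;
    esac
    # Trim the leading space left when no strand flag was set.
    echo "$flags" | sed 's/^ *//'
}

hisat2_flags RF counting    # --rna-strandness RF
hisat2_flags RF assembly    # --rna-strandness RF --dta
```

Keeping --dta behind an explicit "assembly" choice makes it harder to carry the flag into a counting pipeline by copy-paste.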
Build Index
# Plain genome-only index (cheap; supply junctions at align time with --known-splicesite-infile).
hisat2-build -p 8 reference.fa hisat2_index
# Annotation-aware (better short-anchor placement). NOTE: a full human --ss --exon build needs a LOT of RAM;
# prefer the prebuilt grch38_tran / grch38_snp_tran indexes, or pass junctions at align time instead.
hisat2_extract_splice_sites.py annotation.gtf > splice_sites.txt
hisat2_extract_exons.py annotation.gtf > exons.txt
hisat2-build -p 8 --ss splice_sites.txt --exon exons.txt reference.fa hisat2_index
Basic Alignment with Strandedness
# RF = reverse-stranded (dUTP / Illumina TruSeq Stranded mRNA -- the common case). Verify, do not assume.
hisat2 -p 8 -x hisat2_index --rna-strandness RF \
--rg-id sample1 --rg SM:sample1 --rg PL:ILLUMINA \
-1 reads_1.fq.gz -2 reads_2.fq.gz \
--new-summary --summary-file sample.summary.txt | \
samtools sort -@ 4 -o aligned.sorted.bam -
samtools index aligned.sorted.bam
# Single-end stranded: --rna-strandness R (reverse) or F (forward). Unstranded: omit the flag.
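The file written by --summary-file can be checked from a script. A minimal sketch, assuming the summary contains a line with the words "overall alignment rate" followed or preceded by a percentage; overall_rate is a hypothetical helper, and the sample text below is illustrative, not output from a real run.

```bash
# overall_rate is a hypothetical helper that extracts the percentage from a HISAT2 summary file.
# It matches the "overall alignment rate" line case-insensitively, so it should cope with
# both the default and the --new-summary layouts (an assumption; check against your own file).
overall_rate() {
    awk 'tolower($0) ~ /overall alignment rate/ {
             if (match($0, /[0-9.]+%/)) print substr($0, RSTART, RLENGTH - 1)
         }' "$1"
}

# Illustrative summary fragment with made-up numbers.
demo=$(mktemp)
cat > "$demo" <<'EOF'
HISAT2 summary stats:
    Total pairs: 1000
    Overall alignment rate: 95.20%
EOF
overall_rate "$demo"    # prints 95.20
rm -f "$demo"
```

A high rate here does not confirm correct strandedness or contig naming; those are checked separately.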
For StringTie / Cufflinks (transcript assembly)
# --dta reports longer anchors the assemblers need; use ONLY for assembly, not for plain counting.
hisat2 -p 8 -x hisat2_index --rna-strandness RF --dta \
-1 r1.fq.gz -2 r2.fq.gz | samtools sort -@ 4 -o aligned.bam -
Manual Two-Pass (cohort novel-junction discovery)
# Pass 1: discover novel junctions per sample.
for r1 in *_R1.fq.gz; do
base=$(basename "$r1" _R1.fq.gz); r2=${r1/_R1/_R2}
hisat2 -p 8 -x hisat2_index --novel-splicesite-outfile "${base}.novel.txt" \
-1 "$r1" -2 "$r2" -S /dev/null
done
# Merge across the cohort so every sample sees the same junction set (avoids a per-sample junction batch effect).
# The merged file name must not match *.novel.txt, or a rerun would fold the stale merged set back in.
sort -u *.novel.txt > cohort_novel_splicesites.txt
# Pass 2: re-align every sample with the shared novel-junction set.
for r1 in *_R1.fq.gz; do
base=$(basename "$r1" _R1.fq.gz); r2=${r1/_R1/_R2}
hisat2 -p 8 -x hisat2_index --rna-strandness RF --novel-splicesite-infile cohort_novel_splicesites.txt \
-1 "$r1" -2 "$r2" | samtools sort -@ 4 -o "${base}.bam" -
done
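Both loops above derive the R2 path from the R1 path, so a missing mate or a sample name that itself contains _R1 can derail a cohort run. A pre-flight check could look like the sketch below; mate_path is a hypothetical helper, and it assumes the *_R1.fq.gz / *_R2.fq.gz naming used in the loops.

```bash
# mate_path is a hypothetical helper: it derives the R2 path from an R1 path.
# The substitution is anchored to the file suffix, so an earlier "_R1" inside
# the sample name is left alone.
mate_path() {
    printf '%s\n' "$1" | sed 's/_R1\.fq\.gz$/_R2.fq.gz/'
}

for r1 in *_R1.fq.gz; do
    [ -e "$r1" ] || continue            # the glob matched nothing
    r2=$(mate_path "$r1")
    if [ ! -s "$r2" ]; then
        echo "missing or empty mate for $r1 (expected $r2)" >&2
    fi
done
```

Running this before pass 1 surfaces pairing problems up front, before any alignment time is spent.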
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| -x | -- | index BASENAME |
| -1 / -2 / -U | -- | paired / single-end reads |
| --rna-strandness | unstranded | FR / RF / F / R (dUTP/TruSeq = RF / R) |
| --dta / --dta-cufflinks | off | longer anchors for StringTie / Cufflinks (assembly only) |
| --known-splicesite-infile | -- | supply junctions at align time (cheap-index alternative to --ss build) |
| --novel-splicesite-outfile / -infile | -- | manual two-pass |
| --max-intronlen | 500000 | shorter than STAR's effective ~1 Mb; raise for long-intron genes |
| -k | 5 (HFM) / 10 (HGFM) | max alignments reported per read |
| --no-softclip / --no-spliced-alignment | off | force end-to-end / disable splicing (DNA mode) |
Per-Method Failure Modes
--dta used for plain counting
Trigger: --dta on a run whose downstream is featureCounts/htseq, not StringTie. Mechanism: --dta suppresses short-anchor junction reads. Symptom: lower junction-read recovery and counts than a non-dta run. Fix: drop --dta for counting; keep it only for transcript assembly.
Wrong strandedness
Trigger: omitting or mis-setting --rna-strandness. Mechanism: the XS strand tag is mislabeled and sense reads are assigned to "no feature." Symptom: counts ~halved; StringTie builds transcripts on the wrong strand. Fix: infer strand (RSeQC infer_experiment.py, or STAR GeneCounts) and set RF for dUTP/TruSeq.
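For paired-end data, RSeQC infer_experiment.py reports two fractions, and they map onto --rna-strandness as sketched below. strand_flag is a hypothetical helper, and the 0.8 cutoff is an assumed default rather than a value from RSeQC or HISAT2; fractions in the ambiguous middle range deserve a look at the library-prep records.

```bash
# strand_flag is a hypothetical helper. Arguments are the two fractions reported by
# RSeQC infer_experiment.py for paired-end data:
#   $1 = fraction explained by "1++,1--,2+-,2-+"  (forward-stranded, FR)
#   $2 = fraction explained by "1+-,1-+,2++,2--"  (reverse-stranded, RF; dUTP/TruSeq)
#   $3 = optional cutoff; 0.8 is an assumed default
strand_flag() {
    awk -v fr="$1" -v rf="$2" -v thr="${3:-0.8}" 'BEGIN {
        if (rf >= thr)      print "RF"
        else if (fr >= thr) print "FR"
        else                print "unstranded"
    }'
}

strand_flag 0.0118 0.9710    # RF
strand_flag 0.48 0.50        # unstranded
```

When the helper prints "unstranded", omit --rna-strandness entirely, as in the alignment example.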
--ss --exon human build runs out of RAM
Trigger: a full human annotation-aware build on a small machine. Mechanism: building the annotation-aware HGFM needs far more RAM than a plain build. Symptom: the build is killed (OOM). Fix: use a prebuilt grch38_tran/grch38_snp_tran index, or build plain and pass junctions at align time via --known-splicesite-infile.
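The choice between the two fixes can be made explicit before starting a build. In this sketch choose_index_strategy is a hypothetical helper, and the default requirement of 200 GB is an assumption to adjust for your genome and annotation; the helper only prints a recommendation.

```bash
# choose_index_strategy is a hypothetical helper; it prints a recommendation only.
#   $1 = RAM available for the build, in GB (integer)
#   $2 = RAM the annotation-aware build is expected to need, in GB
#        (default 200 is an assumption, not a measured value)
choose_index_strategy() {
    avail_gb=$1
    need_gb=${2:-200}
    if [ "$avail_gb" -ge "$need_gb" ]; then
        echo "hisat2-build --ss splice_sites.txt --exon exons.txt"
    else
        echo "plain hisat2-build, then --known-splicesite-infile splice_sites.txt at align time"
    fi
}

choose_index_strategy 32
```

On a 32 GB machine this prints the plain-build recommendation, which matches the fix described above.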
max-intronlen too small for long-intron genes
Trigger: the default --max-intronlen 500000 on genes with introns near or above ~1 Mb. Mechanism: junctions longer than the cap are not formed. Symptom: long-gene junction reads soft-clipped or mismapped. Fix: raise --max-intronlen for organisms/genes with very long introns.
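Whether the annotation contains introns above the cap can be checked from the splice-site file produced by hisat2_extract_splice_sites.py. The sketch assumes that file is tab-separated (chromosome, left, right, strand) with left and right being the 0-based positions of the exon bases flanking the intron, so the intron length is right - left - 1; long_introns is a hypothetical helper and the demo coordinates are made up.

```bash
# long_introns is a hypothetical helper.
#   $1 = splice-site file (chrom, left, right, strand; tab-separated)
#   $2 = cap to test against (default 500000, the --max-intronlen default)
long_introns() {
    awk -F '\t' -v cap="${2:-500000}" '
        { len = $3 - $2 - 1
          if (len > max) max = len
          if (len > cap) n++ }
        END { printf "max_intron=%d over_cap=%d\n", max, n }' "$1"
}

# Made-up coordinates: one 500 bp intron and one 800,000 bp intron.
demo=$(mktemp)
printf 'chr1\t999\t1500\t+\nchr1\t1999\t802000\t+\n' > "$demo"
long_introns "$demo"    # prints max_intron=800000 over_cap=1
rm -f "$demo"
```

A non-zero over_cap suggests setting --max-intronlen at or above the reported maximum for that annotation.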
Genome/GTF contig-naming mismatch
Trigger: the BAM uses chr1/chrM but the counting GTF uses 1/MT. Mechanism: no overlapping features. Symptom: zero counts despite a high alignment rate. Fix: reconcile naming (same source/release) -> alignment-files.
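A quick overlap count between the BAM's contig names and the GTF's catches this before counting. shared_contigs below is a hypothetical helper that takes two plain lists of names; the commands in its comment show one way such lists might be produced.

```bash
# shared_contigs is a hypothetical helper that counts names present in both lists.
#   $1, $2 = files with one contig name per line, e.g. produced by
#     samtools idxstats aligned.sorted.bam | cut -f1 | grep -v '^\*$' > bam_contigs.txt
#     grep -v '^#' annotation.gtf | cut -f1 | sort -u > gtf_contigs.txt
shared_contigs() {
    a=$(mktemp); b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -12 "$a" "$b" | wc -l | tr -d ' '
    rm -f "$a" "$b"
}

# Toy lists reproducing the chr1/chrM versus 1/MT mismatch.
bam_list=$(mktemp); gtf_list=$(mktemp)
printf 'chr1\nchrM\n' > "$bam_list"
printf '1\nMT\n' > "$gtf_list"
shared_contigs "$bam_list" "$gtf_list"    # prints 0
rm -f "$bam_list" "$gtf_list"
```

An overlap of 0 means no read can be assigned to any feature, which explains zero counts despite a high alignment rate.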
Quantitative Thresholds
| Threshold | Source | Rationale |
|---|---|---|
| HISAT2 human graph index RAM ~4.3 GB plain / ~6.7 GB SNP | Kim 2019 (approximate) | the ~1/4-of-STAR footprint that motivates choosing HISAT2 |
| --max-intronlen 500000 default | HISAT2 manual | shorter than STAR's ~1 Mb; raise for long-intron genes |
| --rna-strandness RF for dUTP/TruSeq | library-prep chemistry | the overwhelmingly common stranded protocol |
| unique-read MAPQ 60 (since v2.0.4) | HISAT2 manual / changelog | GATK-friendly; no 255 reassignment needed |
| -k 5 (HFM) / 10 (HGFM) | HISAT2 manual | max reported alignments differs by index type |
Common Errors
| Error / symptom | Cause | Solution |
|---|---|---|
| Counts ~halved, wrong-strand transcripts | missing/incorrect --rna-strandness | infer strand; set RF for dUTP/TruSeq |
| Lower counts than expected | --dta used for plain counting | drop --dta unless assembling transcripts |
| --ss --exon build killed (OOM) | full human annotation-aware build | use a prebuilt index or --known-splicesite-infile at align time |
| Long-gene junction reads clipped | --max-intronlen too small | raise it for long-intron genes |
| 0 counts despite high alignment rate | genome/GTF contig-naming mismatch | reconcile chr1 vs 1 (same source/release) -> alignment-files |
| "Could not locate a HISAT2 index" | -x given a .ht2 file | pass the index basename |
| htseq-count miscounts HISAT2 output | htseq wants name-sorted input | pipe to samtools sort -n for htseq; featureCounts accepts coordinate order |
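For the "Could not locate a HISAT2 index" row, the basename is the index file path with its numbered .ht2 suffix removed. A small sketch, where index_basename is a hypothetical helper and the path is illustrative:

```bash
# index_basename is a hypothetical helper: it strips the .N.ht2 suffix (or .N.ht2l,
# used by large indexes) so the result can be passed to hisat2 -x.
index_basename() {
    printf '%s\n' "$1" | sed -E 's/\.[0-9]+\.ht2l?$//'
}

index_basename /refs/grch38/genome.1.ht2    # /refs/grch38/genome
```

A path that is already a basename passes through unchanged, so the helper is safe to apply unconditionally.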
References
- Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. 2019. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37:907-915.
- Kim D, Langmead B, Salzberg SL. 2015. HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12:357-360.
- Dobin A, Davis CA, Schlesinger F, et al. 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15-21.
- Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417-419.
Related Skills
- star-alignment - Feature-rich, higher-RAM splice-aware alternative (native counts, fusions)
- bwa-alignment - DNA short-read mapping (when reads do not cross junctions)
- read-qc/rnaseq-qc - RNA destination metrics: rRNA, gene-body coverage, strandedness
- read-qc/fastp-workflow - Trim adapters/poly-A before alignment
- alignment-files/bam-statistics - flagstat/idxstats QC gate; what a high mapping rate hides; contig naming
- rna-quantification/featurecounts-counting - Count aligned reads over genes
- rna-quantification/alignment-free-quant - Salmon/kallisto when only known-transcript DE is needed
- differential-expression/deseq2-basics - Downstream DE from the count matrix