name: bio-long-read-sequencing-basecalling
description: Basecalls raw Oxford Nanopore signal (POD5/FAST5) into reads with Dorado, choosing the chemistry-matched model and accuracy tier (fast/hac/sup), requesting modified bases (5mCG_5hmCG, 6mA, m6A) at basecall time, and handling duplex, demultiplexing, trimming, and HERRO read correction. Covers why the model+version is an irreversible analysis decision, why methylation cannot be recovered later, and why downstream polish/variant models must match the basecaller. Use when converting POD5/FAST5 to reads, picking a Dorado model for R9/R10 or RNA004, enabling methylation calling, basecalling duplex, demultiplexing barcoded runs, or correcting reads for assembly.
tool_type: cli
primary_tool: dorado
Version Compatibility
Reference examples tested with: Dorado 1.0+, pod5 0.3+, samtools 1.19+, chopper 0.7+.
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
Results depend on inputs that outlive the binary version - record them:
- The basecaller MODEL string (e.g. `dna_r10.4.1_e8.2_400bps_sup@v5.2.0`) sets the entire error profile and must be propagated to every downstream tool. Pin it.
- Modified-base models carry a SECOND version (`..._sup@v5.0.0_5mCG_5hmCG@v3`); the mod version can lag the simplex version - check `dorado download --list`.
- R9.4.1 and RNA002 models were removed from Dorado v1.0 defaults; legacy data needs an archived model path.
If code throws an error, introspect the installed tool (dorado --help, dorado basecaller --help) and adapt the example to the actual API rather than retrying.
Nanopore Basecalling
"Basecall my Nanopore data" -> Convert raw signal (POD5) into reads with Dorado using the chemistry-matched model, deciding the accuracy tier and whether to call modifications now - because the model choice is baked irreversibly into the output.
- CLI: `dorado basecaller sup pod5s/ > calls.bam` (simplex), `dorado basecaller sup,5mCG_5hmCG pod5s/ > calls.bam` (with methylation), `dorado duplex sup pod5s/ > duplex.bam` (duplex)
PacBio note: PacBio "basecalling" (CCS -> HiFi reads) runs on-instrument/in SMRT Link; users receive HiFi BAMs already at Q20-Q30+. This skill is Oxford Nanopore / Dorado. HiFi assembly lives in genome-assembly/hifi-assembly.
The Single Most Important Modern Insight -- There Is No "The Reads," Only "The Reads As Called By This Model"
Basecalling is not fixed preprocessing that yields a neutral FASTQ. The model and version chosen are an analysis decision written permanently into the BAM, with three consequences a naive user misses:
- Methylation is a basecalling decision, not a later analysis step. Modified bases are inferred from raw signal at basecall time by Remora models and emitted as MM/ML tags. A plain BAM/FASTQ with no MM/ML tags has thrown the signal away - mods CANNOT be recovered without re-basecalling from POD5. If methylation might ever matter, request it now (`sup,5mCG_5hmCG`) and KEEP the POD5. See nanopore-methylation.
- Downstream polish/variant models must match the basecaller model+version. medaka and Clair3 ship per-model weights (Clair3 `r1041_e82_400bps_sup_v500`; medaka the dotted `r1041_e82_400bps_sup_v5.2.0`). A mismatched model silently degrades accuracy with no error. Propagate the basecaller model name to every downstream step.
- Mixing model versions across a cohort is a batch effect. Different model versions have different identity and homopolymer-indel error profiles. Re-basecall the WHOLE cohort with ONE current model before joint or differential analysis.
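The two downstream name styles quoted above differ from the Dorado string only in punctuation, so the propagation step can be sketched as a string rewrite. This is a minimal sketch: the function names are hypothetical, and the rewrite rules are an assumption inferred from the two examples in this section, so the resulting name still has to be checked against the models actually shipped with the installed medaka or Clair3.

```shell
#!/bin/sh
# Hypothetical helpers: rewrite a Dorado simplex model string into the
# medaka-style and Clair3-style spellings quoted in this section.
# Assumption: the rules are inferred from those two examples only; confirm the
# resulting name exists in the installed medaka/Clair3 before using it.

dorado_to_medaka_name() {
    # dna_r10.4.1_e8.2_400bps_sup@v5.2.0 -> r1041_e82_400bps_sup_v5.2.0
    printf '%s\n' "$1" | sed \
        -e 's/^dna_//' \
        -e 's/^r\([0-9]*\)\.\([0-9]*\)\.\([0-9]*\)_e\([0-9]*\)\.\([0-9]*\)_/r\1\2\3_e\4\5_/' \
        -e 's/@/_/'
}

dorado_to_clair3_name() {
    # Same rewrite, then drop the dots left in the version: v5.0.0 -> v500
    dorado_to_medaka_name "$1" | sed 's/\.//g'
}

dorado_to_medaka_name 'dna_r10.4.1_e8.2_400bps_sup@v5.2.0'   # r1041_e82_400bps_sup_v5.2.0
dorado_to_clair3_name 'dna_r10.4.1_e8.2_400bps_sup@v5.0.0'   # r1041_e82_400bps_sup_v500
```

Deriving the name from the recorded basecaller string, rather than typing it by hand, keeps the downstream step tied to what was actually run.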
Dorado Subcommand Taxonomy
Dorado (one GPU-first executable) replaced Guppy, which is end-of-life. Bonito is ONT's research/training basecaller (not production); Rerio hosts research-release models (niche mods, bacterial methylation).
| Subcommand | Purpose | Canonical invocation |
|---|---|---|
| `basecaller` | simplex basecalling | `dorado basecaller hac pod5s/ > calls.bam` |
| `duplex` | template+complement duplex | `dorado duplex sup pod5s/ > duplex.bam` |
| `demux` | barcode classification/split | `dorado demux --kit-name SQK-NBD114-24 --output-dir out/ calls.bam` |
| `trim` | standalone adapter/primer trim | `dorado trim reads.bam > trimmed.bam` |
| `aligner` | minimap2 alignment (carries MM/ML) | `dorado aligner ref.mmi reads.bam > aln.bam` |
| `correct` | HERRO single-read correction | `dorado correct reads.fastq > corrected.fasta` |
| `summary` | sequencing-summary TSV from BAM | `dorado summary calls.bam > summary.tsv` |
| `download` | model management | `dorado download --model <name>` / `--list` |
Model Naming Scheme (load-bearing)
Format `{analyte}_{pore}_{chemistry}_{speed}_{tier}@v{ver}` + optional mod suffix, e.g. `dna_r10.4.1_e8.2_400bps_sup@v5.2.0_5mCG_5hmCG@v3`.
| Token | Meaning | Examples |
|---|---|---|
| analyte | molecule | dna, rna004 |
| pore | flow-cell generation | r10.4.1 (current), r9.4.1 (legacy) |
| chemistry | kit chemistry | e8.2 (Kit 14) |
| speed | translocation speed -> sampling rate | 400bps (5 kHz DNA), 130bps (RNA004, 4 kHz) |
| tier | model size/accuracy | fast, hac, sup |
| version | model version | @v4.3.0, @v5.2.0, @v6.0.0 |
Passing the bare tier (sup) lets Dorado auto-detect chemistry from POD5 metadata and fetch the matching latest model; pin a version (sup@v5.2.0) or a full path for reproducibility. Append mods comma-separated (sup,5mCG_5hmCG,6mA); only one mod model per canonical base may be active.
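Because the model string is what gets propagated downstream, it helps to be able to take it apart mechanically. The sketch below splits a full name into the simplex model, the mod suffix, and the tier token from the table above. The function names are hypothetical and the parsing is plain string handling; it does not consult Dorado or download anything.

```shell
#!/bin/sh
# Hypothetical helpers that split a full Dorado model string into the parts
# described in the naming table. Pure string handling; nothing is downloaded.

simplex_model() {
    # Keep everything up to and including the first @v<version>.
    printf '%s\n' "$1" | sed 's/^\([^@]*@v[0-9.]*\).*$/\1/'
}

mod_suffix() {
    # Whatever follows "<simplex model>_"; prints an empty line if no mod model.
    printf '%s\n' "$1" | sed -e 's/^[^@]*@v[0-9.]*//' -e 's/^_//'
}

model_tier() {
    # The tier is the last underscore-separated token before the first "@".
    base=${1%%@*}
    printf '%s\n' "${base##*_}"
}

full='dna_r10.4.1_e8.2_400bps_sup@v5.2.0_5mCG_5hmCG@v3'
simplex_model "$full"   # dna_r10.4.1_e8.2_400bps_sup@v5.2.0
mod_suffix "$full"      # 5mCG_5hmCG@v3
model_tier "$full"      # sup
```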
Decision Tree by Scenario
| Scenario | Recommended | Why |
|---|---|---|
| Any analysis (variant/assembly/methylation) | `sup` + matched model, pinned version | fast/hac error profile leaks into calls |
| Live run / adaptive sampling / quick QC only | `fast` | speed; never for downstream analysis |
| Routine work, compute-limited | `hac` | strong accuracy/compute balance (v5.2 closed much of the gap to sup) |
| Methylation wanted now or maybe later | `sup,5mCG_5hmCG` (DNA), keep POD5 | mods are unrecoverable from a plain BAM -> nanopore-methylation |
| Per-molecule accuracy, low input, phasing | `dorado duplex sup` | ~Q30 reads, but expect <10% duplex yield |
| Diploid/phased T2T assembly from simplex | `dorado correct` (HERRO) before assembler | haplotype-aware Q22->Q40 -> genome-assembly/long-read-assembly |
| Barcoded multiplexed run | basecall `--no-trim`, then `dorado demux` | trimming first strips barcodes before demux sees them |
| Legacy R9.4.1 / RNA002 data | explicit archived model path | removed from Dorado v1.0 default downloads |
| PacBio data | already HiFi; no Dorado step | CCS runs on-instrument -> genome-assembly/hifi-assembly |
Core Commands
# Simplex, super-accuracy, auto-detected chemistry-matched model (BAM is the default output)
dorado basecaller sup pod5s/ > calls.bam
# Pin the model version for reproducibility
dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v5.2.0 pod5s/ > calls.bam
# Call methylation AT basecall time (CpG 5mC + 5hmC); KEEP pod5s/ - mods are unrecoverable later
dorado basecaller sup,5mCG_5hmCG pod5s/ > calls.bam
dorado basecaller sup,6mA pod5s/ > calls.bam # all-context 6mA
# RNA004 direct RNA (cDNA CANNOT call mods - PCR erases the signal):
dorado basecaller rna004_130bps_sup@v5.1.0,m6A_DRACH pod5s/ > rna_mods.bam
# FASTQ output and a per-read quality floor (relative filter, not a calibrated accuracy)
dorado basecaller sup pod5s/ --emit-fastq --min-qscore 10 > calls.fastq
# Duplex (needs raw POD5; cannot be recovered from simplex FASTQ); dx tag marks read types
dorado duplex sup pod5s/ > duplex.bam
# Demultiplex: basecall WITHOUT trimming, then demux (demux trims barcodes itself)
dorado basecaller sup pod5s/ --no-trim > calls.bam
dorado demux --kit-name SQK-NBD114-24 --output-dir demux/ calls.bam
dorado demux --kit-name SQK-NBD114-24 --barcode-both-ends --output-dir demux/ calls.bam # stringent
# HERRO read correction for diploid/phased assembly (input FASTQ of HAC/SUP R10 reads >=10kb -> FASTA)
dorado download --model herro-v1
dorado correct reads.fastq > corrected.fasta
POD5 is ONT's default raw format (faster random access than FAST5). Convert FAST5 first:
pod5 convert fast5 raw/*.fast5 --output pod5s/ # FAST5 is legacy; basecalling it directly is slow
pod5 view pod5s/ # summary table (replaces deprecated `pod5 inspect reads`)
pod5 merge pod5s/*.pod5 --output merged.pod5
Per-Method Failure Modes
Methylation gone forever
Trigger: basecalling without a mod model, then wanting 5mC later. Mechanism: Remora infers mods from raw signal at basecall time; a plain BAM has only bases. Symptom: no MM/ML tags; modkit pileup returns nothing. Fix: re-basecall from POD5 with sup,5mCG_5hmCG; keep POD5 archives.
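The symptom above can be checked before any pileup is attempted. The sketch below counts how many SAM records carry an MM tag; the function name is hypothetical, and it reads SAM text on stdin so it can be fed from `samtools view` (assumed to be installed).

```shell
#!/bin/sh
# Hypothetical check: read SAM records on stdin and report how many carry an
# MM tag. Header lines (starting with "@") are skipped.
# Typical use (assumes samtools is installed):
#   samtools view calls.bam | head -n 10000 | count_mm_tagged

count_mm_tagged() {
    awk -F '\t' '
        /^@/ { next }
        {
            total++
            # Optional tags start at field 12 in a SAM record.
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^MM:Z:/) { tagged++; break }
            }
        }
        END { printf "%d/%d records carry MM\n", tagged, total }
    '
}
```

A result of zero tagged records on a sample of reads means the BAM holds bases only and the POD5 is needed again.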
Barcodes land in unclassified
Trigger: default --trim all basecall, then a separate dorado demux. Mechanism: trimming removes the barcode before demux can read it. Symptom: most reads in unclassified.bam, low classification rate. Fix: basecall --no-trim, then demux (it trims barcodes itself).
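One way to catch this before running demux is to inspect how the BAM was basecalled. The guard below is a rough sketch: the function name is hypothetical, and it assumes the basecaller recorded its command line in the `CL` field of an `@PG` header line, which should be confirmed on a real header before relying on it.

```shell
#!/bin/sh
# Hypothetical pre-demux guard: read a SAM header on stdin and succeed only if
# a basecaller @PG line shows --no-trim on its recorded command line.
# Assumption: the basecaller recorded its command line in the @PG CL field;
# confirm with `samtools view -H calls.bam` before relying on this.
# Typical use:
#   samtools view -H calls.bam | basecalled_without_trim || echo "re-basecall with --no-trim"

basecalled_without_trim() {
    grep '^@PG' | grep 'basecaller' | grep -q -- '--no-trim'
}
```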
Silent accuracy loss downstream
Trigger: polishing/calling with a medaka/Clair3 model that doesn't match the basecaller model+version. Mechanism: per-model neural weights expect a specific error profile. Symptom: no error, just quietly worse consensus/calls. Fix: propagate the basecaller model name; use medaka tools resolve_model --auto_model; pick the matching Clair3 model dir.
Duplex double-counting
Trigger: treating every read in a duplex BAM as an independent molecule. Mechanism: a simplex parent and its duplex offspring both appear. Symptom: inflated coverage/allele counts. Fix: the dx:i:-1 tag marks simplex parents of duplex reads - filter them when counting molecules (dx:i:1 = duplex, dx:i:0 = simplex-only).
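The filter described in the fix can be sketched as a small stream filter over SAM text. The function name is hypothetical; the tag values are the ones listed above, and `samtools` is assumed to be available for converting to and from BAM.

```shell
#!/bin/sh
# Hypothetical filter: pass SAM text through, dropping simplex parents of
# duplex reads (dx:i:-1) so each molecule is counted once.
# Typical use (assumes samtools is installed):
#   samtools view -h duplex.bam | drop_duplex_parents | samtools view -b -o molecules.bam -

drop_duplex_parents() {
    awk -F '\t' '
        /^@/ { print; next }                # keep header lines untouched
        {
            for (i = 12; i <= NF; i++) {
                if ($i == "dx:i:-1") next   # parent of a duplex read: skip
            }
            print                           # dx:i:1, dx:i:0, or untagged
        }
    '
}
```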
Cohort batch effect
Trigger: runs basecalled with different model versions joined for analysis. Mechanism: version-specific identity/indel error profiles confound a technical batch with biology. Symptom: spurious between-run differences. Fix: re-basecall the whole cohort with one model version.
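Before a joint analysis, the model versions present in a cohort can be listed from the BAM headers. This is a sketch under a stated assumption: that the basecaller wrote `basecall_model=<name>` into the `DS` field of each `@RG` line. The function name is hypothetical, and the header layout should be verified on one real file first.

```shell
#!/bin/sh
# Hypothetical cohort check: read one or more SAM headers on stdin and list the
# distinct basecall models found in @RG lines.
# Assumption: the basecaller wrote "basecall_model=<name>" into the @RG DS field;
# verify this against a real header (samtools view -H calls.bam) first.
# Typical use:
#   for b in cohort/*.bam; do samtools view -H "$b"; done | list_basecall_models

list_basecall_models() {
    # Split @RG lines on tabs and spaces, keep the value after basecall_model=.
    grep '^@RG' | tr '\t ' '\n\n' | sed -n 's/^.*basecall_model=//p' | sort -u
}
```

More than one line of output means the cohort mixes model versions and should be re-basecalled with a single model.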
Quantitative Thresholds
| Threshold | Source | Rationale |
|---|---|---|
| `sup` for any analysis | ONT model guidance | fast/hac error profiles contaminate variant/assembly/methylation calls |
| R10.4.1 SUP modal accuracy ~Q20 (99%) | Sereika 2022 | dual-reader head fixes homopolymers; enables nanopore-only near-finished genomes |
| Duplex read ~Q30; yield typically <10% of reads | community benchmarks | duplex is library-prep/loading-limited, not free accuracy |
| A "Q20" base errs at ~Q12.5 empirically | Delahaye 2021 | nanopore qscores >Q10 are overconfident posteriors; use for relative filtering only |
| HERRO input reads >=10 kbp, HAC/SUP R10 | Dorado correct docs | HERRO operates on 4096-bp chunks; shorter reads dropped |
| `--min-qscore 10` as a permissive QC floor | convention | Q10 ~ 90% nominal; a starting filter, not a hard rule |
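The qscore rows above are easier to compare once converted to error probabilities with the standard Phred relation P = 10^(-Q/10). A minimal sketch of the arithmetic (the function name is illustrative):

```shell
#!/bin/sh
# Convert a Phred-scaled qscore to its nominal per-base error probability,
# P = 10^(-Q/10). Function name is illustrative.

q_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", exp(-q / 10 * log(10)) }'
}

q_to_error 10     # 0.1000  nominal error at the --min-qscore 10 floor
q_to_error 20     # 0.0100  what a "Q20" label claims
q_to_error 12.5   # 0.0562  the empirical figure quoted in the table
```

Read this way, the table's Q20 versus Q12.5 row is a claimed 1% error rate against roughly 5.6% observed, which is why qscores are treated here as a relative filter.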
Common Errors
| Error / symptom | Cause | Solution |
|---|---|---|
| "Failed to determine sequencing chemistry from data" | R9/RNA002 or non-standard kit; bare tier can't auto-resolve | pass an explicit model path; for legacy chemistry use an archived model |
| No MM/ML tags in BAM | basecalled without a mod model | re-basecall from POD5 with sup,5mCG_5hmCG |
| Most reads unclassified after demux | trimmed before demux | basecall `--no-trim`, then demux |
| `--model sup` errors | model is the positional arg, not a flag | `dorado basecaller sup pod5s/` |
| `dorado correct reads.bam` fails | input is FASTQ(.gz), output FASTA | `dorado correct reads.fastq > corrected.fasta` |
| Out of GPU memory | batch too large for VRAM (sup is heaviest) | lower --batchsize; or drop to hac |
| cDNA m6A calling returns nothing | PCR erased native modifications | use direct RNA (RNA004), not cDNA |
References
- Sereika M, Kirkegaard RH, Karst SM, et al. 2022. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat Methods 19:823-826.
- Stanojević D, Lin D, Nurk S, Florez de Sessions P, Šikić M. 2026. Telomere-to-telomere assembly using HERRO-corrected Nanopore simplex reads. Nature (online ahead of print). DOI 10.1038/s41586-026-10563-y.
- Wick RR, Judd LM, Holt KE. 2019. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol 20:129.
- Pagès-Gallego M, de Ridder J. 2023. Comprehensive benchmark and architectural analysis of deep learning models for nanopore sequencing basecalling. Genome Biol 24:71.
- Delahaye C, Nicolas J. 2021. Sequencing DNA with nanopores: troubles and biases. PLoS ONE 16(10):e0257521.
- Gamaarachchi H, Samarakoon H, et al. 2025. The enduring advantages of the SLOW5 file format for raw nanopore sequencing data. GigaScience giaf118.
Related Skills
- long-read-qc - Assess read length/quality and run health after basecalling
- nanopore-methylation - Pile up the MM/ML tags this skill must request at basecall time
- long-read-alignment - Map the reads; use `-y` to carry MM/ML tags through alignment
- medaka-polishing - Consensus model that must match this basecaller model+version
- clair3-variants - Variant model that must match this basecaller model+version
- genome-assembly/long-read-assembly - Assemble the reads (HERRO-corrected for diploid/T2T)
- genome-assembly/hifi-assembly - PacBio HiFi (basecalled on-instrument, not here)
- epitranscriptomics/m6anet-analysis - ONT direct-RNA m6A from signal
- workflows/longread-sv-pipeline - End-to-end basecall -> align -> SV call