name: bio-long-read-sequencing-clair3-variants description: Calls germline small variants (SNPs and indels) from Oxford Nanopore and PacBio HiFi long reads with Clair3, a two-stage (pileup + full-alignment) deep-learning caller, selecting the chemistry- and basecaller-version-matched model, enabling read-based phasing, and benchmarking against GIAB with stratification. Covers why the model string is the experiment (no auto-detection, silent degradation on mismatch), why ONT homopolymer/STR indels are the residual error whole-genome F1 hides, and the somatic/trio/RNA boundary to the ClairS/Clair3-Trio family. Use when calling germline SNVs/indels from ONT or HiFi BAMs, choosing a Clair3 model, phasing variants, or benchmarking long-read calls. tool_type: cli primary_tool: Clair3
Version Compatibility
Reference examples tested with: Clair3 2.0+, whatshap 2.0+, bcftools 1.19+, hap.py 0.3.15+.
Before using code patterns, verify installed versions match. If versions differ:
- CLI:
<tool> --versionthen<tool> --helpto confirm flags
Results depend on inputs that outlive the binary version - record them:
- The Clair3 MODEL must match the platform + chemistry + basecaller tier + basecaller version (e.g.
r1041_e82_400bps_sup_v500). There is NO auto-detection;--model_pathis mandatory and a mismatch silently degrades calls. - Clair3 v2 moved TensorFlow -> PyTorch; models are
pileup.pt/full_alignment.pt. v1 TensorFlow models do NOT load in v2. - The full ONT model set (every version, hac/fast,
_with_mvsignal-aware) lives in the rerioclair3_models/repo; only a subset is bundled.
If code throws an error, introspect the installed tool (run_clair3.sh --help) and adapt the example to the actual API rather than retrying.
Clair3 Variant Calling
"Call variants from my long reads" -> Run Clair3 with the model that matches how the reads were basecalled, phase, and benchmark with stratification - because the model string, not the command, determines accuracy.
- CLI:
run_clair3.sh --bam_fn=aln.bam --ref_fn=ref.fa --output=out/ --threads=16 --platform=ont --model_path=/models/r1041_e82_400bps_sup_v500
Scope: germline diploid SNPs + small indels. NOT structural variants (-> structural-variants), NOT somatic/mosaic (-> ClairS/ClairS-TO), NOT RNA (-> Clair3-RNA).
The Single Most Important Modern Insight -- The Model String Is the Experiment, and ONT Indels Hide in the Strata
Clair3's accuracy is gated by two facts a naive user misses:
- The model is hand-picked and a mismatch fails silently. There is no auto-detection - the user must point
--model_pathat a specific model folder. Three axes must ALL match: chemistry (r941vsr1041), basecaller tier (fast/hac/sup), and basecaller version (g5014/v430/v500/v520), plus the optional_with_mvsignal-aware axis if the BAM has Doradomvtags. Wrong model = no crash, no warning, measurably worse calls (indels most). Derive the model from the basecaller string in the run metadata; pick the model version closest to but not above the basecaller version. - ONT indels in homopolymers/STRs are the residual error whole-genome F1 conceals. Even on R10.4.1 sup, insertions/deletions in homopolymer runs and short tandem repeats are the weak point (G/C homopolymers worst), because the pore cannot reliably count identical consecutive bases. A genome-wide indel F1 of ~99.5% hides much lower performance inside LowComplexity/homopolymer strata - exactly the medically relevant loci. HiFi largely solves this; do not transfer ONT-indel pessimism to HiFi. Always benchmark with GIAB stratification, never a single global number.
Two-Stage Architecture
Clair3 "symphonizes" two networks: a fast pileup model (summarized per-position statistics) that calls the large majority of sites, and a slow full-alignment model (haplotype-resolved read tensor) that re-evaluates only the uncertain subset. Internally Clair3 phases the top het-SNP pileup calls with WhatsHap, haplotags the BAM, and feeds the haplotagged reads to the full-alignment model - which is why read-based phasing buys ~6% indel F1, not cosmetics. Output: merge_output.vcf.gz (final).
Model Selection
Model name anatomy (r1041_e82_400bps_sup_v500): pore (r1041=R10.4.1), flowcell (e82), speed (400bps), basecaller tier (sup/hac/fast), basecaller version (v500=Dorado 5.0.0, g5014=Guppy 5.0.14). The _with_mv suffix uses Dorado move-table tags for best accuracy when present.
| Data | --platform |
Model |
|---|---|---|
| ONT R10.4.1 sup, Dorado v5.x, mv tags present | ont | r1041_e82_400bps_sup_v520_with_mv |
| ONT R10.4.1 sup, Dorado v5.0.0 | ont | r1041_e82_400bps_sup_v500 |
| ONT R10.4.1 hac | ont | r1041_e82_400bps_hac_v500/_v520 |
| ONT R9.4.1 (any tier) | ont | r941_prom_sup_g5014 |
| PacBio HiFi Revio | hifi | hifi_revio |
| PacBio HiFi Sequel II | hifi | hifi_sequel2 |
| Illumina (supported) | ilmn | ilmn |
| PacBio CLR | - | not supported -> PEPPER-Margin-DeepVariant |
Decision Tree by Scenario
| Scenario | Tool | Why |
|---|---|---|
| Germline SNV/indel, single sample | Clair3 | this skill |
| Somatic, paired tumor-normal | ClairS | VAF-aware; Clair3 germline priors cannot find low-VAF somatic |
| Somatic, tumor-only | ClairS-TO | tumor-only ensemble |
| De novo / Mendelian trio | Clair3-Nova / Clair3-Trio | family-aware |
| Long-read RNA variants | Clair3-RNA | RNA model |
| ONT R10.4.1, also considering DeepVariant | either | neck-and-neck on R10 sup; native-ONT DeepVariant (Kolesnikov 2024) superseded PEPPER-Margin |
| Non-human / draft / bacterial reference | Clair3 + --include_all_ctgs |
default calls only chr1-22,X,Y -> empty output otherwise |
| Cohort joint genotyping | Clair3 gVCF -> GLnexus | bcftools merge on gVCFs is NOT joint genotyping |
Core Commands
# Germline ONT calling (model MUST match the basecaller)
run_clair3.sh \
--bam_fn=aln.bam --ref_fn=ref.fa --output=clair3_out/ \
--threads=16 --platform=ont \
--model_path=/opt/models/r1041_e82_400bps_sup_v500
# final VCF: clair3_out/merge_output.vcf.gz
# Phase the final output VCF (WhatsHap); --longphase_for_phasing swaps only the INTERNAL
# phaser to LongPhase (faster, SV-aware). For a LongPhase-phased final VCF use
# --use_longphase_for_final_output_phasing instead of --enable_phasing.
run_clair3.sh ... --enable_phasing --longphase_for_phasing
# Phased calls go to clair3_out/phased_merge_output.vcf.gz; merge_output.vcf.gz stays UNPHASED.
# Non-human / draft assembly reference - call ALL contigs
run_clair3.sh ... --include_all_ctgs
# Targeted / amplicon panel
run_clair3.sh ... --bed_fn=panel.bed --gvcf
# Benchmark against GIAB with stratification (the step that reveals ONT indel errors)
hap.py giab_truth.vcf.gz clair3_out/merge_output.vcf.gz \
-f giab_confident.bed -r ref.fa --engine=vcfeval \
--stratification giab_stratifications.tsv -o bench/hg002
Per-Method Failure Modes
Silent model mismatch
Trigger: --model_path pointing at a model that does not match the basecaller chemistry/tier/version. Mechanism: no auto-detection; the wrong network runs. Symptom: no error, lower F1 (indels most). Fix: derive the model from the basecaller string; verify the folder exists (rerio for the full set); for v2 ensure .pt models.
Clair3 found nothing on a non-human reference
Trigger: bacterial genome or draft assembly without chr1-22,X,Y names. Mechanism: Clair3 calls only standard human contigs by default. Symptom: near-empty VCF. Fix: --include_all_ctgs.
Global F1 looks great, clinical genes are wrong
Trigger: reporting only whole-genome F1. Mechanism: ONT indel errors concentrate in homopolymer/STR/low-complexity strata. Symptom: ~99.5% global indel F1 but much lower in LowComplexity. Fix: stratify with GIAB BEDs (Dwarshuis 2024); use CMRG for medically relevant genes.
Treating Clair3 as a somatic caller
Trigger: lowering --snp_min_af/--indel_min_af to catch low-VAF variants. Mechanism: germline model expects ~0.5/1.0 allele fractions, is not VAF-aware. Symptom: germline-model false positives at low AF, missed true somatic. Fix: ClairS (paired) / ClairS-TO (tumor-only).
v1 model with v2 Clair3
Trigger: an old TensorFlow model dir with Clair3 v2. Mechanism: v2 needs PyTorch .pt models. Symptom: model load failure. Fix: use pileup.pt/full_alignment.pt models (Converted Rerio).
Quantitative Thresholds
| Threshold | Source | Rationale |
|---|---|---|
| Recommended depth ~20-60x | Clair3 guidance | sensitivity (hets, indels) falls off below ~20x; --min_coverage default 2 is a floor, not a recommendation |
| Phasing buys ~6% indel F1 | Zheng 2022 | haplotagged reads disambiguate indel alleles in repeats |
| ONT R10.4.1 sup: SNP F1 ~99.99%, indel F1 ~99.5% | GIAB benchmarks | indel residual lives in homopolymer/STR strata |
--var_pct_full 0.3 (default) |
Clair3 README | fraction of low-quality pileup calls re-run by full-alignment; raise for recall, slower |
| Stratify with GIAB / CMRG | Dwarshuis 2024 | global F1 hides the ONT indel problem |
Common Errors
| Error / symptom | Cause | Solution |
|---|---|---|
| Empty/near-empty VCF on non-human ref | default calls only chr1-22,X,Y | --include_all_ctgs |
| Model fails to load | v1 TF model with v2 Clair3 | use .pt (PyTorch) models |
--model_path .../models/ont not found |
no generic ont/hifi model |
point at a specific model subfolder |
| Worse-than-expected indels | wrong-version or wrong-tier model | match the basecaller model exactly |
| "joint genotyping" gave odd merges | bcftools merge on gVCFs is not joint calling |
use GLnexus |
| Looking for somatic/low-VAF variants | germline caller | use ClairS / ClairS-TO |
References
- Zheng Z, Li S, Su J, Leung AWS, Lam TW, Luo R. 2022. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling (Clair3). Nat Comput Sci 2:797-803.
- Zheng Z, He M, Yu X, et al. 2026. Accelerated long-read variant calling with Clair3 for whole-genome sequencing. Bioinformatics (advance access) btag181.
- Kolesnikov A, Cook D, Nattestad M, et al. 2024. Local read haplotagging enables accurate long-read small variant calling. Nat Commun 15:5907.
- Dwarshuis N, Kalra D, McDaniel J, et al. 2024. The GIAB genomic stratifications resource for human reference genomes. Nat Commun 15:9029.
- Lin JH, Chen LC, Yu SC, Huang YT. 2022. LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants. Bioinformatics 38(7):1816-1822.
- Chen L, Zheng Z, Su J, et al. 2025. ClairS-TO: a deep-learning method for long-read tumor-only somatic small variant calling. Nat Commun 16:9630.
Related Skills
- basecalling - The basecaller model+version the Clair3 model must match
- long-read-alignment - Produces the BAM (keep
--MD; use minimap2 >=2.28) - haplotype-phasing - whatshap/longphase phasing and haplotagging Clair3 uses internally
- medaka-polishing - ONT consensus; medaka diploid variant calling is deprecated in favor of Clair3
- structural-variants - SVs are out of Clair3's scope (Sniffles2/cuteSV)
- variant-calling/deepvariant - DeepVariant native ONT/HiFi models (neck-and-neck on R10)
- variant-calling/vcf-statistics - Summarize/filter the VCF Clair3 emits
- clinical-databases/variant-prioritization - Prioritize the called variants