bio-variant-calling-filtering-best-practices

star 30

Comprehensive variant filtering including GATK VQSR, hard filters, bcftools expressions, and quality metric interpretation for SNPs and indels. Use when filtering variants using GATK best practices.

mdbabumiamssm By mdbabumiamssm schedule Updated 2/4/2026

name: bio-variant-calling-filtering-best-practices description: Comprehensive variant filtering including GATK VQSR, hard filters, bcftools expressions, and quality metric interpretation for SNPs and indels. Use when filtering variants using GATK best practices. tool_type: mixed primary_tool: bcftools measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools: - read_file - run_shell_command

Variant Filtering Best Practices

Filter Selection Decision Tree

Is dataset large enough for VQSR? (>30 exomes or WGS)
├── Yes → Use VQSR (machine learning)
└── No → Use hard filters
    ├── Germline → GATK recommended thresholds
    └── Somatic → Caller-specific filters + manual review

GATK Hard Filter Thresholds

# SNPs
gatk VariantFiltration \
    -R reference.fa \
    -V raw_snps.vcf \
    -O filtered_snps.vcf \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    --filter-expression "SOR > 3.0" --filter-name "SOR3"

# Indels
gatk VariantFiltration \
    -R reference.fa \
    -V raw_indels.vcf \
    -O filtered_indels.vcf \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 200.0" --filter-name "FS200" \
    --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
    --filter-expression "SOR > 10.0" --filter-name "SOR10"

Understanding Quality Metrics

Metric Meaning Good Value
QD Quality by Depth >2 (variant quality normalized by depth)
FS Fisher Strand <60 SNP, <200 indel (strand bias)
MQ Mapping Quality >40 (RMS mapping quality)
MQRankSum MQ Rank Sum >-12.5 (ref vs alt mapping quality)
ReadPosRankSum Read Position >-8 (position in read bias)
SOR Strand Odds Ratio <3 SNP, <10 indel (strand bias)
DP Depth Sample-specific, avoid extremes
GQ Genotype Quality >20 (confidence in genotype)

bcftools filter

Soft vs Hard Filtering

# Hard filter (remove variants)
bcftools filter -e 'QUAL<30' input.vcf.gz -o filtered.vcf

# Soft filter (mark, don't remove)
bcftools filter -s 'LowQual' -e 'QUAL<30' input.vcf.gz -o marked.vcf
# Variants failing filter get "LowQual" in FILTER column

# Include instead of exclude
bcftools filter -i 'QUAL>=30' input.vcf.gz -o filtered.vcf

Expression Syntax

Operator Meaning
<, <=, >, >= Comparison
=, == Equals
!= Not equals
&&, || AND, OR
! NOT

Aggregate Functions

Function Description
MIN(x) Minimum across samples
MAX(x) Maximum across samples
AVG(x) Average across samples
SUM(x) Sum across samples

Common bcftools Filters

# Basic quality filter
bcftools filter -i 'QUAL>30 && DP>10' input.vcf -o filtered.vcf

# Complex filter with multiple metrics
bcftools filter -i 'QUAL>30 && INFO/DP>10 && INFO/DP<500 && \
    (INFO/FS<60 || INFO/FS=".") && INFO/MQ>40' input.vcf -o filtered.vcf

# Genotype-level filters
bcftools filter -i 'FMT/DP>10 && FMT/GQ>20' input.vcf -o filtered.vcf

# Remove filtered sites
bcftools view -f PASS input.vcf -o passed.vcf

# Keep only biallelic SNPs
bcftools view -m2 -M2 -v snps input.vcf -o biallelic_snps.vcf

# Check for missing values
bcftools filter -e 'QUAL="."' input.vcf.gz  # Exclude missing QUAL
bcftools filter -i 'INFO/DP!="."' input.vcf.gz  # Include only if DP exists

bcftools view Filtering

Filter by Variant Type

# SNPs only
bcftools view -v snps input.vcf.gz -o snps.vcf.gz

# Indels only
bcftools view -v indels input.vcf.gz -o indels.vcf.gz

# Exclude SNPs
bcftools view -V snps input.vcf.gz -o no_snps.vcf.gz

Filter by Region

bcftools view -r chr1:1000000-2000000 input.vcf.gz -o region.vcf.gz

# Multiple regions
bcftools view -r chr1:1000-2000,chr2:3000-4000 input.vcf.gz

Filter by Samples

# Include samples
bcftools view -s sample1,sample2 input.vcf.gz -o subset.vcf.gz

# Exclude samples
bcftools view -s ^sample3,sample4 input.vcf.gz -o subset.vcf.gz

Depth Filtering

# Calculate depth percentiles
bcftools query -f '%DP\n' input.vcf | \
    sort -n | \
    awk '{a[NR]=$1} END {print "5th:", a[int(NR*0.05)], "95th:", a[int(NR*0.95)]}'

# Filter to middle 90% of depth distribution
bcftools filter -i 'INFO/DP>10 && INFO/DP<200' input.vcf -o depth_filtered.vcf

Allele Frequency Filters

# Minor allele frequency filter (population data)
bcftools filter -i 'INFO/AF>0.01 && INFO/AF<0.99' input.vcf -o maf_filtered.vcf

# Allele balance for heterozygotes
bcftools filter -i 'GT="het" -> (AD[1]/(AD[0]+AD[1]) > 0.2 && AD[1]/(AD[0]+AD[1]) < 0.8)' \
    input.vcf -o ab_filtered.vcf

Region-Based Filtering

# Exclude problematic regions
bcftools view -T ^blacklist.bed input.vcf -o filtered.vcf

# Keep only exonic regions
bcftools view -R exons.bed input.vcf -o exonic.vcf

Sample-Level Filtering

# Missing genotype rate per sample
bcftools stats -s - input.vcf | grep ^PSC | cut -f3,14

# Filter samples with >10% missing
bcftools view -S good_samples.txt input.vcf -o sample_filtered.vcf

# Filter sites with >5% missing genotypes
bcftools filter -i 'F_MISSING<0.05' input.vcf -o site_filtered.vcf

Variant Type-Specific Filters

# Separate SNPs and indels for different filters
bcftools view -v snps input.vcf -o snps.vcf
bcftools view -v indels input.vcf -o indels.vcf

# Apply different thresholds
bcftools filter -i 'QUAL>30 && INFO/FS<60' snps.vcf -o snps_filtered.vcf
bcftools filter -i 'QUAL>30 && INFO/FS<200' indels.vcf -o indels_filtered.vcf

# Merge back
bcftools concat snps_filtered.vcf indels_filtered.vcf | bcftools sort -o merged.vcf

Multi-Step Pipeline Filtering

bcftools filter -e 'QUAL<30' input.vcf.gz | \
    bcftools filter -e 'INFO/DP<10' | \
    bcftools view -v snps -Oz -o filtered_snps.vcf.gz

# Named soft filters
bcftools filter -s 'LowQual' -e 'QUAL<30' input.vcf.gz | \
    bcftools filter -s 'LowDepth' -e 'INFO/DP<10' -Oz -o marked.vcf.gz

# Later, remove all filtered variants
bcftools view -f PASS marked.vcf.gz -Oz -o pass_only.vcf.gz

Somatic Variant Filters

# Mutect2 specific
gatk FilterMutectCalls \
    -R reference.fa \
    -V mutect2_raw.vcf \
    --contamination-table contamination.table \
    --tumor-segmentation segments.table \
    -O mutect2_filtered.vcf

# Additional somatic filters
bcftools filter -i 'INFO/TLOD>6.3 && FMT/AF[0]>0.05 && FMT/DP[0]>20' \
    mutect2_filtered.vcf -o somatic_final.vcf

Python Filtering (cyvcf2)

from cyvcf2 import VCF, Writer
import numpy as np

vcf = VCF('input.vcf.gz')
writer = Writer('filtered.vcf', vcf)

for variant in vcf:
    qual = variant.QUAL or 0
    dp = variant.INFO.get('DP', 0)
    fs = variant.INFO.get('FS', 0)
    mq = variant.INFO.get('MQ', 0)

    if qual >= 30 and dp >= 10 and fs <= 60 and mq >= 40:
        writer.write_record(variant)

writer.close()
vcf.close()

Filter by Genotype

from cyvcf2 import VCF, Writer

vcf = VCF('input.vcf.gz')
writer = Writer('filtered.vcf', vcf)

for variant in vcf:
    # Keep if at least one sample is heterozygous
    # gt_types: 0=HOM_REF, 1=HET, 2=UNKNOWN, 3=HOM_ALT
    if 1 in variant.gt_types:
        writer.write_record(variant)

writer.close()
vcf.close()

Filter by Sample Depth

from cyvcf2 import VCF, Writer
import numpy as np

vcf = VCF('input.vcf.gz')
writer = Writer('filtered.vcf', vcf)

for variant in vcf:
    depths = variant.format('DP')
    if depths is not None and np.min(depths) >= 10:
        writer.write_record(variant)

writer.close()
vcf.close()

Validate Filtering

# Compare before/after stats
bcftools stats input.vcf > before_stats.txt
bcftools stats filtered.vcf > after_stats.txt

# Ti/Tv ratio (should be ~2.1 for WGS, ~2.8-3.3 for exomes)
bcftools stats filtered.vcf | grep 'TSTV'

# Count variants per filter
bcftools query -f '%FILTER\n' filtered.vcf | sort | uniq -c

Quick Reference

Task Command
Quality filter bcftools filter -e 'QUAL<30' in.vcf.gz
Depth filter bcftools filter -e 'INFO/DP<10' in.vcf.gz
SNPs only bcftools view -v snps in.vcf.gz
Indels only bcftools view -v indels in.vcf.gz
PASS only bcftools view -f PASS in.vcf.gz
Soft filter bcftools filter -s 'LowQ' -e 'QUAL<30' in.vcf.gz
Region bcftools view -r chr1:1-1000 in.vcf.gz
Biallelic bcftools view -m2 -M2 in.vcf.gz

Common Errors

Error Cause Solution
no such INFO tag Tag not in VCF Check header with bcftools view -h
syntax error Invalid expression Check operator syntax (|| not or)
empty output Filter too strict Relax thresholds

Related Skills

  • variant-calling/variant-calling - Generate VCF files
  • variant-calling/gatk-variant-calling - GATK VQSR
  • variant-calling/variant-annotation - Annotation after filtering
  • variant-calling/vcf-statistics - Evaluate filter effects
Install via CLI
npx skills add https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- --skill bio-variant-calling-filtering-best-practices
Repository Details
star Stars 30
call_split Forks 7
navigation Branch main
article Path SKILL.md
More from Creator
mdbabumiamssm
mdbabumiamssm Explore all skills →