bio-variant-normalization

star 30

Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis.

mdbabumiamssm By mdbabumiamssm schedule Updated 2/4/2026

name: bio-variant-normalization description: Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis. tool_type: cli primary_tool: bcftools measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools: - read_file - run_shell_command

Variant Normalization

Left-align indels and split multiallelic sites using bcftools norm.

Why Normalize?

The same variant can be represented multiple ways:

# Same deletion, different representations
chr1  100  ATCG  A      (right-aligned)
chr1  100  ATC   A      (left-aligned, normalized)
chr1  101  TCG   T      (different position)

Normalization ensures consistent representation for:

  • Comparing variants from different callers
  • Database lookups (dbSNP, ClinVar)
  • Merging VCF files

bcftools norm

Left-Align Indels

bcftools norm -f reference.fa input.vcf.gz -Oz -o normalized.vcf.gz

Requires reference FASTA to determine left-most representation.

Check for Normalization Issues

bcftools norm -f reference.fa -c s input.vcf.gz > /dev/null
# Reports REF allele mismatches

Check modes (-c):

  • w - Warn on mismatch (default)
  • e - Error on mismatch
  • x - Exclude mismatches
  • s - Set correct REF from reference

Multiallelic Sites

Split Multiallelic to Biallelic

bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz

Before:

chr1  100  .  A  G,T  30  PASS  .  GT  1/2

After:

chr1  100  .  A  G  30  PASS  .  GT  1/0
chr1  100  .  A  T  30  PASS  .  GT  0/1

Split SNPs Only

bcftools norm -m-snps input.vcf.gz -Oz -o split_snps.vcf.gz

Split Indels Only

bcftools norm -m-indels input.vcf.gz -Oz -o split_indels.vcf.gz

Join Biallelic to Multiallelic

bcftools norm -m+any input.vcf.gz -Oz -o merged.vcf.gz

Split Options

Option Description
-m-any Split all multiallelic sites
-m-snps Split multiallelic SNPs only
-m-indels Split multiallelic indels only
-m-both Split SNPs and indels separately
-m+any Join biallelic sites into multiallelic
-m+snps Join biallelic SNPs
-m+indels Join biallelic indels
-m+both Join SNPs and indels separately

Combined Normalization

Standard Normalization Pipeline

bcftools norm -f reference.fa -m-any input.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz

This:

  1. Left-aligns indels
  2. Splits multiallelic sites

Remove Duplicates After Splitting

bcftools norm -f reference.fa -m-any -d exact input.vcf.gz -Oz -o normalized.vcf.gz

Duplicate removal options (-d):

  • exact - Remove exact duplicates
  • snps - Remove duplicate SNPs
  • indels - Remove duplicate indels
  • both - Remove duplicate SNPs and indels
  • all - Remove all duplicates
  • none - Keep duplicates (default)

Fixing Reference Alleles

Fix Mismatches from Reference

bcftools norm -f reference.fa -c s input.vcf.gz -Oz -o fixed.vcf.gz

This sets REF alleles to match the reference genome.

Exclude Mismatches

bcftools norm -f reference.fa -c x input.vcf.gz -Oz -o clean.vcf.gz

Removes variants where REF doesn't match reference.

Atomize Complex Variants

Split MNPs to SNPs

bcftools norm --atomize input.vcf.gz -Oz -o atomized.vcf.gz

Before:

chr1  100  .  ATG  GCA  30  PASS

After:

chr1  100  .  A  G  30  PASS
chr1  101  .  T  C  30  PASS
chr1  102  .  G  A  30  PASS

Atomize and Left-Align

bcftools norm -f reference.fa --atomize input.vcf.gz -Oz -o atomized.vcf.gz

Old to New Format

Update VCF Version

bcftools norm --old-rec-tag OLD input.vcf.gz -Oz -o updated.vcf.gz

Tags original record for reference.

Common Workflows

Before Comparing Callers

# Normalize both VCFs the same way
for vcf in caller1.vcf.gz caller2.vcf.gz; do
    base=$(basename "$vcf" .vcf.gz)
    bcftools norm -f reference.fa -m-any "$vcf" -Oz -o "${base}.norm.vcf.gz"
    bcftools index "${base}.norm.vcf.gz"
done

# Now compare
bcftools isec -p comparison caller1.norm.vcf.gz caller2.norm.vcf.gz

Before Database Annotation

bcftools norm -f reference.fa -m-any variants.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
# Now annotate against dbSNP, ClinVar, etc.

Prepare for GWAS

bcftools norm -f reference.fa -m-any -d exact input.vcf.gz | \
    bcftools view -v snps -Oz -o gwas_ready.vcf.gz
bcftools index gwas_ready.vcf.gz

cyvcf2 Normalization Check

Check if Variants Need Normalization

from cyvcf2 import VCF

def needs_normalization(variant):
    # Check for multiallelic
    if len(variant.ALT) > 1:
        return True

    # Check for complex variants (potential MNPs)
    ref, alt = variant.REF, variant.ALT[0]
    if len(ref) > 1 and len(alt) > 1 and len(ref) == len(alt):
        return True

    return False

count = 0
for variant in VCF('input.vcf.gz'):
    if needs_normalization(variant):
        count += 1

print(f'Variants needing normalization: {count}')

Count Multiallelic Sites

from cyvcf2 import VCF

multiallelic = 0
total = 0

for variant in VCF('input.vcf.gz'):
    total += 1
    if len(variant.ALT) > 1:
        multiallelic += 1

print(f'Total variants: {total}')
print(f'Multiallelic sites: {multiallelic}')
print(f'Percentage: {multiallelic/total*100:.1f}%')

Quick Reference

Task Command
Left-align indels bcftools norm -f ref.fa in.vcf.gz
Split multiallelic bcftools norm -m-any in.vcf.gz
Join to multiallelic bcftools norm -m+any in.vcf.gz
Full normalization bcftools norm -f ref.fa -m-any in.vcf.gz
Fix REF alleles bcftools norm -f ref.fa -c s in.vcf.gz
Remove duplicates bcftools norm -d exact in.vcf.gz
Atomize MNPs bcftools norm --atomize in.vcf.gz

Common Errors

Error Cause Solution
REF does not match Wrong reference Use same reference as caller
not sorted Unsorted input Run bcftools sort first
duplicate records Same position twice Use -d to remove

Related Skills

  • variant-calling - Generate VCF files
  • filtering-best-practices - Filter after normalization
  • vcf-manipulation - Compare normalized VCFs
  • variant-annotation - Annotate normalized variants
Install via CLI
npx skills add https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- --skill bio-variant-normalization
Repository Details
star Stars 30
call_split Forks 7
navigation Branch main
article Path SKILL.md
More from Creator
mdbabumiamssm
mdbabumiamssm Explore all skills →