bio-phasing-imputation-genotype-imputation

star 30

Impute missing genotypes using reference panels with Beagle or Minimac4. Use when increasing variant density for GWAS, harmonizing data across genotyping platforms, or inferring variants not directly typed in array data.

mdbabumiamssm By mdbabumiamssm schedule Updated 2/4/2026

name: bio-phasing-imputation-genotype-imputation description: Impute missing genotypes using reference panels with Beagle or Minimac4. Use when increasing variant density for GWAS, harmonizing data across genotyping platforms, or inferring variants not directly typed in array data. tool_type: cli primary_tool: beagle measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools: - read_file - run_shell_command

Genotype Imputation

Beagle Imputation

# Basic imputation
java -jar beagle.jar \
    gt=study.vcf.gz \
    ref=reference_panel.vcf.gz \
    map=genetic_map.txt \
    out=imputed

# Output: imputed.vcf.gz with imputed genotypes

Beagle with Options

java -Xmx32g -jar beagle.jar \
    gt=study.vcf.gz \
    ref=reference_panel.vcf.gz \
    map=genetic_map.txt \
    out=imputed \
    nthreads=8 \
    gp=true \              # Output genotype probabilities
    ap=true \              # Output allele probabilities
    impute=true \          # Perform imputation (default)
    ne=20000               # Effective population size

Impute Per Chromosome

for chr in {1..22}; do
    java -Xmx32g -jar beagle.jar \
        gt=study.chr${chr}.vcf.gz \
        ref=ref.chr${chr}.vcf.gz \
        map=genetic_maps/plink.chr${chr}.GRCh38.map \
        out=imputed.chr${chr} \
        gp=true \
        nthreads=8
done

# Concatenate
bcftools concat imputed.chr*.vcf.gz -Oz -o imputed.all.vcf.gz
bcftools index imputed.all.vcf.gz

IMPUTE5 (Alternative)

# Newer IMPUTE software
impute5 \
    --h reference.bcf \
    --m genetic_map.txt \
    --g study.vcf.gz \
    --r chr22 \
    --o imputed.chr22.vcf.gz \
    --threads 8

Minimac4 (Michigan Imputation Server)

# Often used via web server, but can run locally
minimac4 \
    --refHaps reference.m3vcf.gz \
    --haps study.vcf.gz \
    --prefix imputed \
    --format GT,DS,GP \
    --cpus 8

Input Preparation

# 1. Align to reference (strand, allele order)
bcftools +fixref study.vcf.gz -Oz -o fixed.vcf.gz -- \
    -f reference.fa -m flip

# 2. Filter to sites in reference
bcftools isec -n=2 -w1 fixed.vcf.gz reference_sites.vcf.gz \
    -Oz -o study_overlap.vcf.gz

# 3. Phase first (if not already phased)
java -jar beagle.jar gt=study_overlap.vcf.gz out=phased

# 4. Then impute
java -jar beagle.jar gt=phased.vcf.gz ref=reference.vcf.gz out=imputed

Extract Imputation Quality

# INFO/DR2 or INFO/R2 contains imputation quality
bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/DR2\n' imputed.vcf.gz > info_scores.txt

# Filter by quality
bcftools view -i 'INFO/DR2 > 0.3' imputed.vcf.gz -Oz -o imputed_filtered.vcf.gz

Output Formats

Format Field Description
GT 0|0, 0|1, 1|1 Hard-called genotype
DS 0.0-2.0 Dosage (expected ALT allele count)
GP 0.0-1.0,0.0-1.0,0.0-1.0 Genotype probabilities (AA,AB,BB)
DR2/R2 0.0-1.0 Imputation quality score

Using Dosages for GWAS

import pandas as pd

# Extract dosages
# bcftools query -f '%CHROM\t%POS\t%ID[\t%DS]\n' imputed.vcf.gz > dosages.txt

dosages = pd.read_csv('dosages.txt', sep='\t')

# Dosage-based association (treats uncertainty)
# Use --dosage in PLINK2 or similar
# PLINK2 with dosages
plink2 --vcf imputed.vcf.gz dosage=DS \
    --glm \
    --pheno phenotypes.txt \
    --out gwas_results

Quality Thresholds

Analysis Minimum INFO/R2
GWAS discovery 0.3
GWAS fine-mapping 0.8
Meta-analysis 0.5
Polygenic scores 0.9

Key Parameters

Parameter Beagle Description
gt input VCF Study genotypes
ref reference VCF Reference panel
map genetic map Recombination map
gp true/false Output genotype probs
ne 20000 Effective population size
nthreads N CPU threads
window 40 Window size (cM)

Imputation Servers

For large-scale imputation, consider web-based servers:

  • Michigan Imputation Server: imputationserver.sph.umich.edu
  • TOPMed Imputation Server: imputation.biodatacatalyst.nhlbi.nih.gov
  • Sanger Imputation Server: imputation.sanger.ac.uk
# Prepare input for server
# Most require VCF.GZ per chromosome
for chr in {1..22}; do
    bcftools view -r chr${chr} study.vcf.gz -Oz -o study.chr${chr}.vcf.gz
done

Related Skills

  • phasing-imputation/haplotype-phasing - Pre-phasing step
  • phasing-imputation/reference-panels - Reference panel setup
  • phasing-imputation/imputation-qc - Quality control
  • population-genetics/association-testing - GWAS with imputed data
Install via CLI
npx skills add https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- --skill bio-phasing-imputation-genotype-imputation
Repository Details
star Stars 30
call_split Forks 7
navigation Branch main
article Path SKILL.md
More from Creator
mdbabumiamssm
mdbabumiamssm Explore all skills →