bio-vcf-manipulation

star 30

Merge, concatenate, sort, intersect, and subset VCF files using bcftools. Use when combining variant files, comparing call sets, or restructuring VCF data.

mdbabumiamssm By mdbabumiamssm schedule Updated 2/4/2026

name: bio-vcf-manipulation description: Merge, concatenate, sort, intersect, and subset VCF files using bcftools. Use when combining variant files, comparing call sets, or restructuring VCF data. tool_type: cli primary_tool: bcftools measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools: - read_file - run_shell_command

VCF Manipulation

Merge, concat, sort, and compare VCF files using bcftools.

Operations Overview

Operation Command Use Case
Merge bcftools merge Combine samples from multiple VCFs
Concat bcftools concat Combine regions from multiple VCFs
Sort bcftools sort Sort unsorted VCF
Intersect bcftools isec Compare/intersect call sets
Subset bcftools view Extract samples or regions

bcftools merge

Combine multiple VCF files with different samples at the same positions.

Basic Merge

bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

Merge Multiple Files

bcftools merge *.vcf.gz -Oz -o all_samples.vcf.gz

Merge from File List

# files.txt: one VCF path per line
bcftools merge -l files.txt -Oz -o merged.vcf.gz

Handle Missing Genotypes

# Output missing genotypes as ./. (default)
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

# Output missing as reference (0/0)
bcftools merge --missing-to-ref sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

Force Sample Names

When sample names conflict:

bcftools merge --force-samples sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

Merge Specific Regions

bcftools merge -r chr1:1000000-2000000 sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

bcftools concat

Combine VCF files with same samples from different regions.

Concatenate Chromosomes

bcftools concat chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz -Oz -o genome.vcf.gz

Concatenate All Chromosomes

bcftools concat chr*.vcf.gz -Oz -o genome.vcf.gz

From File List

# files.txt: one VCF path per line (in order)
bcftools concat -f files.txt -Oz -o concatenated.vcf.gz

Allow Overlapping Regions

bcftools concat -a chr1_part1.vcf.gz chr1_part2.vcf.gz -Oz -o chr1.vcf.gz

Remove Duplicates

bcftools concat -a -d all file1.vcf.gz file2.vcf.gz -Oz -o merged.vcf.gz

Options for -d:

  • snps - Remove duplicate SNPs
  • indels - Remove duplicate indels
  • both - Remove duplicate SNPs and indels
  • all - Remove all duplicates
  • exact - Remove exact duplicates only

bcftools sort

Sort VCF by chromosome and position.

Basic Sort

bcftools sort input.vcf -Oz -o sorted.vcf.gz

With Temporary Directory

For large files:

bcftools sort -T /tmp input.vcf.gz -Oz -o sorted.vcf.gz

Memory Limit

bcftools sort -m 4G input.vcf.gz -Oz -o sorted.vcf.gz

bcftools isec

Intersect and compare VCF files.

Find Shared Variants

bcftools isec -p output_dir sample1.vcf.gz sample2.vcf.gz

Creates:

  • 0000.vcf - Private to sample1
  • 0001.vcf - Private to sample2
  • 0002.vcf - Shared (sample1 records)
  • 0003.vcf - Shared (sample2 records)

Output Compressed

bcftools isec -p output_dir -Oz sample1.vcf.gz sample2.vcf.gz

Intersection Only

bcftools isec -p output_dir -n=2 sample1.vcf.gz sample2.vcf.gz
# Only outputs variants present in exactly 2 files

Comparison Options

Flag Description
-n=2 Present in exactly 2 files
-n+2 Present in 2 or more files
-n-2 Present in fewer than 2 files
-n~11 Boolean: file1 AND file2
-n~10 Boolean: file1 AND NOT file2

Two-File Intersection

# Variants in both files
bcftools isec -n=2 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o shared.vcf.gz

# Variants only in sample1
bcftools isec -n~10 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o only_sample1.vcf.gz

Complement Mode

# Variants in file1 not in file2
bcftools isec -C sample1.vcf.gz sample2.vcf.gz -Oz -o unique.vcf.gz

Subsetting VCF Files

Extract Samples

bcftools view -s sample1,sample2 input.vcf.gz -Oz -o subset.vcf.gz

Exclude Samples

bcftools view -s ^sample3 input.vcf.gz -Oz -o without_sample3.vcf.gz

From Sample List File

# samples.txt: one sample name per line
bcftools view -S samples.txt input.vcf.gz -Oz -o subset.vcf.gz

Extract Region

bcftools view -r chr1:1000000-2000000 input.vcf.gz -Oz -o region.vcf.gz

Extract Multiple Regions

bcftools view -R regions.bed input.vcf.gz -Oz -o targets.vcf.gz

Renaming Samples

Single Sample

echo "old_name new_name" > rename.txt
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz

Multiple Samples

# rename.txt format: old_name new_name
cat > rename.txt << EOF
sample1 patient_001
sample2 patient_002
sample3 patient_003
EOF

bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz

Splitting VCF Files

Split by Sample

for sample in $(bcftools query -l input.vcf.gz); do
    bcftools view -s "$sample" input.vcf.gz -Oz -o "${sample}.vcf.gz"
done

Split by Chromosome

for chr in $(bcftools view -h input.vcf.gz | grep "^##contig" | sed 's/.*ID=\([^,]*\).*/\1/'); do
    bcftools view -r "$chr" input.vcf.gz -Oz -o "${chr}.vcf.gz"
done

Split Multiallelic Sites

bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz

Common Workflows

Merge Cohort VCFs

# Create file list
ls *.vcf.gz > files.txt

# Merge all samples
bcftools merge -l files.txt -Oz -o cohort.vcf.gz
bcftools index cohort.vcf.gz

Combine Chromosome VCFs

# After parallel variant calling by chromosome
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz -Oz -o genome.vcf.gz
bcftools index genome.vcf.gz

Compare Two Callers

# Find variants called by both GATK and bcftools
bcftools isec -p comparison gatk.vcf.gz bcftools.vcf.gz

# Count results
wc -l comparison/*.vcf

Extract Passing Variants

bcftools view -f PASS input.vcf.gz -Oz -o pass_only.vcf.gz
bcftools index pass_only.vcf.gz

cyvcf2 Python Operations

Note: True VCF merging (combining samples at matching positions) is complex. Use bcftools merge for production work. cyvcf2 is better for filtering/querying.

Concatenate Records (Not True Merge)

from cyvcf2 import VCF, Writer

# WARNING: This concatenates records, not a true merge
# For actual merging of samples, use bcftools merge
vcf1 = VCF('file1.vcf.gz')
writer = Writer('combined.vcf', vcf1)

for variant in vcf1:
    writer.write_record(variant)

writer.close()
vcf1.close()

Find Shared Positions

from cyvcf2 import VCF

# Load positions from first VCF
vcf1_positions = set()
for variant in VCF('sample1.vcf.gz'):
    vcf1_positions.add((variant.CHROM, variant.POS))

# Check second VCF
shared = 0
unique = 0
for variant in VCF('sample2.vcf.gz'):
    if (variant.CHROM, variant.POS) in vcf1_positions:
        shared += 1
    else:
        unique += 1

print(f'Shared: {shared}')
print(f'Unique to sample2: {unique}')

Quick Reference

Task Command
Merge samples bcftools merge s1.vcf.gz s2.vcf.gz -Oz -o merged.vcf.gz
Concat regions bcftools concat chr1.vcf.gz chr2.vcf.gz -Oz -o all.vcf.gz
Sort VCF bcftools sort input.vcf -Oz -o sorted.vcf.gz
Intersect bcftools isec -p dir a.vcf.gz b.vcf.gz
Extract samples bcftools view -s sample1 input.vcf.gz
Rename samples bcftools reheader -s names.txt input.vcf.gz

Common Errors

Error Cause Solution
different samples merge vs concat confusion Use merge for samples, concat for regions
not sorted Unsorted input to concat Sort first or use -a flag
sample name conflict Duplicate sample names Use --force-samples
index required Missing index for merge/isec Run bcftools index first

Related Skills

  • vcf-basics - View and query VCF files
  • filtering-best-practices - Filter variants before manipulation
  • variant-normalization - Normalize before comparing
  • vcf-statistics - Compare statistics after manipulation
Install via CLI
npx skills add https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- --skill bio-vcf-manipulation
Repository Details
star Stars 30
call_split Forks 7
navigation Branch main
article Path SKILL.md
More from Creator
mdbabumiamssm
mdbabumiamssm Explore all skills →