bio-alignment-validation - SKILL.md Agent Skill

name: bio-alignment-validation description: Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification. tool_type: mixed primary_tool: samtools

Alignment Validation

Post-alignment quality control to verify alignment quality and identify issues.

Insert Size Distribution

Insert size should match library preparation protocol.

samtools stats

samtools stats input.bam > stats.txt
grep "^IS" stats.txt | cut -f2,3 > insert_sizes.txt

Picard CollectInsertSizeMetrics

java -jar picard.jar CollectInsertSizeMetrics \
    I=input.bam \
    O=insert_metrics.txt \
    H=insert_histogram.pdf

Expected Insert Sizes

Library Type	Expected Size
Standard WGS	300-500 bp
PCR-free	350-550 bp
RNA-seq	150-300 bp
ChIP-seq	150-300 bp
ATAC-seq	Multimodal

Python Insert Size Analysis

import pysam
import numpy as np
import matplotlib.pyplot as plt

def get_insert_sizes(bam_file, max_reads=100000):
    sizes = []
    bam = pysam.AlignmentFile(bam_file, 'rb')
    for i, read in enumerate(bam.fetch()):
        if i >= max_reads:
            break
        if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
            sizes.append(read.template_length)
    bam.close()
    return sizes

sizes = get_insert_sizes('sample.bam')
print(f'Median insert size: {np.median(sizes):.0f}')
print(f'Mean insert size: {np.mean(sizes):.0f}')
print(f'Std dev: {np.std(sizes):.0f}')

plt.hist(sizes, bins=100, range=(0, 1000))
plt.xlabel('Insert Size')
plt.ylabel('Count')
plt.savefig('insert_size_dist.pdf')

Proper Pairing Rate

Percentage of reads correctly paired.

samtools flagstat

samtools flagstat input.bam

samtools flagstat input.bam | grep "properly paired"

Calculate Pairing Rate

proper=$(samtools view -c -f 2 input.bam)
mapped=$(samtools view -c -F 4 input.bam)
rate=$(echo "scale=4; $proper / $mapped * 100" | bc)
echo "Proper pairing rate: ${rate}%"

Expected Rates

Metric	Good	Marginal	Poor
Proper pair	> 90%	80-90%	< 80%
Mapped	> 95%	90-95%	< 90%
Singletons	< 5%	5-10%	> 10%

GC Bias

GC content correlation with coverage.

Picard CollectGcBiasMetrics

java -jar picard.jar CollectGcBiasMetrics \
    I=input.bam \
    O=gc_bias_metrics.txt \
    CHART=gc_bias_chart.pdf \
    S=gc_summary.txt \
    R=reference.fa

deepTools computeGCBias

computeGCBias \
    -b input.bam \
    --effectiveGenomeSize 2913022398 \
    -g hg38.2bit \
    -o gc_bias.txt \
    --biasPlot gc_bias.pdf

Interpret GC Bias

Issue	Symptom
Under-representation	Low GC coverage drops
Over-representation	High GC coverage elevated
PCR bias	Strong correlation

Strand Balance

Forward and reverse strand should be balanced.

Calculate Strand Ratio

forward=$(samtools view -c -F 16 input.bam)
reverse=$(samtools view -c -f 16 input.bam)
echo "Forward: $forward"
echo "Reverse: $reverse"
ratio=$(echo "scale=4; $forward / $reverse" | bc)
echo "F/R ratio: $ratio"

Check Strand Bias per Chromosome

for chr in chr1 chr2 chr3; do
    fwd=$(samtools view -c -F 16 input.bam $chr)
    rev=$(samtools view -c -f 16 input.bam $chr)
    echo "$chr: F=$fwd R=$rev ratio=$(echo "scale=2; $fwd/$rev" | bc)"
done

Mapping Quality Distribution

Extract MAPQ Distribution

samtools view input.bam | cut -f5 | sort -n | uniq -c | sort -k2 -n

Calculate Mean MAPQ

samtools view input.bam | awk '{sum+=$5; count++} END {print "Mean MAPQ:", sum/count}'

MAPQ Thresholds

MAPQ	Meaning
0	Multi-mapper
1-10	Low confidence
20-30	Moderate
40+	High confidence
60	Unique (BWA)

Chromosome Coverage Balance

Calculate Per-Chromosome Coverage

samtools idxstats input.bam | awk '{print $1, $3/$2}' | head -25

Check for Aneuploidy/Contamination

samtools idxstats input.bam | awk '$3 > 0 {
    sum += $3
    len[$1] = $2
    reads[$1] = $3
} END {
    for (chr in reads) {
        expected = len[chr] / sum * reads[chr]
        ratio = reads[chr] / expected
        if (ratio < 0.8 || ratio > 1.2) print chr, ratio
    }
}'

Mismatch Rate

Picard CollectAlignmentSummaryMetrics

java -jar picard.jar CollectAlignmentSummaryMetrics \
    I=input.bam \
    R=reference.fa \
    O=alignment_summary.txt

Key Metrics

Metric	Description	Good Value
PCT_PF_READS_ALIGNED	Mapped %	> 95%
PF_MISMATCH_RATE	Mismatches	< 1%
PF_INDEL_RATE	Indels	< 0.1%
STRAND_BALANCE	Strand ratio	~0.5

Comprehensive Validation Script

#!/bin/bash
BAM=$1
REF=$2
NAME=$(basename $BAM .bam)
OUTDIR=${3:-qc}

mkdir -p $OUTDIR

echo "=== Alignment Validation: $NAME ===" | tee $OUTDIR/report.txt

echo -e "\n--- Flagstat ---" | tee -a $OUTDIR/report.txt
samtools flagstat $BAM | tee -a $OUTDIR/report.txt

echo -e "\n--- Mapping Rate ---" | tee -a $OUTDIR/report.txt
mapped=$(samtools view -c -F 4 $BAM)
total=$(samtools view -c $BAM)
rate=$(echo "scale=2; $mapped / $total * 100" | bc)
echo "Mapping rate: ${rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Proper Pairing ---" | tee -a $OUTDIR/report.txt
proper=$(samtools view -c -f 2 $BAM)
pair_rate=$(echo "scale=2; $proper / $mapped * 100" | bc)
echo "Proper pairing: ${pair_rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Insert Size ---" | tee -a $OUTDIR/report.txt
samtools stats $BAM | grep "insert size average" | tee -a $OUTDIR/report.txt

echo -e "\n--- Strand Balance ---" | tee -a $OUTDIR/report.txt
fwd=$(samtools view -c -F 16 $BAM)
rev=$(samtools view -c -f 16 $BAM)
strand_ratio=$(echo "scale=3; $fwd / $rev" | bc)
echo "Forward: $fwd, Reverse: $rev, Ratio: $strand_ratio" | tee -a $OUTDIR/report.txt

echo -e "\n--- Chromosome Coverage ---" | tee -a $OUTDIR/report.txt
samtools idxstats $BAM | head -25 | tee -a $OUTDIR/report.txt

echo -e "\nReport: $OUTDIR/report.txt"

Python Validation Module

import pysam
import numpy as np
from collections import Counter

class AlignmentValidator:
    def __init__(self, bam_file):
        self.bam = pysam.AlignmentFile(bam_file, 'rb')

    def mapping_rate(self):
        stats = self.bam.get_index_statistics()
        mapped = sum(s.mapped for s in stats)
        unmapped = sum(s.unmapped for s in stats)
        return mapped / (mapped + unmapped) * 100

    def proper_pair_rate(self, sample_size=100000):
        proper = 0
        paired = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if read.is_paired:
                paired += 1
                if read.is_proper_pair:
                    proper += 1
        return proper / paired * 100 if paired > 0 else 0

    def mapq_distribution(self, sample_size=100000):
        mapqs = []
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                mapqs.append(read.mapping_quality)
        return Counter(mapqs)

    def strand_balance(self, sample_size=100000):
        forward = 0
        reverse = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                if read.is_reverse:
                    reverse += 1
                else:
                    forward += 1
        return forward / (forward + reverse) if (forward + reverse) > 0 else 0.5

    def report(self):
        print(f'Mapping rate: {self.mapping_rate():.1f}%')
        print(f'Proper pairing: {self.proper_pair_rate():.1f}%')
        print(f'Strand balance: {self.strand_balance():.3f}')

        mapq_dist = self.mapq_distribution()
        high_qual = sum(v for k, v in mapq_dist.items() if k >= 30)
        total = sum(mapq_dist.values())
        print(f'High MAPQ (>=30): {high_qual/total*100:.1f}%')

    def close(self):
        self.bam.close()

validator = AlignmentValidator('sample.bam')
validator.report()
validator.close()

Quality Thresholds Summary

Metric	Good	Warning	Fail
Mapping rate	> 95%	90-95%	< 90%
Proper pairing	> 90%	80-90%	< 80%
Duplicate rate	< 10%	10-20%	> 20%
Strand balance	0.48-0.52	0.45-0.55	Outside
Mean MAPQ	> 40	30-40	< 30
GC bias	< 1.2x	1.2-1.5x	> 1.5x

Related Skills

bam-statistics - Basic flagstat and depth
duplicate-handling - Mark/remove duplicates
alignment-filtering - Filter low-quality reads
chipseq-qc - ChIP-specific metrics