bio-alignment-pairwise

star 30

Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.

mdbabumiamssm By mdbabumiamssm schedule Updated 2/4/2026

name: bio-alignment-pairwise description: Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences. tool_type: python primary_tool: Bio.Align measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools: - read_file - run_shell_command

Pairwise Sequence Alignment

Align two sequences using dynamic programming algorithms (Needleman-Wunsch for global, Smith-Waterman for local).

Required Import

from Bio.Align import PairwiseAligner
from Bio.Seq import Seq
from Bio import SeqIO

Core Concepts

Mode Algorithm Use Case
global Needleman-Wunsch Full-length alignment, similar-length sequences
local Smith-Waterman Find best matching regions, different-length sequences

Creating an Aligner

# Basic aligner with defaults
aligner = PairwiseAligner()

# Configure mode and scoring
aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

# For protein alignment with substitution matrix
from Bio.Align import substitution_matrices
aligner = PairwiseAligner(mode='global', substitution_matrix=substitution_matrices.load('BLOSUM62'))

Performing Alignments

seq1 = Seq('ACCGGTAACGTAG')
seq2 = Seq('ACCGTTAACGAAG')

# Get all optimal alignments
alignments = aligner.align(seq1, seq2)
print(f'Found {len(alignments)} optimal alignments')
print(alignments[0])  # Print first alignment

# Get score only (faster for large sequences)
score = aligner.score(seq1, seq2)

Alignment Output Format

target            0 ACCGGTAACGTAG 13
                  0 |||||.||||.|| 13
query             0 ACCGTTAACGAAG 13

Accessing Alignment Data

alignment = alignments[0]

# Basic properties
print(alignment.score)                    # Alignment score
print(alignment.shape)                    # (num_seqs, alignment_length)
print(len(alignment))                     # Alignment length

# Get aligned sequences with gaps
target_aligned = alignment[0, :]          # First sequence (target) with gaps
query_aligned = alignment[1, :]           # Second sequence (query) with gaps

# Get coordinate mapping
print(alignment.aligned)                  # Array of aligned segment coordinates
print(alignment.coordinates)              # Full coordinate array

Alignment Counts (Identities, Mismatches, Gaps)

alignment = alignments[0]
counts = alignment.counts()

print(f'Identities: {counts.identities}')
print(f'Mismatches: {counts.mismatches}')
print(f'Gaps: {counts.gaps}')

# Calculate percent identity
total_aligned = counts.identities + counts.mismatches
percent_identity = counts.identities / total_aligned * 100
print(f'Percent identity: {percent_identity:.1f}%')

Common Scoring Configurations

DNA/RNA Alignment

aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

Protein Alignment

from Bio.Align import substitution_matrices
blosum62 = substitution_matrices.load('BLOSUM62')
aligner = PairwiseAligner(mode='global', substitution_matrix=blosum62, open_gap_score=-11, extend_gap_score=-1)

Local Alignment (Find Best Region)

aligner = PairwiseAligner(mode='local', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

Semiglobal (Overlap/Extension)

# Allow free end gaps on query (useful for primer alignment)
aligner = PairwiseAligner(mode='global')
aligner.query_left_open_gap_score = 0
aligner.query_left_extend_gap_score = 0
aligner.query_right_open_gap_score = 0
aligner.query_right_extend_gap_score = 0

Available Substitution Matrices

from Bio.Align import substitution_matrices
print(substitution_matrices.load())  # List all available matrices

# Common matrices
blosum62 = substitution_matrices.load('BLOSUM62')  # General protein
blosum80 = substitution_matrices.load('BLOSUM80')  # Closely related proteins
pam250 = substitution_matrices.load('PAM250')      # Distantly related proteins

Working with SeqRecord Objects

from Bio import SeqIO

records = list(SeqIO.parse('sequences.fasta', 'fasta'))
seq1, seq2 = records[0].seq, records[1].seq

aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1)
alignments = aligner.align(seq1, seq2)

Iterating Over Multiple Alignments

# Limit number of alignments returned (memory efficient)
aligner.max_alignments = 100

for i, alignment in enumerate(alignments):
    print(f'Alignment {i+1}: score={alignment.score}')
    if i >= 4:
        break

Substitution Matrix from Alignment

alignment = alignments[0]
substitutions = alignment.substitutions

# View as array (rows=target, cols=query)
print(substitutions)

# Access specific substitution counts
# substitutions['A', 'T'] gives count of A aligned to T

Export Alignment to Different Formats

alignment = alignments[0]

# Various output formats
print(format(alignment, 'fasta'))     # FASTA format
print(format(alignment, 'clustal'))   # Clustal format
print(format(alignment, 'psl'))       # PSL format (BLAT)
print(format(alignment, 'sam'))       # SAM format

Quick Reference: Scoring Parameters

Parameter Description Typical DNA Typical Protein
match_score Score for identical bases 1-2 Use matrix
mismatch_score Penalty for mismatches -1 to -3 Use matrix
open_gap_score Cost to start a gap -5 to -15 -10 to -12
extend_gap_score Cost per gap extension -0.5 to -2 -0.5 to -1
substitution_matrix Scoring matrix N/A BLOSUM62

Common Errors

Error Cause Solution
OverflowError Too many optimal alignments Set aligner.max_alignments
Low scores Wrong scoring scheme Use substitution matrix for proteins
No alignments in local mode Scores all negative Ensure match_score > 0

Decision Tree: Choosing Alignment Mode

Need full-length comparison?
├── Yes → Use mode='global'
│   └── Sequences similar length?
│       ├── Yes → Standard global
│       └── No → Consider semiglobal (free end gaps)
└── No → Use mode='local'
    └── Find best matching regions only

Related Skills

  • alignment-io - Save alignments to files in various formats
  • msa-parsing - Work with multiple sequence alignments
  • msa-statistics - Calculate identity, similarity metrics
  • sequence-manipulation/motif-search - Pattern matching in sequences
Install via CLI
npx skills add https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- --skill bio-alignment-pairwise
Repository Details
star Stars 30
call_split Forks 7
navigation Branch main
article Path SKILL.md
More from Creator
mdbabumiamssm
mdbabumiamssm Explore all skills →