tooluniverse-epigenomics - SKILL.md Agent Skill

name: tooluniverse-epigenomics description: Production-ready genomics and epigenomics data processing for BixBench questions. Handles methylation array analysis (CpG filtering, differential methylation, age-related CpG detection, chromosome-level density), ChIP-seq peak analysis (peak calling, motif enrichment, coverage stats), ATAC-seq chromatin accessibility, multi-omics integration (expression + methylation correlation), and genome-wide statistics. Pure Python computation (pandas, scipy, numpy, pysam, statsmodels) plus ToolUniverse annotation tools (Ensembl, ENCODE, SCREEN, JASPAR, ReMap, RegulomeDB, ChIPAtlas). Supports BED, BigWig, methylation beta-value matrices, Illumina manifest files, and multi-sample clinical data. Use when processing methylation data, ChIP-seq peaks, ATAC-seq signals, or answering questions about CpG sites, differential methylation, chromatin accessibility, histone marks, or epigenomic statistics.

Genomics and Epigenomics Data Processing

Production-ready computational skill for processing and analyzing epigenomics data. Combines local Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for regulatory context. Designed to solve BixBench-style questions about methylation, ChIP-seq, ATAC-seq, and multi-omics integration.

When to Use This Skill

Triggers:

User provides methylation data (beta-value matrices, Illumina arrays) and asks about CpG sites
Questions about differential methylation analysis
Age-related CpG detection or epigenetic clock questions
Chromosome-level methylation density or statistics
ChIP-seq peak files (BED format) with analysis questions
ATAC-seq chromatin accessibility questions
Multi-omics integration (expression + methylation, expression + ChIP-seq)
Genome-wide epigenomic statistics
Questions mentioning "methylation", "CpG", "ChIP-seq", "ATAC-seq", "histone", "chromatin", "epigenetic"
Questions about missing data across clinical/genomic/epigenomic modalities
Regulatory element annotation for processed epigenomic data

NOT for (use other skills instead):

Gene regulation lookup without data files -> Use existing epigenomics annotation pattern
RNA-seq differential expression -> Use tooluniverse-rnaseq-deseq2
Variant calling/annotation from VCF -> Use tooluniverse-variant-analysis
Gene enrichment analysis -> Use tooluniverse-gene-enrichment

Required Python Packages

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.multitest as mt
# Optional: pysam (BAM/CRAM), gseapy (enrichment)
from tooluniverse import ToolUniverse

KEY PRINCIPLES

Data-first approach - Load and inspect data files BEFORE any analysis
Question-driven - Parse what the user is actually asking and extract the specific numeric answer
File format detection - Automatically detect methylation arrays, BED files, BigWig, clinical data
Coordinate system awareness - Track genome build (hg19, hg38, mm10), handle chr prefix differences
Statistical rigor - Proper multiple testing correction, effect size filtering, sample size awareness
Missing data handling - Explicitly report and handle NaN/missing values
Chromosome normalization - Always normalize chromosome names (chr1 vs 1, chrX vs X)
Report-first - Create output file first, populate progressively
English-first queries - Use English in all tool calls

Complete Workflow

Phase 0: Question Parsing and Data Discovery

CRITICAL FIRST STEP: Before writing ANY code, parse the question to identify what is being asked and what data files are available.

import os, glob
data_dir = "."
all_files = glob.glob(os.path.join(data_dir, "**/*"), recursive=True)

methylation_files = [f for f in all_files if any(x in f.lower() for x in
    ['methyl', 'beta', 'cpg', 'illumina', '450k', '850k', 'epic', 'mval'])]
chipseq_files = [f for f in all_files if any(x in f.lower() for x in
    ['chip', 'peak', 'narrowpeak', 'broadpeak', 'histone'])]
atacseq_files = [f for f in all_files if any(x in f.lower() for x in
    ['atac', 'accessibility', 'openChromatin', 'dnase'])]
bed_files = [f for f in all_files if f.endswith(('.bed', '.bed.gz', '.narrowPeak', '.broadPeak'))]
clinical_files = [f for f in all_files if any(x in f.lower() for x in
    ['clinical', 'patient', 'sample', 'metadata', 'phenotype', 'survival'])]
expression_files = [f for f in all_files if any(x in f.lower() for x in
    ['express', 'rnaseq', 'fpkm', 'tpm', 'counts', 'transcriptom'])]
manifest_files = [f for f in all_files if any(x in f.lower() for x in
    ['manifest', 'annotation', 'probe', 'platform'])]

Extract parameters from question:

Parameter	Default	Example
Significance threshold	0.05	"padj < 0.05", "FDR < 0.01"
Beta difference threshold	0	"
Variance filter	None	"variance > 0.01", "top 5000 most variable"
Chromosome filter	All	"chromosome 17", "autosomes only"
Genome build	hg38	"hg19", "GRCh37", "mm10"

Decision Tree:

METHYLATION data? -> Phase 1 (load, filter, differential methylation, age CpG, density)
CHIP-SEQ data?    -> Phase 2 (load BED, peak stats, annotation, overlap)
ATAC-SEQ data?    -> Phase 3 (open chromatin, NFR analysis)
MULTI-OMICS?      -> Phase 4 (expression-methylation correlation)
CLINICAL?         -> Phase 5 (missing data, complete cases)
ANNOTATION?       -> Phase 6 (ToolUniverse regulatory annotation)
GENOME-WIDE?      -> Phase 7 (chromosome density, ratios)

Phase 1: Methylation Data Processing

Core functions (copy-paste ready):

def load_methylation_data(file_path):
    """Load methylation beta/M-value matrix. Rows=probes, Cols=samples."""
    ext = os.path.splitext(file_path)[1].lower()
    if ext in ['.csv']: return pd.read_csv(file_path, index_col=0)
    elif ext in ['.tsv', '.txt']: return pd.read_csv(file_path, sep='\t', index_col=0)
    elif ext in ['.parquet']: return pd.read_parquet(file_path)
    else:
        try: return pd.read_csv(file_path, sep='\t', index_col=0)
        except: return pd.read_csv(file_path, index_col=0)

def normalize_chromosome(chrom):
    if chrom is None or pd.isna(chrom): return None
    chrom = str(chrom).strip()
    return chrom if chrom.startswith('chr') else 'chr' + chrom

def differential_methylation(beta_df, group1_samples, group2_samples,
                              test='ttest', correction='fdr_bh', alpha=0.05):
    """DMP analysis between two groups. Returns DataFrame with padj, delta_beta."""
    g1, g2 = beta_df[group1_samples], beta_df[group2_samples]
    results = []
    for probe in beta_df.index:
        vals1, vals2 = g1.loc[probe].dropna().values, g2.loc[probe].dropna().values
        if len(vals1) < 2 or len(vals2) < 2:
            results.append({'probe': probe, 'mean_g1': np.nan, 'mean_g2': np.nan,
                'delta_beta': np.nan, 'pvalue': np.nan}); continue
        mean1, mean2 = np.nanmean(vals1), np.nanmean(vals2)
        if test == 'ttest': _, pval = stats.ttest_ind(vals1, vals2, equal_var=False)
        elif test == 'wilcoxon': _, pval = stats.mannwhitneyu(vals1, vals2, alternative='two-sided')
        else: _, pval = stats.ttest_ind(vals1, vals2, equal_var=False)
        results.append({'probe': probe, 'mean_g1': mean1, 'mean_g2': mean2,
            'delta_beta': mean2 - mean1, 'pvalue': pval})
    result_df = pd.DataFrame(results).set_index('probe')
    valid = result_df['pvalue'].dropna()
    if len(valid) > 0:
        _, padj, _, _ = mt.multipletests(valid.values, alpha=alpha, method=correction)
        result_df.loc[valid.index, 'padj'] = padj
    else: result_df['padj'] = np.nan
    return result_df

def chromosome_cpg_density(cpg_probes, manifest, genome='hg38'):
    """CpG density per chromosome. Returns DataFrame with chr, n_cpgs, density_per_bp."""
    chr_lengths = get_chromosome_lengths(genome)
    probe_id_col = 'probe_id' if 'probe_id' in manifest.columns else manifest.columns[0]
    probe_chr = manifest.set_index(probe_id_col) if probe_id_col in manifest.columns else manifest
    chr_col = 'chr' if 'chr' in probe_chr.columns else 'CHR'
    probe_chrs = probe_chr.loc[probe_chr.index.isin(cpg_probes), chr_col].apply(normalize_chromosome)
    chr_counts = probe_chrs.value_counts()
    results = []
    for chrom, count in chr_counts.items():
        if chrom in chr_lengths:
            length = chr_lengths[chrom]
            results.append({'chr': chrom, 'n_cpgs': count, 'chr_length': length,
                'density_per_bp': count / length, 'density_per_mb': count / length * 1e6})
    return pd.DataFrame(results).sort_values('chr',
        key=lambda x: x.str.replace('chr','').replace({'X':'23','Y':'24'}).astype(int))

Genome chromosome lengths (hg38/hg19/mm10): see get_chromosome_lengths() in REFERENCE.md.

Phase 2-3: ChIP-seq / ATAC-seq (summary)

def load_bed_file(file_path, format='auto'):
    """Load BED/narrowPeak/broadPeak. Returns DataFrame with chrom, start, end, ..."""
    # Auto-detect format from extension, read TSV, normalize chromosomes
    # Full implementation in REFERENCE.md

def peak_statistics(peaks_df):
    """Return dict: total_peaks, mean/median_peak_length, total_coverage_bp, per-chromosome counts."""

def find_overlaps(peaks_a, peaks_b, min_overlap=1):
    """Pure-Python interval overlap between two BED DataFrames."""

def annotate_peaks_to_genes(peaks_df, gene_annotation, tss_upstream=2000):
    """Annotate peaks to nearest gene: promoter/gene_body/proximal/distal."""

Full implementations with all edge cases are in REFERENCE.md.

Phase 4-5: Multi-Omics & Clinical Integration (summary)

def correlate_methylation_expression(beta_df, expression_df, probe_gene_map, method='pearson'):
    """Correlate methylation with expression per probe-gene pair. Returns corr + padj."""

def missing_data_analysis(clinical_df=None, expression_df=None, methylation_df=None):
    """Count patients with complete data across modalities."""

def find_complete_cases(data_frames, variables=None):
    """Find sample IDs present in ALL provided DataFrames with no missing values."""

Full implementations in REFERENCE.md.

Phase 6: ToolUniverse Annotation Tools

Tool	Key Parameters	Returns
`ensembl_lookup_gene`	`id`, `species='homo_sapiens'` (REQUIRED)	gene coords, biotype
`ensembl_get_regulatory_features`	`region` (NO "chr" prefix!), `feature`, `species`	regulatory features
`SCREEN_get_regulatory_elements`	`gene_name`, `element_type`, `limit`	cCREs (enhancers, promoters)
`ChIPAtlas_get_experiments`	`operation='get_experiment_list'` (REQUIRED), `genome`, `antigen`	experiment list
`ReMap_get_transcription_factor_binding`	`gene_name`, `cell_type`, `limit`	TF binding sites
`RegulomeDB_query_variant`	`rsid`	regulatory score
`jaspar_search_matrices`	`search`, `collection`, `species`	TF matrices
`ENCODE_search_experiments`	`assay_title`, `target`, `organism`, `limit`	experiment metadata

CRITICAL: Ensembl region format is "17:start-end" (NO "chr" prefix). ChIPAtlas/FourDN ALL require operation parameter.

Common Use Patterns

Methylation array: Load beta matrix + manifest → filter CpGs → define groups → differential_methylation() → apply thresholds → report DMPs

Age-related CpG density: Load beta + ages → correlate per probe → filter significant → map to chromosomes → chromosome_cpg_density() → compute ratios

Multi-omics missing data: Load clinical + expression + methylation → extract sample IDs → find intersection → check NaN in clinical vars → report complete cases

ChIP-seq annotation: Load BED → load gene annotation → annotate_peaks_to_genes() → classify_peak_regions() → report fractions

Methylation-expression: Load both matrices → build probe-gene map → align samples → correlate_methylation_expression() → report anti-correlations

Fallback Strategies

Scenario	Fallback
No manifest file	Build minimal from Ensembl lookup
No pybedtools	Pure Python overlap (pandas intervals)
Low sample count	Non-parametric test (Wilcoxon)
Large dataset (>500K probes)	Pre-filter by variance, chunk processing
Sample ID mismatches	Try truncating TCGA barcodes, partial matching

Extended Reference: Full code implementations for all phases (including get_chromosome_lengths(), load_bed_file(), peak overlap, multi-omics integration, edge cases, and limitations) are in REFERENCE.md.