name: bio-hi-c-analysis-matrix-operations description: Balances Hi-C contact matrices (ICE via cooler.balance_cooler, KR/SCALE/VC context), computes distance-decay expected with cooltools (expected_cis per-diagonal P(s), expected_trans scalar), builds observed/expected (O/E) matrices, and diagnoses polymer state from the P(s) log-derivative. Covers the within-matrix-vs-cross-sample distinction (balancing is NOT a normalizer), the equal-visibility assumption that CNV/aneuploidy violates (use raw counts for copy-number), cis-only balancing, mad_max/blacklist masking before balancing, multiplicative cooler weights vs divisive juicer weights, and the resolution-vs-depth budget. Use when balancing a .cool/.mcool, computing expected or P(s), making O/E matrices for compartments/loops, deciding ICE vs KR vs SCALE, choosing a resolution for a given depth, or troubleshooting NaN/all-NaN balanced matrices; route cross-sample comparison to hic-differential. tool_type: python primary_tool: cooler
Version Compatibility
Reference examples tested with: cooler 0.10+, cooltools 0.7+, bioframe 0.7+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
pip show <package>thenhelp(module.function)to check signatures - CLI:
<tool> --versionthen<tool> --helpto confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
cooltools standardized its API around 0.7 (functions take a view_df viewframe; expected_cis defaults to smooth=True, aggregate_smoothed=True). .mcool is multi-resolution: analysis functions take a single-resolution URI (file.mcool::/resolutions/10000), never the bare .mcool. A matrix must be balanced (a stored weight column) before O/E, compartments, insulation, or dots; clr.matrix(balance=True) on an unbalanced cooler returns all-NaN.
Hi-C Matrix Operations
"Make the pixels of my Hi-C matrix comparable to each other." -> Balance (remove per-bin coverage bias under equal-visibility), then divide by distance-matched expected (remove the polymer P(s) background) to get O/E.
- Python:
cooler.balance_cooler(clr, cis_only=True, store=True), thencooltools.expected_cis(clr)and divide observed by per-diagonal expected.
The Single Most Important Modern Insight -- Balancing Makes ONE Matrix Self-Consistent; It Does NOT Make Two Matrices Comparable
Balancing (ICE/KR) is a within-matrix operation: it solves for per-bin bias weights so every bin has equal genome-wide visibility, making a single map internally consistent. It does nothing to relate map A to map B. Two balanced matrices at different sequencing depth still differ in absolute magnitude, dynamic range, and noise floor -- and rescale_marginals makes the absolute balanced values arbitrary-scaled anyway. "I balanced both, now I'll subtract/log2-ratio them" is the single most common error in the field: the depth difference is read as biology. Cross-sample comparison requires downsampling to equal valid-pair count, distance-matched O/E, and a replicate-aware differential tool (multiHiCcompare, HiCcompare loess-over-distance, dcHiC) -- route to hic-differential.
Two corollaries that flow from the same equal-visibility model:
- CNV silently breaks balancing. The premise is that every bin should make the same number of contacts; any deviation is technical bias. That is true for a diploid uniform-copy genome and FALSE for tumors/aneuploids -- a 3-copy region genuinely contacts ~3x more. ICE forces equal marginals and ERASES that real copy-number, then redistributes it perversely (post-ICE high-copy regions go cis-depleted, trans-enriched; Servant 2018). Use raw counts for CNV/SV calling (coverage IS the signal -> copy-number); plain ICE/KR on an aneuploid is a category error (use CNV-aware LOIC/CAIC for 3D structure).
- A balanced cis map is still dominated by distance-decay. The A/B plaid and focal loops are a faint modulation under the P(s) background. Dividing by distance-matched expected (O/E) before eigendecomposition is mandatory, or the top eigenvector is just the decay curve, not compartments.
Normalization-Method Taxonomy
| Method | What it does | Mechanism | When |
|---|---|---|---|
| ICE (cooler native) | true matrix balancing | iterative proportional fitting (Sinkhorn); equalizes all marginals | default; robust, converges on sparse/low-depth where KR fails |
| KR (juicer) | true matrix balancing | Knight-Ruiz Newton solver; SAME fixed point as ICE | fast (few iterations); fails to converge on sparse/high-res maps |
| SCALE (juicer) | true matrix balancing | modern KR-family solver, more robust | juicer's default; converges where KR diverges on sparse maps |
| VC / vanilla coverage | NOT true balancing | single pass: divide by row-coverage * col-coverage (one ICE iteration) | fast robust fallback; leaves residual bias |
| VC_SQRT | NOT true balancing | divide by sqrt of coverage product (gentler than VC) | very sparse data where full balancing overfits |
| LOIC / CAIC (Servant 2018) | CNV-aware balancing | condition on copy-number; LOIC keeps the CN effect, CAIC removes it | aneuploid/tumor genomes (plain ICE is wrong here) |
KR and ICE reach the same balanced map -- choose by convergence, not quality: KR is faster but blows up on sparse/low-depth/high-resolution matrices; ICE is the robust default; SCALE is the juicer-side answer when KR fails.
Decision Tree by Scenario
| Scenario | Recommended | Why |
|---|---|---|
| Need to balance a diploid map | cooler.balance_cooler(cis_only=True, store=True) (ICE) |
robust default; cis-only is the analysis convention |
| KR failed to converge (sparse/high-res) | fall back to ICE, or SCALE on the juicer side | same fixed point; ICE/SCALE are the robust solvers |
| Tumor / aneuploid genome, 3D structure | CNV-aware LOIC/CAIC (Servant 2018) | plain ICE erases real copy-number |
| CNV / SV calling from Hi-C | RAW counts (no balancing) | coverage is the signal -> copy-number |
| Compartments at 100kb-1Mb | balance -> expected_cis -> O/E -> Pearson -> eigenvector |
O/E removes P(s) so the plaid is visible -> compartment-analysis |
| Focal loops / TADs | balance -> expected_cis -> O/E |
local enrichment needs the distance background removed -> loop-calling, tad-detection |
| P(s) / polymer-state diagnostic | expected_cis(smooth=True) -> log-derivative |
the derivative reads out loop-extrusion machinery |
| Imported a juicer KR/VC weight column | check divisive_weights before applying |
cooler weights are multiplicative, juicer's are divisive |
| Compare two conditions | downsample to equal depth, then -> hic-differential | balancing is within-matrix, not a cross-sample normalizer |
Bare .mcool passed and KeyError |
use file.mcool::/resolutions/<bp> URI |
.mcool is a container of resolutions |
Balance a Matrix (ICE)
Goal: Remove one-dimensional per-bin coverage bias so every bin has equal genome-wide visibility within this single map.
Approach: Mask low-coverage and blacklisted bins FIRST (mad_max on log-marginals + explicit blacklist of centromere/rDNA/unmappable), drop the first two diagonals (ligation chemistry, not 3D contact), then run cis-only ICE; the multiplicative weight vector is stored in the weight column.
import cooler
clr = cooler.Cooler('matrix.mcool::/resolutions/10000')
bias, stats = cooler.balance_cooler(clr, cis_only=True, mad_max=5, ignore_diags=2, blacklist=None, store=True)
print('converged:', stats['converged'], 'scale:', stats['scale']) # stats also reports var, divisive_weights
clr = cooler.Cooler('matrix.mcool::/resolutions/10000') # re-open to see the stored weights
balanced = clr.matrix(balance=True).fetch('chr1') # raw[i,j] * w[i] * w[j]; masked bins -> NaN
cis_only=True is the convention for compartment/TAD/loop work -- trans signal is weak, noisy ambient ligation that pulls the bias estimates toward trans noise. ignore_diags=2 drops the main diagonal (self-ligation/dangling ends) and first off-diagonal (undigested/religated fragments): huge untrustworthy counts that would dominate the marginals. mad_max=5 filters bins whose log-marginal is >5 MAD below the median; without it a near-empty unmappable/centromeric bin gets a gigantic weight and ICE diverges. Masking (mad_max + blacklist) MUST precede balancing -- balancing cannot rescue a no-signal bin, it amplifies it.
CLI equivalent:
cooler balance --cis-only --mad-max 5 --ignore-diags 2 matrix.mcool::/resolutions/10000
Expected: cis P(s) Curve and trans Scalar
Goal: Build the distance-decay background (the denominator for O/E) and the P(s) curve for diagnostics.
Approach: cis expected is a per-diagonal curve (one value per separation s -- this IS P(s)); trans expected is a single scalar per chromosome-pair block (trans contacts are ~distance-independent). cooltools enforces the split with two functions.
import cooltools
import bioframe
clr = cooler.Cooler('matrix.mcool::/resolutions/10000')
view_df = bioframe.make_viewframe(clr.chromsizes) # whole-chromosome regions; or arms for acrocentric genomes
cvd = cooltools.expected_cis(clr, view_df=view_df, smooth=True, aggregate_smoothed=True, ignore_diags=2)
# columns include: region1, region2, dist, dist_bp, n_valid, count.avg, balanced.avg, balanced.avg.smoothed.agg
trans_exp = cooltools.expected_trans(clr, view_df=view_df) # one balanced.avg per region1-region2 block
smooth=True smooths P(s) in log10(distance) space (smooth_sigma=0.1). Pre-0.7.0 cooltools errored on raw smoothing (clr_weight_name=None, smooth=True, issue #456); that was fixed in 0.7.0, and raw smoothing now returns count.avg.smoothed. Regardless of version, balance first: a raw expected still carries per-bin coverage bias, so it is not a clean P(s)/O/E denominator.
Observed/Expected Matrix
Goal: Divide out the polymer distance-decay so enrichment (loops, plaid) stands above the local background.
Approach: Map the per-diagonal cis expected (balanced.avg, keyed by dist) onto a dense balanced matrix by diagonal offset -- vectorized with numpy diagonal indexing, NOT an O(n^2) Python loop.
import numpy as np
def oe_matrix(clr, region, cvd):
obs = clr.matrix(balance=True).fetch(region)
chrom = region.split(':')[0] if isinstance(region, str) else region[0]
exp = cvd[cvd['region1'] == chrom].set_index('dist')['balanced.avg']
exp_by_dist = exp.reindex(range(obs.shape[0])).to_numpy() # one expected per separation s
expected = exp_by_dist[np.abs(np.subtract.outer(np.arange(obs.shape[0]), np.arange(obs.shape[0])))]
return obs / expected # NaN where expected is NaN (masked diags)
oe = oe_matrix(clr, 'chr1', cvd)
log_oe = np.log2(oe) # symmetric around 0 for display
cooltools also ships cooltools.lib.numutils.observed_over_expected(matrix, mask) (returns a 4-tuple (OE, dist_bins, sum_pixels, n_pixels)) for a self-contained dense O/E without a precomputed cvd.
P(s) Log-Derivative -- the Polymer-State Diagnostic
Goal: Read out chromatin polymer state and loop-extrusion machinery from the shape of the contact-decay curve.
Approach: Take the slope of P(s) in log-log space; a reference slope near -1 over 0.1-1 Mb is the crumpled-globule background, and a loop-extrusion bump (~100kb interphase) appears as a peak in the derivative. Smooth in logspace FIRST or the derivative is pure noise.
agg = cvd[(cvd['region1'] == cvd['region2']) & (cvd['dist'] > 0)].drop_duplicates('dist_bp')
slope = np.gradient(np.log(agg['balanced.avg.smoothed.agg']), np.log(agg['dist_bp']))
# slope ~ -1 over 0.1-1 Mb; a bump toward 0 near ~100kb flags cohesin loop extrusion (flattens on WAPL/RAD21 loss)
Resolution-vs-Depth Budget
The achievable resolution is a function of depth and genome size, not a free choice. Rule of thumb: a bin needs ~1000 contacts to be reliably populated, and the number of bins scales as N^2 with the number of genomic bins -- so halving bin size quarters per-bin coverage. Coarse features are cheap, focal features are expensive:
| Feature | Resolution | Approximate depth (human) | Why |
|---|---|---|---|
| A/B compartments | 100kb-1Mb | tens of millions of valid pairs | chromosome-scale, few large bins -> cheap |
| TADs / insulation | 10-50kb | hundreds of millions | sub-Mb domains; window 5-25x the bin |
| Loops / dots | <=10kb | billions (Rao 2014 in-situ Hi-C ~ billions) | focal kb-scale pixels; coarse bins blur anchors |
Calling 10kb loops from a shallow library binned at 50kb is not a resolution choice -- there is no signal there. Choose the finest resolution where median per-bin contacts stay near ~1000.
Per-Method Failure Modes
Cross-sample subtraction of balanced matrices
Trigger: balancing two libraries then log2-ratioing/subtracting. Mechanism: balancing is within-matrix; balanced magnitude still scales with depth and rescale_marginals makes it arbitrary. Symptom: systematic genome-wide "differences" that track sequencing depth. Fix: downsample to equal valid pairs, compare O/E, use a replicate-aware tool -> hic-differential.
Plain ICE on an aneuploid / tumor
Trigger: balance_cooler on a genome with large copy-number swings. Mechanism: equal-visibility forces equal marginals, erasing the real ~CN-fold coverage. Symptom: high-copy regions look cis-depleted/trans-enriched (Servant 2018); CNV vanishes. Fix: raw counts for CNV calling; LOIC/CAIC for 3D structure.
Masking after (not before) balancing
Trigger: low-coverage centromere/rDNA/unmappable bins left in before ICE. Mechanism: a near-empty bin gets a gigantic bias weight. Symptom: ICE fails to converge, or stripe artifacts radiate from a few bins. Fix: set mad_max and pass blacklist; masking precedes balancing.
Eigendecomposition on a balanced (non-O/E) map
Trigger: compartment calling skips the expected/O/E step. Mechanism: the distance-decay dominates the balanced cis map. Symptom: top eigenvector is the P(s) curve, not the A/B plaid. Fix: divide by expected_cis (O/E) before correlating/eigendecomposing.
Crossing multiplicative and divisive weights
Trigger: applying an imported juicer KR/VC vector as if it were a cooler weight. Mechanism: cooler weights are multiplicative (rawww), juicer's are divisive (raw/w/w). Symptom: correction inverts -- high-bias bins get MORE extreme; nothing errors. Fix: check the weight column's divisive_weights attribute before applying.
O/E from a raw (unbalanced) expected
Trigger: computing P(s)/O/E from expected_cis(clr_weight_name=None). Mechanism: raw expected still carries per-bin coverage bias, so dividing by it does not cleanly remove the polymer background; pre-0.7.0 cooltools additionally errored on smooth=True with raw (issue #456, fixed in 0.7.0). Symptom: O/E still shows coverage stripes; on old cooltools, an error demanding balanced data. Fix: balance first, then expected_cis on the balanced weight column.
Quantitative Thresholds
| Threshold | Source | Rationale |
|---|---|---|
ignore_diags=2 |
cooler default; ICE convention | drops self-ligation/dangling (diag 0) + undigested/religated (diag 1); ligation chemistry, not 3D contact |
mad_max=5 |
cooler default | drops bins >5 MAD below median LOG-marginal; near-empty bins otherwise get exploding weights |
min_nnz=10 |
cooler default | <10 nonzero pixels per row is too sparse to estimate a bias reliably |
tol=1e-5, max_iters=200 |
cooler defaults | convergence = variance of balanced marginals < tol; non-convergence usually = a masking problem, not a tol problem |
smooth_sigma=0.1 |
cooltools default | Gaussian std in log10(distance) units for P(s) smoothing |
| ~1000 contacts/bin | depth-budget convention | per-bin coverage floor for a reliably populated bin; bins scale ~N^2 |
| Compartment res 100kb-1Mb | compartment scale | finer bins mix in TAD/loop structure |
| TAD/insulation res 10-50kb | domain scale | sub-Mb domains; window 5-25x the bin |
| Loop res <=10kb (needs ~billions of pairs) | Rao 2014 in-situ Hi-C | focal kb-scale; coarse bins blur loop anchors |
| P(s) slope ~ -1 over 0.1-1 Mb | crumpled/fractal globule | reference background; deviations/derivative read out polymer state |
Common Errors
| Error / symptom | Cause | Solution |
|---|---|---|
clr.matrix(balance=True) all NaN |
cooler not balanced | run cooler.balance_cooler(..., store=True) first |
| Empty / wrong-resolution result on .mcool | bare .mcool passed |
use file.mcool::/resolutions/<bp> URI |
| ICE not converging | low-coverage bins not masked | raise/set mad_max, pass blacklist; do not just raise max_iters |
expected_cis(smooth=True) errors on raw (pre-0.7.0 only) |
clr_weight_name=None on old cooltools (issue #456, fixed 0.7.0) |
upgrade to cooltools 0.7+, or balance first / smooth=False |
| O/E enrichment inverted on a tumor | ICE applied to an aneuploid | use raw counts / LOIC/CAIC; equal-visibility is violated |
| Imported weight makes bias worse | divisive juicer weight applied as multiplicative | check divisive_weights; reciprocate if needed |
| Cross-condition "difference" tracks depth | subtracting balanced matrices | downsample + O/E + replicate-aware test -> hic-differential |
References
- Imakaev et al. 2012 Nat Methods 9:999-1003 -- ICE iterative correction.
- Knight & Ruiz 2013 IMA J Numer Anal 33(3):1029-1047 -- KR matrix-balancing algorithm.
- Rao et al. 2014 Cell 159(7):1665-1680 -- in-situ Hi-C, KR norm, VC/VC_SQRT, kilobase loops (depth budget).
- Cournac et al. 2012 BMC Genomics 13:436 -- sequential/vanilla-coverage (SCN/VC) normalization.
- Servant et al. 2018 BMC Bioinformatics 19:313 -- CNV-aware normalization (LOIC/CAIC); ICE erases copy-number.
- Abdennur & Mirny 2020 Bioinformatics 36(1):311-316 -- cooler.
- Open2C, Abdennur et al. 2024 PLoS Comput Biol 20(5):e1012067 -- cooltools.
- Open2C, Abdennur et al. 2024 Bioinformatics 40(2):btae088 -- bioframe.
Related Skills
- hic-data-io - Load the cooler files this skill balances; divisive-vs-multiplicative weight naming
- compartment-analysis - Consumes the O/E this skill produces for eigenvector calling
- tad-detection - Insulation needs a cis-balanced matrix
- loop-calling - Dots need balanced + expected as prerequisites
- hic-differential - Cross-sample comparison; the right home for subtracting/ratioing conditions
- hic-visualization - Render balanced/O/E/log matrices
- copy-number/cnv-visualization - Raw-count CNV from Hi-C when balancing would erase copy-number
- genome-intervals/bigwig-tracks - Export the P(s)/expected or eigenvector as a bigWig