name: h5ad-shiny-data-pipeline description: "Patterns for H5AD-backed Shiny apps with Excel fallback, groovy data in .uns, scVI QC validation, Leiden clustering merge, per-donor tissue extraction" author: smith6jt date: 2026-02-20
H5AD → Shiny Data Pipeline - Research Notes
Experiment Overview
| Item | Details |
|---|---|
| Date | 2026-02-17 |
| Goal | Replace scattered Excel data loading with a single validated H5AD pipeline, while maintaining backward compatibility with existing Excel-based loading |
| Environment | R 4.x, Shiny 1.12.1, anndata (R pkg), Python: scanpy 1.11.5, anndata 0.12.4, scvi-tools 1.4.0, scib-metrics, conda env scvi-env |
| Status | Success |
Context
The Islet Explorer app loaded data from master_results.xlsx (4 sheets: targets, markers, composition, LGALS3), built from per-donor TSV exports. The trajectory tab loaded a separate adata_ins_root.h5ad that was produced by a problematic pipeline (mixed morphology+protein features, weak n_neighbors=5, missing batch correction). Phase 2 rebuilt the pipeline and unified data loading.
Key problems with the old pipeline:
.Xcontained 47 variables (proteins + morphology mixed) — morphology dominated trajectory- n_neighbors=5 (non-standard, too sparse)
- No
X_scVI_mean(no batch correction in embeddings) - Missing phenotype_proportions in
.obsm - 5,158 islets (combined core+peri regions)
- Donor 6533 potentially missing
Verified Workflow
1. Store groovy/TSV tabular data in H5AD .uns
H5AD's .uns dict can store arbitrary arrays. For tabular data (like QuPath TSV exports), store each column as a separate array with a naming convention:
# Python: scripts/build_h5ad_for_app.py
for col in targets_df.columns:
vals = targets_df[col].values
key = f'groovy_targets_{col}'
if pd.api.types.is_bool_dtype(vals):
adata.uns[key] = vals.astype(bool)
elif pd.api.types.is_numeric_dtype(vals):
adata.uns[key] = np.array(vals, dtype=float)
else:
adata.uns[key] = np.array(vals, dtype=str)
# Also store column names and row count for reconstruction
adata.uns['groovy_targets_columns'] = list(targets_df.columns)
adata.uns['groovy_targets_n_rows'] = len(targets_df)
# R: Reconstruct DataFrame from .uns arrays
reconstruct_groovy_df <- function(ad, sheet) {
prefix <- paste0("groovy_", sheet, "_")
cols_key <- paste0("groovy_", sheet, "_columns")
col_names <- ad$uns[[cols_key]]
n_rows <- as.integer(ad$uns[[paste0("groovy_", sheet, "_n_rows")]])
df <- data.frame(row.names = seq_len(n_rows))
for (col in col_names) {
df[[col]] <- ad$uns[[paste0(prefix, col)]]
}
df
}
1b. CRITICAL: Cache ad$uns before column access (Mar 2026 optimization)
Every ad$uns[[key]] call crosses the reticulate Python→R bridge (~221ms each). With 62+ groovy columns × 3 sheets, this creates a 16-second bottleneck. Fix: cache the entire .uns dict as an R list in a single bridge crossing:
# WRONG (16s): 186 individual bridge crossings
targets <- reconstruct_groovy_df(ad, "targets") # calls ad$uns[[key]] 62 times
# RIGHT (0.2s): single bridge crossing, then R list indexing
uns <- ad$uns # one crossing, returns full R list
targets <- reconstruct_groovy_df_from_list(uns, "targets") # indexes R list (instant)
reconstruct_groovy_df_from_list <- function(uns_list, sheet) {
prefix <- paste0("groovy_", sheet, "_")
col_names <- uns_list[[paste0("groovy_", sheet, "_columns")]]
n_rows <- as.integer(uns_list[[paste0("groovy_", sheet, "_n_rows")]])
if (is.null(col_names) || n_rows == 0) return(NULL)
df <- data.frame(row.names = seq_len(n_rows))
for (col in col_names) df[[col]] <- uns_list[[paste0(prefix, col)]]
if ("Case ID" %in% names(df)) df[["Case ID"]] <- suppressWarnings(as.integer(df[["Case ID"]]))
df
}
This applies to ANY reticulate accessor: ad$obs[[col]], ad$var[[col]], ad$obsm[["key"]]. Always cache the parent object as an R object first, then index from the R side.
2. H5AD → Excel fallback in Shiny data loading
Design the H5AD loader to return the same list structure as the existing Excel loader. This means prep_data() needs zero changes.
# In data_loading.R
load_master_auto <- function(h5ad_path = NULL, excel_path = master_path) {
if (!is.null(h5ad_path) && file.exists(h5ad_path)) {
result <- load_master_h5ad(h5ad_path)
if (!is.null(result)) return(result)
}
if (file.exists(excel_path)) return(load_master(excel_path))
stop("No data source found")
}
# In app.R server
master <- reactive({
load_master_auto(h5ad_path = h5ad_path, excel_path = master_path)
})
# prepared() calls prep_data(master()) — unchanged
3. scVI batch correction validation
Validate with three complementary metrics on a subsample (50K cells for efficiency):
# Silhouette batch score (lower = better mixing)
sil_pca = silhouette_score(X_pca, batch_labels, sample_size=10000)
sil_scvi = silhouette_score(X_scVI, batch_labels, sample_size=10000)
# Pass: sil_scvi < sil_pca
# Cell-type silhouette (higher = better separation, preserved biology)
sil_ct_scvi = silhouette_score(X_scVI, celltype_labels, sample_size=10000)
# Pass: sil_ct_scvi >= 0.8 * sil_ct_pca
# LISI (custom implementation using k-NN + inverse Simpson's index)
# Batch LISI: higher = better mixing (ideal = n_batches)
# Cell-type LISI: lower = better separation (ideal = 1)
3b. scVI covariate validation (2026-02-23)
The canonical model uses categorical_covariate_keys=["Age", "Gender"] with NO batch_key. Since each donor has a unique Age value, this effectively provides soft donor correction. A full retraining comparison confirmed the original model wins 4/5 key metrics vs a no-covariate alternative (donor silhouette: -0.04 vs +0.25; PC0 pseudotime r: 0.71 vs 0.10). See the dedicated scvi-covariate-validation skill for full methodology, metrics, and failed attempts.
4. Neighborhood metrics in .obs (Phase 7, 2026-02-19)
Store per-islet neighborhood metrics as additional .obs columns rather than .uns:
# scripts/compute_neighborhood_metrics.py → data/neighborhood_metrics.csv
# Then merged into adata.obs by build_h5ad_for_app.py (step 4.5)
nbr = pd.read_csv(nbr_path)
adata.obs['_combined_islet_id'] = (
adata.obs['imageid'].astype(str) + '_' + adata.obs['base_islet_id'].astype(str)
)
nbr_merge = nbr.set_index('combined_islet_id')[nbr_cols]
adata.obs = adata.obs.join(nbr_merge, on='_combined_islet_id', how='left')
62 columns across 4 categories: peri-islet composition (peri_prop_*, peri_count_*), immune infiltration (immune_frac_*, cd8_to_macro_ratio, tcell_density_peri), enrichment z-scores (enrich_z_*), and distance metrics (min_dist_*). Extract on the R side with grep("^peri_prop_|^immune_|...", colnames(obs)).
5. Per-islet cell CSVs for drill-down (Phase 8, 2026-02-19)
Extract individual cells per qualified islet into small CSV files for client-side rendering:
# scripts/extract_per_islet_cells.py → data/cells/{imageid}_Islet_{N}.csv
# 949 files, 180K cells, 37 cols (X/Y, phenotype, region, morphology, 31 markers)
# Read backed H5AD in chunks of 50K cells for memory efficiency
adata_sc = ad.read_h5ad(sc_path, backed='r')
chunk_size = 50000
for start in range(0, adata_sc.n_obs, chunk_size):
chunk = adata_sc[start:end].to_memory()
# ... filter, extract, write per-islet CSVs
On the R side, use environment-based caching (drilldown_cache <- new.env()) to avoid re-reading CSVs on repeated clicks.
6. Trajectory rebuild from fixed pipeline
The fixed aggregation keeps only proteins in .X and uses X_scVI_mean for neighbors:
# Aggregate with fixed_islet_aggregation.py
adata = create_islet_dataset_fixed(adata_sc, region='islet_only', min_cells=20)
# .X = protein expression only (31 vars)
# .obsm['X_scVI_mean'] = batch-corrected latent means
# .obsm['phenotype_proportions'] = cell type fractions
# Compute trajectory
sc.pp.neighbors(adata, n_neighbors=15, use_rep='X_scVI_mean', metric='cosine')
sc.tl.paga(adata, groups='donor_status')
sc.tl.umap(adata, init_pos='paga')
# Root: ND islet with highest INS expression
sc.tl.diffmap(adata, n_comps=10)
sc.tl.dpt(adata)
Validation checks:
- INS vs pseudotime: Spearman r < -0.3
- GCG vs pseudotime: Spearman r > 0.2
- Donor status ordering: mean(ND) < mean(Aab+) < mean(T1D)
- No single donor > 80% of any pseudotime quintile
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
Old pipeline: mixing morphology + protein in .X (47 vars) |
Morphology features (cell area, nucleus size) dominated the neighbor graph and trajectory, masking biological signal from protein markers | Always keep .X as protein-only; store morphology in .obsm['morphology'] |
| Old pipeline: n_neighbors=5 | Too few neighbors created disconnected/noisy graphs, especially with <1000 islets | Use n_neighbors=15 (standard) for islet-level data; 5 is only appropriate for very dense single-cell data |
| Old pipeline: PCA-based neighbors (no batch correction) | Donor-specific technical effects drove the UMAP clustering, obscuring disease biology | Use X_scVI_mean with cosine metric for batch-corrected neighbor computation |
Storing DataFrames directly in .uns |
H5AD serialization doesn't handle pandas DataFrames well in .uns (type conversion issues) |
Store each column as a separate numpy array with a naming convention prefix; reconstruct on the R side |
Per-key ad$uns[[col]] in R loop |
Each ad$uns[[key]] call crosses reticulate Python→R bridge (~221ms). With 62 cols × 3 sheets = 186 calls = 16 seconds. Profiled as the #1 startup bottleneck |
Cache uns <- ad$uns as R list (0.2s single crossing), then index from R side. Applies to ALL reticulate accessors (ad$obs, ad$var, ad$obsm) |
reticulate::py$adata <- ad in source(f, local=TRUE) |
py is a special active binding that fails with object 'reticulate' not found when sourced locally. Even requireNamespace("reticulate") doesn't help — py$ needs the package on the search path |
Avoid reticulate::py$ entirely for this use case. Cache R objects (ad$uns, ad$obs) instead of round-tripping through Python namespace |
<<- for lazy-load flag across source() boundaries |
<<- assigns to the defining environment, which may not match the reading environment when files are sourced with different local args |
Use new.env(parent=emptyenv()) with $loaded flag for lazy state. Environment mutation is scope-independent |
| Defining slash commands in CLAUDE.md | Claude Code only discovers skills from .claude/skills/<name>/SKILL.md files — CLAUDE.md is loaded as documentation context only, NOT for skill registration |
Always create skill files in .claude/skills/; git submodule CLAUDE.md files are not searched for skills |
adata.obs.reset_index() when index name matches column |
islets_core_fixed.h5ad has islet_id as both the index name AND a column → pandas raises ValueError: cannot insert islet_id, already exists |
Check if idx_name in adata.obs.columns: reset_index(drop=True) before any merge that uses reset_index |
fillna('') on categorical columns |
h5py can't write NaN in string columns, but .fillna('') on a Categorical raises TypeError: Cannot setitem on a Categorical with a new category |
Always .astype(str) BEFORE .fillna('') for categorical obs columns |
Using CODEX_Pancreas_Donors.xlsx for donor metadata |
The Excel file has a different donor cohort (nPOD pilot: 6090, 6171...) than the CODEX data (6505, 6533...) — ALL merges produce NaN | Derive donor metadata (status, age, gender, AAb flags) from the h5ad obs itself, not from external donor key files. Build donor_key_df from adata.obs.groupby('imageid') |
Missing combined_islet_id in rebuilt h5ad |
islets_core_fixed.h5ad has islet_id (e.g., 6505_Islet_284) but trajectory module references combined_islet_id |
Add adata.obs['combined_islet_id'] = adata.obs['islet_id'].copy() before saving trajectory h5ad |
Heatmap mids vs hm row count mismatch |
mids computed for all 25 bins but hm only has rows for bins with >=3 points → replacement has 25 rows, data has 24 |
Use match(as.character(hm$pt_bin), bin_levels) to index all_mids so skipped bins are excluded |
| Diverging colormap for raw expression on UMAP | scale_color_gradient2(limits=c(-3,3)) assumed z-scored data but .X values are raw mean expression — colorbar misleading |
Use scale_color_viridis_c(option="inferno", limits=range(value)) for continuous min/max scaling |
| Reading full 3.7 GB single-cell H5AD into memory | Python OOM on 16 GB machine. Even backed='r' loads .obs eagerly (~800 MB) |
Use backed='r' AND read .X in chunks of 50K cells. Pre-filter .obs to relevant cells before materializing expression matrix |
Storing neighborhood metrics in .uns |
.uns column-by-column convention works for tabular data but is cumbersome for 62+ columns that are per-observation |
Use .obs for per-observation metrics (1 value per islet). Reserve .uns for tabular data that doesn't map 1:1 to observations (like groovy sheets with different row counts) |
Column name sanitization: spaces and + in phenotype names |
peri_prop_CD8a+ T cell breaks R column access and H5AD string handling |
Sanitize during Python preprocessing: col.replace(' ', '_').replace('+', 'plus') → peri_prop_CD8aplus_T_cell. Do NOT sanitize at load time in R |
| 66 islets with zero peri-islet cells | Some islets have no _exp20um cells in the single-cell H5AD. Setting total_cells_peri=0 caused division-by-zero in fraction metrics |
Set total_cells_peri=0 but use NaN for fraction/ratio metrics. In R, always use na.rm=TRUE and check !is.na() in conditional filters |
Final Parameters
# Islet aggregation (fixed_islet_aggregation.py)
region: islet_only
min_cells: 20
require_paired: false # Don't require peri-islet for trajectory
# Neighbor computation
n_neighbors: 15
use_rep: X_scVI_mean
metric: cosine
# UMAP
init_pos: paga
min_dist: 0.1
spread: 1.5
# DPT
n_comps: 10 # diffusion map components
root: ND islet with max INS expression
# scVI QC thresholds
silhouette_batch: scVI < PCA
silhouette_celltype: scVI >= 0.8 * PCA
batch_lisi: median(scVI) > median(PCA)
# Python environment
conda_env: scvi-env
python_packages: scanpy==1.11.5, anndata==0.12.4, scvi-tools==1.4.0, scib-metrics
Key Insights
- Interface stability: Design H5AD loading to return identical structures as existing Excel loading. This means downstream code (prep_data, modules) needs zero changes. The fallback pattern (
load_master_auto) is clean and testable. .unsfor tabular data: H5AD.unsis a flexible dict, but column-by-column storage with a naming convention (groovy_targets_colname) is more robust than trying to store entire DataFrames.- Subsample for QC metrics: scVI validation doesn't need all 2.6M cells. 50K cells (or 10K for LISI) gives reliable metrics in seconds vs minutes.
- PAGA-initialized UMAP:
init_pos='paga'gives more meaningful UMAP topology than random initialization, especially for trajectory data where group connectivity matters. - Validation gates: Define quantitative pass/fail criteria (r < -0.3 for INS, etc.) before running the pipeline. This prevents post-hoc rationalization of poor trajectories.
- Claude Code skills:
.claude/skills/<name>/SKILL.mdis the ONLY mechanism for registering slash commands. CLAUDE.md, git submodules, and plugin.json files are NOT discovered as skills. - Donor metadata provenance: Never assume an external Excel "donor key" matches your data cohort. Build donor metadata from the canonical h5ad obs itself — it already has
donor_status,age,gender, and AAb flags from the single-cell phenotyping. - Column name compatibility: When h5ad obs columns get renamed across pipeline steps (e.g.,
islet_id→combined_islet_id), add the expected alias before saving. Check what downstream consumers expect by grepping the app code. - H5AD write gotchas: Before
adata.write_h5ad(), iterate all obs columns and convert categoricals →strthenfillna(''). Also check that obs index name doesn't collide with a column name. .obsvs.unsdecision rule: Use.obsfor metrics with 1 value per observation (neighborhood metrics, phenotype proportions). Use.unsfor tabular data with different row counts (groovy exports withCase ID+islet_keyas join keys).- Backed mode + chunking: For 3.7 GB single-cell H5ADs,
backed='r'keeps.Xon disk. But.obsis still loaded eagerly. Read.Xin chunks of 50K cells to avoid OOM. - Column name sanitization pipeline: Sanitize in Python (
space→_, +→plus) during preprocessing, NOT at R load time. This ensures H5AD column names are valid R identifiers. Pattern:col.replace(' ', '_').replace('+', 'plus'). - Graceful NaN handling for missing peri data: 66/1,015 islets lack peri-islet cells. Use
NaNfor fraction metrics,0for count metrics. R side:na.rm=TRUEeverywhere, conditional UI that checks!all(is.na(x)).
Execution Results (2026-02-17)
| Metric | Value |
|---|---|
| scVI batch LISI | 2.73 (PCA) → 8.07 (scVI) |
| INS vs pseudotime | r = -0.741 |
| GCG vs pseudotime | r = 0.379 |
| Donor ordering | ND=0.491 < Aab+=0.516 < T1D=0.743 |
| Max donor quintile fraction | 0.38 (threshold: <0.80) |
| Islets | 1,015 (31 protein vars) |
| islet_explorer.h5ad | 47 MB |
| H5AD vs Excel prep_data | targets=48,438, markers=64,584, comp=5,382 (identical) |
| Total validation checks | 33/33 passed |
Performance Optimization Results (2026-03-05)
| Metric | Before | After |
|---|---|---|
reconstruct_groovy_df() × 3 |
16s (186 bridge crossings) | 0.2s (1 crossing + R list) |
annotations.tsv eager load |
3s (72 MB TSV) | 0s (deferred, never loaded) |
| File sourcing | ~4s | ~1.5s |
anndata::read_h5ad() |
1.3s | 1.3s (unchanged) |
prep_data() |
0.6s | 0.7s (unchanged) |
| Total to first plot | ~22s | ~3.5s |
7. Merging Leiden clustering from external H5AD (Phase 9, 2026-02-20)
Merge islet-level Leiden cluster assignments and UMAP coords from a separate clustering H5AD:
# scripts/build_h5ad_for_app.py — step 4.7
clust = ad.read_h5ad("islet_analysis/islets_core_clustered.h5ad")
leiden_cols = [c for c in clust.obs.columns if c.startswith("leiden_")]
for col in leiden_cols:
adata.obs[col] = clust.obs[col].astype(str).reindex(adata.obs.index)
# UMAP from .obsm['X_umap'] → .obs columns for easy R extraction
umap_df = pd.DataFrame(clust.obsm['X_umap'], index=clust.obs.index,
columns=['leiden_umap_1', 'leiden_umap_2'])
for col in umap_df.columns:
adata.obs[col] = umap_df[col].reindex(adata.obs.index)
Key: both H5ADs indexed by islet_id, so reindex() aligns correctly. Categorical Leiden columns → .astype(str) for h5py compatibility.
8. Per-donor tissue extraction for Spatial tab (Phase 9, 2026-02-20)
Extract ALL cells per donor (core + peri + tissue background) for whole-tissue scatter visualization:
# scripts/extract_per_donor_tissue.py → data/donors/{imageid}.csv
# 15 files, 2.65M total cells, ~78 MB
# Only 5 columns: X_centroid, Y_centroid, phenotype, cell_region, islet_name
# cell_region: "core" (Islet_N), "peri" (Islet_N_exp20um), "tissue" (all others)
obs = sc.obs[["imageid", "Parent", "phenotype", "X_centroid", "Y_centroid"]].copy()
# Parse Parent → cell_region + islet_name
# Group by imageid, write per-donor CSVs
Design choices:
- No expression data (31 markers) — reduces 450+ MB → 78 MB. App only needs spatial coords + metadata for scatter.
- Not backed-mode since we only read
.obs(no.Xaccess needed) — fast enough. - Donor 6533 has 0 core/peri cells (205K tissue-only) — no islet annotations in that sample.
Execution Results — Phase 7-8 (2026-02-19)
| Metric | Value |
|---|---|
| Neighborhood metrics CSV | 1,015 rows × 62 columns |
| Islets with peri data | 949/1,015 (93.5%) |
| Biological signal (immune_frac_peri) | T1D=0.155, Aab+=0.106, ND=0.069 |
| Per-islet cell CSVs | 949 files, 180,019 cells, 110.8 MB |
| Columns per cell CSV | 37 (X/Y, phenotype, region, morphology, 31 markers) |
| Rebuilt islet_explorer.h5ad | 47.6 MB (+60 .obs columns) |
| R syntax check | All files pass parse() |
| App smoke test | Loads without errors |
Execution Results — Phase 9 (2026-02-20)
| Metric | Value |
|---|---|
| Per-donor tissue CSVs | 15 files, 2,649,654 total cells, 77.8 MB |
| Avg cells/donor | 176,644 (range: 131K-218K) |
| Leiden columns merged | 4 (leiden_0.3: 4 clusters, leiden_0.5: 7, leiden_0.8: 12, leiden_1.0: 14) |
| UMAP coords added | leiden_umap_1, leiden_umap_2 |
| Rebuilt islet_explorer.h5ad | 47.6 MB (+6 leiden-related .obs columns) |
| Donor 6533 (edge case) | 205K cells but 0 core, 0 peri (tissue background only) |
| R syntax check | All files parse successfully |
| App HTTP 200 | Confirmed after worker restart |
References
- AnnData format —
.unsfor arbitrary metadata,.obsfor per-observation metrics - scVI-tools — Batch correction with biological covariates
- scib-metrics — Batch integration benchmarking
- Scanpy trajectory analysis — DPT, PAGA, diffusion maps
- scipy.spatial.cKDTree — Efficient nearest-neighbor distance computation for immune cell distances
- Islet Explorer:
data/DATA_PROVENANCE.mdfor full lineage documentation