h5ad-shiny-data-pipeline

name: h5ad-shiny-data-pipeline description: "Patterns for H5AD-backed Shiny apps with Excel fallback, groovy data in .uns, scVI QC validation, Leiden clustering merge, per-donor tissue extraction" author: smith6jt date: 2026-02-20

H5AD → Shiny Data Pipeline - Research Notes

Experiment Overview

Item	Details
Date	2026-02-17
Goal	Replace scattered Excel data loading with a single validated H5AD pipeline, while maintaining backward compatibility with existing Excel-based loading
Environment	R 4.x, Shiny 1.12.1, anndata (R pkg), Python: scanpy 1.11.5, anndata 0.12.4, scvi-tools 1.4.0, scib-metrics, conda env `scvi-env`
Status	Success

Context

The Islet Explorer app loaded data from master_results.xlsx (4 sheets: targets, markers, composition, LGALS3), built from per-donor TSV exports. The trajectory tab loaded a separate adata_ins_root.h5ad that was produced by a problematic pipeline (mixed morphology+protein features, weak n_neighbors=5, missing batch correction). Phase 2 rebuilt the pipeline and unified data loading.

Key problems with the old pipeline:

.X contained 47 variables (proteins + morphology mixed) — morphology dominated trajectory
n_neighbors=5 (non-standard, too sparse)
No X_scVI_mean (no batch correction in embeddings)
Missing phenotype_proportions in .obsm
5,158 islets (combined core+peri regions)
Donor 6533 potentially missing

Verified Workflow

1. Store groovy/TSV tabular data in H5AD `.uns`

H5AD's .uns dict can store arbitrary arrays. For tabular data (like QuPath TSV exports), store each column as a separate array with a naming convention:

# Python: scripts/build_h5ad_for_app.py
for col in targets_df.columns:
    vals = targets_df[col].values
    key = f'groovy_targets_{col}'
    if pd.api.types.is_bool_dtype(vals):
        adata.uns[key] = vals.astype(bool)
    elif pd.api.types.is_numeric_dtype(vals):
        adata.uns[key] = np.array(vals, dtype=float)
    else:
        adata.uns[key] = np.array(vals, dtype=str)

# Also store column names and row count for reconstruction
adata.uns['groovy_targets_columns'] = list(targets_df.columns)
adata.uns['groovy_targets_n_rows'] = len(targets_df)

# R: Reconstruct DataFrame from .uns arrays
reconstruct_groovy_df <- function(ad, sheet) {
  prefix <- paste0("groovy_", sheet, "_")
  cols_key <- paste0("groovy_", sheet, "_columns")
  col_names <- ad$uns[[cols_key]]
  n_rows <- as.integer(ad$uns[[paste0("groovy_", sheet, "_n_rows")]])
  df <- data.frame(row.names = seq_len(n_rows))
  for (col in col_names) {
    df[[col]] <- ad$uns[[paste0(prefix, col)]]
  }
  df
}

1b. CRITICAL: Cache `ad$uns` before column access (Mar 2026 optimization)

Every ad$uns[[key]] call crosses the reticulate Python→R bridge (~221ms each). With 62+ groovy columns × 3 sheets, this creates a 16-second bottleneck. Fix: cache the entire .uns dict as an R list in a single bridge crossing:

# WRONG (16s): 186 individual bridge crossings
targets <- reconstruct_groovy_df(ad, "targets")  # calls ad$uns[[key]] 62 times

# RIGHT (0.2s): single bridge crossing, then R list indexing
uns <- ad$uns  # one crossing, returns full R list
targets <- reconstruct_groovy_df_from_list(uns, "targets")  # indexes R list (instant)

reconstruct_groovy_df_from_list <- function(uns_list, sheet) {
  prefix <- paste0("groovy_", sheet, "_")
  col_names <- uns_list[[paste0("groovy_", sheet, "_columns")]]
  n_rows <- as.integer(uns_list[[paste0("groovy_", sheet, "_n_rows")]])
  if (is.null(col_names) || n_rows == 0) return(NULL)
  df <- data.frame(row.names = seq_len(n_rows))
  for (col in col_names) df[[col]] <- uns_list[[paste0(prefix, col)]]
  if ("Case ID" %in% names(df)) df[["Case ID"]] <- suppressWarnings(as.integer(df[["Case ID"]]))
  df
}

This applies to ANY reticulate accessor: ad$obs[[col]], ad$var[[col]], ad$obsm[["key"]]. Always cache the parent object as an R object first, then index from the R side.

2. H5AD → Excel fallback in Shiny data loading

Design the H5AD loader to return the same list structure as the existing Excel loader. This means prep_data() needs zero changes.

# In data_loading.R
load_master_auto <- function(h5ad_path = NULL, excel_path = master_path) {
  if (!is.null(h5ad_path) && file.exists(h5ad_path)) {
    result <- load_master_h5ad(h5ad_path)
    if (!is.null(result)) return(result)
  }
  if (file.exists(excel_path)) return(load_master(excel_path))
  stop("No data source found")
}

# In app.R server
master <- reactive({
  load_master_auto(h5ad_path = h5ad_path, excel_path = master_path)
})
# prepared() calls prep_data(master()) — unchanged

3. scVI batch correction validation

Validate with three complementary metrics on a subsample (50K cells for efficiency):

# Silhouette batch score (lower = better mixing)
sil_pca = silhouette_score(X_pca, batch_labels, sample_size=10000)
sil_scvi = silhouette_score(X_scVI, batch_labels, sample_size=10000)
# Pass: sil_scvi < sil_pca

# Cell-type silhouette (higher = better separation, preserved biology)
sil_ct_scvi = silhouette_score(X_scVI, celltype_labels, sample_size=10000)
# Pass: sil_ct_scvi >= 0.8 * sil_ct_pca

# LISI (custom implementation using k-NN + inverse Simpson's index)
# Batch LISI: higher = better mixing (ideal = n_batches)
# Cell-type LISI: lower = better separation (ideal = 1)

3b. scVI covariate validation (2026-02-23)

The canonical model uses categorical_covariate_keys=["Age", "Gender"] with NO batch_key. Since each donor has a unique Age value, this effectively provides soft donor correction. A full retraining comparison confirmed the original model wins 4/5 key metrics vs a no-covariate alternative (donor silhouette: -0.04 vs +0.25; PC0 pseudotime r: 0.71 vs 0.10). See the dedicated scvi-covariate-validation skill for full methodology, metrics, and failed attempts.

4. Neighborhood metrics in `.obs` (Phase 7, 2026-02-19)

Store per-islet neighborhood metrics as additional .obs columns rather than .uns:

# scripts/compute_neighborhood_metrics.py → data/neighborhood_metrics.csv
# Then merged into adata.obs by build_h5ad_for_app.py (step 4.5)
nbr = pd.read_csv(nbr_path)
adata.obs['_combined_islet_id'] = (
    adata.obs['imageid'].astype(str) + '_' + adata.obs['base_islet_id'].astype(str)
)
nbr_merge = nbr.set_index('combined_islet_id')[nbr_cols]
adata.obs = adata.obs.join(nbr_merge, on='_combined_islet_id', how='left')

62 columns across 4 categories: peri-islet composition (peri_prop_*, peri_count_*), immune infiltration (immune_frac_*, cd8_to_macro_ratio, tcell_density_peri), enrichment z-scores (enrich_z_*), and distance metrics (min_dist_*). Extract on the R side with grep("^peri_prop_|^immune_|...", colnames(obs)).

5. Per-islet cell CSVs for drill-down (Phase 8, 2026-02-19)

Extract individual cells per qualified islet into small CSV files for client-side rendering:

# scripts/extract_per_islet_cells.py → data/cells/{imageid}_Islet_{N}.csv
# 949 files, 180K cells, 37 cols (X/Y, phenotype, region, morphology, 31 markers)
# Read backed H5AD in chunks of 50K cells for memory efficiency
adata_sc = ad.read_h5ad(sc_path, backed='r')
chunk_size = 50000
for start in range(0, adata_sc.n_obs, chunk_size):
    chunk = adata_sc[start:end].to_memory()
    # ... filter, extract, write per-islet CSVs

On the R side, use environment-based caching (drilldown_cache <- new.env()) to avoid re-reading CSVs on repeated clicks.

6. Trajectory rebuild from fixed pipeline

The fixed aggregation keeps only proteins in .X and uses X_scVI_mean for neighbors:

# Aggregate with fixed_islet_aggregation.py
adata = create_islet_dataset_fixed(adata_sc, region='islet_only', min_cells=20)
# .X = protein expression only (31 vars)
# .obsm['X_scVI_mean'] = batch-corrected latent means
# .obsm['phenotype_proportions'] = cell type fractions

# Compute trajectory
sc.pp.neighbors(adata, n_neighbors=15, use_rep='X_scVI_mean', metric='cosine')
sc.tl.paga(adata, groups='donor_status')
sc.tl.umap(adata, init_pos='paga')
# Root: ND islet with highest INS expression
sc.tl.diffmap(adata, n_comps=10)
sc.tl.dpt(adata)

Validation checks:

INS vs pseudotime: Spearman r < -0.3
GCG vs pseudotime: Spearman r > 0.2
Donor status ordering: mean(ND) < mean(Aab+) < mean(T1D)
No single donor > 80% of any pseudotime quintile

Failed Attempts (Critical)

Attempt	Why it Failed	Lesson Learned
Old pipeline: mixing morphology + protein in `.X` (47 vars)	Morphology features (cell area, nucleus size) dominated the neighbor graph and trajectory, masking biological signal from protein markers	Always keep `.X` as protein-only; store morphology in `.obsm['morphology']`
Old pipeline: n_neighbors=5	Too few neighbors created disconnected/noisy graphs, especially with <1000 islets	Use n_neighbors=15 (standard) for islet-level data; 5 is only appropriate for very dense single-cell data
Old pipeline: PCA-based neighbors (no batch correction)	Donor-specific technical effects drove the UMAP clustering, obscuring disease biology	Use `X_scVI_mean` with cosine metric for batch-corrected neighbor computation
Storing DataFrames directly in `.uns`	H5AD serialization doesn't handle pandas DataFrames well in `.uns` (type conversion issues)	Store each column as a separate numpy array with a naming convention prefix; reconstruct on the R side
Per-key `ad$uns[[col]]` in R loop	Each `ad$uns[[key]]` call crosses reticulate Python→R bridge (~221ms). With 62 cols × 3 sheets = 186 calls = 16 seconds. Profiled as the #1 startup bottleneck	Cache `uns <- ad$uns` as R list (0.2s single crossing), then index from R side. Applies to ALL reticulate accessors (`ad$obs`, `ad$var`, `ad$obsm`)
`reticulate::py$adata <- ad` in `source(f, local=TRUE)`	`py` is a special active binding that fails with `object 'reticulate' not found` when sourced locally. Even `requireNamespace("reticulate")` doesn't help — `py$` needs the package on the search path	Avoid `reticulate::py$` entirely for this use case. Cache R objects (`ad$uns`, `ad$obs`) instead of round-tripping through Python namespace
`<<-` for lazy-load flag across `source()` boundaries	`<<-` assigns to the defining environment, which may not match the reading environment when files are sourced with different `local` args	Use `new.env(parent=emptyenv())` with `$loaded` flag for lazy state. Environment mutation is scope-independent
Defining slash commands in CLAUDE.md	Claude Code only discovers skills from `.claude/skills/<name>/SKILL.md` files — CLAUDE.md is loaded as documentation context only, NOT for skill registration	Always create skill files in `.claude/skills/`; git submodule CLAUDE.md files are not searched for skills
`adata.obs.reset_index()` when index name matches column	`islets_core_fixed.h5ad` has `islet_id` as both the index name AND a column → pandas raises `ValueError: cannot insert islet_id, already exists`	Check `if idx_name in adata.obs.columns: reset_index(drop=True)` before any merge that uses reset_index
`fillna('')` on categorical columns	h5py can't write NaN in string columns, but `.fillna('')` on a Categorical raises `TypeError: Cannot setitem on a Categorical with a new category`	Always `.astype(str)` BEFORE `.fillna('')` for categorical obs columns
Using `CODEX_Pancreas_Donors.xlsx` for donor metadata	The Excel file has a different donor cohort (nPOD pilot: 6090, 6171...) than the CODEX data (6505, 6533...) — ALL merges produce NaN	Derive donor metadata (status, age, gender, AAb flags) from the h5ad obs itself, not from external donor key files. Build `donor_key_df` from `adata.obs.groupby('imageid')`
Missing `combined_islet_id` in rebuilt h5ad	`islets_core_fixed.h5ad` has `islet_id` (e.g., `6505_Islet_284`) but trajectory module references `combined_islet_id`	Add `adata.obs['combined_islet_id'] = adata.obs['islet_id'].copy()` before saving trajectory h5ad
Heatmap `mids` vs `hm` row count mismatch	`mids` computed for all 25 bins but `hm` only has rows for bins with >=3 points → `replacement has 25 rows, data has 24`	Use `match(as.character(hm$pt_bin), bin_levels)` to index `all_mids` so skipped bins are excluded
Diverging colormap for raw expression on UMAP	`scale_color_gradient2(limits=c(-3,3))` assumed z-scored data but `.X` values are raw mean expression — colorbar misleading	Use `scale_color_viridis_c(option="inferno", limits=range(value))` for continuous min/max scaling
Reading full 3.7 GB single-cell H5AD into memory	Python OOM on 16 GB machine. Even `backed='r'` loads `.obs` eagerly (~800 MB)	Use `backed='r'` AND read `.X` in chunks of 50K cells. Pre-filter `.obs` to relevant cells before materializing expression matrix
Storing neighborhood metrics in `.uns`	`.uns` column-by-column convention works for tabular data but is cumbersome for 62+ columns that are per-observation	Use `.obs` for per-observation metrics (1 value per islet). Reserve `.uns` for tabular data that doesn't map 1:1 to observations (like groovy sheets with different row counts)
Column name sanitization: spaces and `+` in phenotype names	`peri_prop_CD8a+ T cell` breaks R column access and H5AD string handling	Sanitize during Python preprocessing: `col.replace(' ', '_').replace('+', 'plus')` → `peri_prop_CD8aplus_T_cell`. Do NOT sanitize at load time in R
66 islets with zero peri-islet cells	Some islets have no `_exp20um` cells in the single-cell H5AD. Setting `total_cells_peri=0` caused division-by-zero in fraction metrics	Set `total_cells_peri=0` but use `NaN` for fraction/ratio metrics. In R, always use `na.rm=TRUE` and check `!is.na()` in conditional filters

Final Parameters

# Islet aggregation (fixed_islet_aggregation.py)
region: islet_only
min_cells: 20
require_paired: false  # Don't require peri-islet for trajectory

# Neighbor computation
n_neighbors: 15
use_rep: X_scVI_mean
metric: cosine

# UMAP
init_pos: paga
min_dist: 0.1
spread: 1.5

# DPT
n_comps: 10  # diffusion map components
root: ND islet with max INS expression

# scVI QC thresholds
silhouette_batch: scVI < PCA
silhouette_celltype: scVI >= 0.8 * PCA
batch_lisi: median(scVI) > median(PCA)

# Python environment
conda_env: scvi-env
python_packages: scanpy==1.11.5, anndata==0.12.4, scvi-tools==1.4.0, scib-metrics

Key Insights

Interface stability: Design H5AD loading to return identical structures as existing Excel loading. This means downstream code (prep_data, modules) needs zero changes. The fallback pattern (load_master_auto) is clean and testable.
.uns for tabular data: H5AD .uns is a flexible dict, but column-by-column storage with a naming convention (groovy_targets_colname) is more robust than trying to store entire DataFrames.
Subsample for QC metrics: scVI validation doesn't need all 2.6M cells. 50K cells (or 10K for LISI) gives reliable metrics in seconds vs minutes.
PAGA-initialized UMAP: init_pos='paga' gives more meaningful UMAP topology than random initialization, especially for trajectory data where group connectivity matters.
Validation gates: Define quantitative pass/fail criteria (r < -0.3 for INS, etc.) before running the pipeline. This prevents post-hoc rationalization of poor trajectories.
Claude Code skills: .claude/skills/<name>/SKILL.md is the ONLY mechanism for registering slash commands. CLAUDE.md, git submodules, and plugin.json files are NOT discovered as skills.
Donor metadata provenance: Never assume an external Excel "donor key" matches your data cohort. Build donor metadata from the canonical h5ad obs itself — it already has donor_status, age, gender, and AAb flags from the single-cell phenotyping.
Column name compatibility: When h5ad obs columns get renamed across pipeline steps (e.g., islet_id → combined_islet_id), add the expected alias before saving. Check what downstream consumers expect by grepping the app code.
H5AD write gotchas: Before adata.write_h5ad(), iterate all obs columns and convert categoricals → str then fillna(''). Also check that obs index name doesn't collide with a column name.
.obs vs .uns decision rule: Use .obs for metrics with 1 value per observation (neighborhood metrics, phenotype proportions). Use .uns for tabular data with different row counts (groovy exports with Case ID + islet_key as join keys).
Backed mode + chunking: For 3.7 GB single-cell H5ADs, backed='r' keeps .X on disk. But .obs is still loaded eagerly. Read .X in chunks of 50K cells to avoid OOM.
Column name sanitization pipeline: Sanitize in Python (space→_, +→plus) during preprocessing, NOT at R load time. This ensures H5AD column names are valid R identifiers. Pattern: col.replace(' ', '_').replace('+', 'plus').
Graceful NaN handling for missing peri data: 66/1,015 islets lack peri-islet cells. Use NaN for fraction metrics, 0 for count metrics. R side: na.rm=TRUE everywhere, conditional UI that checks !all(is.na(x)).

Execution Results (2026-02-17)

Metric	Value
scVI batch LISI	2.73 (PCA) → 8.07 (scVI)
INS vs pseudotime	r = -0.741
GCG vs pseudotime	r = 0.379
Donor ordering	ND=0.491 < Aab+=0.516 < T1D=0.743
Max donor quintile fraction	0.38 (threshold: <0.80)
Islets	1,015 (31 protein vars)
islet_explorer.h5ad	47 MB
H5AD vs Excel prep_data	targets=48,438, markers=64,584, comp=5,382 (identical)
Total validation checks	33/33 passed

Performance Optimization Results (2026-03-05)

Metric	Before	After
`reconstruct_groovy_df()` × 3	16s (186 bridge crossings)	0.2s (1 crossing + R list)
`annotations.tsv` eager load	3s (72 MB TSV)	0s (deferred, never loaded)
File sourcing	~4s	~1.5s
`anndata::read_h5ad()`	1.3s	1.3s (unchanged)
`prep_data()`	0.6s	0.7s (unchanged)
Total to first plot	~22s	~3.5s

7. Merging Leiden clustering from external H5AD (Phase 9, 2026-02-20)

Merge islet-level Leiden cluster assignments and UMAP coords from a separate clustering H5AD:

# scripts/build_h5ad_for_app.py — step 4.7
clust = ad.read_h5ad("islet_analysis/islets_core_clustered.h5ad")
leiden_cols = [c for c in clust.obs.columns if c.startswith("leiden_")]
for col in leiden_cols:
    adata.obs[col] = clust.obs[col].astype(str).reindex(adata.obs.index)
# UMAP from .obsm['X_umap'] → .obs columns for easy R extraction
umap_df = pd.DataFrame(clust.obsm['X_umap'], index=clust.obs.index,
                        columns=['leiden_umap_1', 'leiden_umap_2'])
for col in umap_df.columns:
    adata.obs[col] = umap_df[col].reindex(adata.obs.index)

Key: both H5ADs indexed by islet_id, so reindex() aligns correctly. Categorical Leiden columns → .astype(str) for h5py compatibility.

8. Per-donor tissue extraction for Spatial tab (Phase 9, 2026-02-20)

Extract ALL cells per donor (core + peri + tissue background) for whole-tissue scatter visualization:

# scripts/extract_per_donor_tissue.py → data/donors/{imageid}.csv
# 15 files, 2.65M total cells, ~78 MB
# Only 5 columns: X_centroid, Y_centroid, phenotype, cell_region, islet_name
# cell_region: "core" (Islet_N), "peri" (Islet_N_exp20um), "tissue" (all others)
obs = sc.obs[["imageid", "Parent", "phenotype", "X_centroid", "Y_centroid"]].copy()
# Parse Parent → cell_region + islet_name
# Group by imageid, write per-donor CSVs

Design choices:

No expression data (31 markers) — reduces 450+ MB → 78 MB. App only needs spatial coords + metadata for scatter.
Not backed-mode since we only read .obs (no .X access needed) — fast enough.
Donor 6533 has 0 core/peri cells (205K tissue-only) — no islet annotations in that sample.

Execution Results — Phase 7-8 (2026-02-19)

Metric	Value
Neighborhood metrics CSV	1,015 rows × 62 columns
Islets with peri data	949/1,015 (93.5%)
Biological signal (immune_frac_peri)	T1D=0.155, Aab+=0.106, ND=0.069
Per-islet cell CSVs	949 files, 180,019 cells, 110.8 MB
Columns per cell CSV	37 (X/Y, phenotype, region, morphology, 31 markers)
Rebuilt islet_explorer.h5ad	47.6 MB (+60 .obs columns)
R syntax check	All files pass `parse()`
App smoke test	Loads without errors

Execution Results — Phase 9 (2026-02-20)

Metric	Value
Per-donor tissue CSVs	15 files, 2,649,654 total cells, 77.8 MB
Avg cells/donor	176,644 (range: 131K-218K)
Leiden columns merged	4 (leiden_0.3: 4 clusters, leiden_0.5: 7, leiden_0.8: 12, leiden_1.0: 14)
UMAP coords added	leiden_umap_1, leiden_umap_2
Rebuilt islet_explorer.h5ad	47.6 MB (+6 leiden-related .obs columns)
Donor 6533 (edge case)	205K cells but 0 core, 0 peri (tissue background only)
R syntax check	All files parse successfully
App HTTP 200	Confirmed after worker restart

References

AnnData format — .uns for arbitrary metadata, .obs for per-observation metrics
scVI-tools — Batch correction with biological covariates
scib-metrics — Batch integration benchmarking
Scanpy trajectory analysis — DPT, PAGA, diffusion maps
scipy.spatial.cKDTree — Efficient nearest-neighbor distance computation for immune cell distances
Islet Explorer: data/DATA_PROVENANCE.md for full lineage documentation

name: h5ad-shiny-data-pipeline description: "Patterns for H5AD-backed Shiny apps with Excel fallback, groovy data in .uns, scVI QC validation, Leiden clustering merge, per-donor tissue extraction" author: smith6jt date: 2026-02-20