h5ad-shiny-data-pipeline

star 2

Patterns for H5AD-backed Shiny apps with Excel fallback, groovy data in .uns, scVI QC validation, Leiden clustering merge, per-donor tissue extraction

smith6jt-cop By smith6jt-cop schedule Updated 3/5/2026

name: h5ad-shiny-data-pipeline description: "Patterns for H5AD-backed Shiny apps with Excel fallback, groovy data in .uns, scVI QC validation, Leiden clustering merge, per-donor tissue extraction" author: smith6jt date: 2026-02-20

H5AD → Shiny Data Pipeline - Research Notes

Experiment Overview

Item Details
Date 2026-02-17
Goal Replace scattered Excel data loading with a single validated H5AD pipeline, while maintaining backward compatibility with existing Excel-based loading
Environment R 4.x, Shiny 1.12.1, anndata (R pkg), Python: scanpy 1.11.5, anndata 0.12.4, scvi-tools 1.4.0, scib-metrics, conda env scvi-env
Status Success

Context

The Islet Explorer app loaded data from master_results.xlsx (4 sheets: targets, markers, composition, LGALS3), built from per-donor TSV exports. The trajectory tab loaded a separate adata_ins_root.h5ad that was produced by a problematic pipeline (mixed morphology+protein features, weak n_neighbors=5, missing batch correction). Phase 2 rebuilt the pipeline and unified data loading.

Key problems with the old pipeline:

  • .X contained 47 variables (proteins + morphology mixed) — morphology dominated trajectory
  • n_neighbors=5 (non-standard, too sparse)
  • No X_scVI_mean (no batch correction in embeddings)
  • Missing phenotype_proportions in .obsm
  • 5,158 islets (combined core+peri regions)
  • Donor 6533 potentially missing

Verified Workflow

1. Store groovy/TSV tabular data in H5AD .uns

H5AD's .uns dict can store arbitrary arrays. For tabular data (like QuPath TSV exports), store each column as a separate array with a naming convention:

# Python: scripts/build_h5ad_for_app.py
for col in targets_df.columns:
    vals = targets_df[col].values
    key = f'groovy_targets_{col}'
    if pd.api.types.is_bool_dtype(vals):
        adata.uns[key] = vals.astype(bool)
    elif pd.api.types.is_numeric_dtype(vals):
        adata.uns[key] = np.array(vals, dtype=float)
    else:
        adata.uns[key] = np.array(vals, dtype=str)

# Also store column names and row count for reconstruction
adata.uns['groovy_targets_columns'] = list(targets_df.columns)
adata.uns['groovy_targets_n_rows'] = len(targets_df)
# R: Reconstruct DataFrame from .uns arrays
reconstruct_groovy_df <- function(ad, sheet) {
  prefix <- paste0("groovy_", sheet, "_")
  cols_key <- paste0("groovy_", sheet, "_columns")
  col_names <- ad$uns[[cols_key]]
  n_rows <- as.integer(ad$uns[[paste0("groovy_", sheet, "_n_rows")]])
  df <- data.frame(row.names = seq_len(n_rows))
  for (col in col_names) {
    df[[col]] <- ad$uns[[paste0(prefix, col)]]
  }
  df
}

1b. CRITICAL: Cache ad$uns before column access (Mar 2026 optimization)

Every ad$uns[[key]] call crosses the reticulate Python→R bridge (~221ms each). With 62+ groovy columns × 3 sheets, this creates a 16-second bottleneck. Fix: cache the entire .uns dict as an R list in a single bridge crossing:

# WRONG (16s): 186 individual bridge crossings
targets <- reconstruct_groovy_df(ad, "targets")  # calls ad$uns[[key]] 62 times

# RIGHT (0.2s): single bridge crossing, then R list indexing
uns <- ad$uns  # one crossing, returns full R list
targets <- reconstruct_groovy_df_from_list(uns, "targets")  # indexes R list (instant)
reconstruct_groovy_df_from_list <- function(uns_list, sheet) {
  prefix <- paste0("groovy_", sheet, "_")
  col_names <- uns_list[[paste0("groovy_", sheet, "_columns")]]
  n_rows <- as.integer(uns_list[[paste0("groovy_", sheet, "_n_rows")]])
  if (is.null(col_names) || n_rows == 0) return(NULL)
  df <- data.frame(row.names = seq_len(n_rows))
  for (col in col_names) df[[col]] <- uns_list[[paste0(prefix, col)]]
  if ("Case ID" %in% names(df)) df[["Case ID"]] <- suppressWarnings(as.integer(df[["Case ID"]]))
  df
}

This applies to ANY reticulate accessor: ad$obs[[col]], ad$var[[col]], ad$obsm[["key"]]. Always cache the parent object as an R object first, then index from the R side.

2. H5AD → Excel fallback in Shiny data loading

Design the H5AD loader to return the same list structure as the existing Excel loader. This means prep_data() needs zero changes.

# In data_loading.R
load_master_auto <- function(h5ad_path = NULL, excel_path = master_path) {
  if (!is.null(h5ad_path) && file.exists(h5ad_path)) {
    result <- load_master_h5ad(h5ad_path)
    if (!is.null(result)) return(result)
  }
  if (file.exists(excel_path)) return(load_master(excel_path))
  stop("No data source found")
}

# In app.R server
master <- reactive({
  load_master_auto(h5ad_path = h5ad_path, excel_path = master_path)
})
# prepared() calls prep_data(master()) — unchanged

3. scVI batch correction validation

Validate with three complementary metrics on a subsample (50K cells for efficiency):

# Silhouette batch score (lower = better mixing)
sil_pca = silhouette_score(X_pca, batch_labels, sample_size=10000)
sil_scvi = silhouette_score(X_scVI, batch_labels, sample_size=10000)
# Pass: sil_scvi < sil_pca

# Cell-type silhouette (higher = better separation, preserved biology)
sil_ct_scvi = silhouette_score(X_scVI, celltype_labels, sample_size=10000)
# Pass: sil_ct_scvi >= 0.8 * sil_ct_pca

# LISI (custom implementation using k-NN + inverse Simpson's index)
# Batch LISI: higher = better mixing (ideal = n_batches)
# Cell-type LISI: lower = better separation (ideal = 1)

3b. scVI covariate validation (2026-02-23)

The canonical model uses categorical_covariate_keys=["Age", "Gender"] with NO batch_key. Since each donor has a unique Age value, this effectively provides soft donor correction. A full retraining comparison confirmed the original model wins 4/5 key metrics vs a no-covariate alternative (donor silhouette: -0.04 vs +0.25; PC0 pseudotime r: 0.71 vs 0.10). See the dedicated scvi-covariate-validation skill for full methodology, metrics, and failed attempts.

4. Neighborhood metrics in .obs (Phase 7, 2026-02-19)

Store per-islet neighborhood metrics as additional .obs columns rather than .uns:

# scripts/compute_neighborhood_metrics.py → data/neighborhood_metrics.csv
# Then merged into adata.obs by build_h5ad_for_app.py (step 4.5)
nbr = pd.read_csv(nbr_path)
adata.obs['_combined_islet_id'] = (
    adata.obs['imageid'].astype(str) + '_' + adata.obs['base_islet_id'].astype(str)
)
nbr_merge = nbr.set_index('combined_islet_id')[nbr_cols]
adata.obs = adata.obs.join(nbr_merge, on='_combined_islet_id', how='left')

62 columns across 4 categories: peri-islet composition (peri_prop_*, peri_count_*), immune infiltration (immune_frac_*, cd8_to_macro_ratio, tcell_density_peri), enrichment z-scores (enrich_z_*), and distance metrics (min_dist_*). Extract on the R side with grep("^peri_prop_|^immune_|...", colnames(obs)).

5. Per-islet cell CSVs for drill-down (Phase 8, 2026-02-19)

Extract individual cells per qualified islet into small CSV files for client-side rendering:

# scripts/extract_per_islet_cells.py → data/cells/{imageid}_Islet_{N}.csv
# 949 files, 180K cells, 37 cols (X/Y, phenotype, region, morphology, 31 markers)
# Read backed H5AD in chunks of 50K cells for memory efficiency
adata_sc = ad.read_h5ad(sc_path, backed='r')
chunk_size = 50000
for start in range(0, adata_sc.n_obs, chunk_size):
    chunk = adata_sc[start:end].to_memory()
    # ... filter, extract, write per-islet CSVs

On the R side, use environment-based caching (drilldown_cache <- new.env()) to avoid re-reading CSVs on repeated clicks.

6. Trajectory rebuild from fixed pipeline

The fixed aggregation keeps only proteins in .X and uses X_scVI_mean for neighbors:

# Aggregate with fixed_islet_aggregation.py
adata = create_islet_dataset_fixed(adata_sc, region='islet_only', min_cells=20)
# .X = protein expression only (31 vars)
# .obsm['X_scVI_mean'] = batch-corrected latent means
# .obsm['phenotype_proportions'] = cell type fractions

# Compute trajectory
sc.pp.neighbors(adata, n_neighbors=15, use_rep='X_scVI_mean', metric='cosine')
sc.tl.paga(adata, groups='donor_status')
sc.tl.umap(adata, init_pos='paga')
# Root: ND islet with highest INS expression
sc.tl.diffmap(adata, n_comps=10)
sc.tl.dpt(adata)

Validation checks:

  • INS vs pseudotime: Spearman r < -0.3
  • GCG vs pseudotime: Spearman r > 0.2
  • Donor status ordering: mean(ND) < mean(Aab+) < mean(T1D)
  • No single donor > 80% of any pseudotime quintile

Failed Attempts (Critical)

Attempt Why it Failed Lesson Learned
Old pipeline: mixing morphology + protein in .X (47 vars) Morphology features (cell area, nucleus size) dominated the neighbor graph and trajectory, masking biological signal from protein markers Always keep .X as protein-only; store morphology in .obsm['morphology']
Old pipeline: n_neighbors=5 Too few neighbors created disconnected/noisy graphs, especially with <1000 islets Use n_neighbors=15 (standard) for islet-level data; 5 is only appropriate for very dense single-cell data
Old pipeline: PCA-based neighbors (no batch correction) Donor-specific technical effects drove the UMAP clustering, obscuring disease biology Use X_scVI_mean with cosine metric for batch-corrected neighbor computation
Storing DataFrames directly in .uns H5AD serialization doesn't handle pandas DataFrames well in .uns (type conversion issues) Store each column as a separate numpy array with a naming convention prefix; reconstruct on the R side
Per-key ad$uns[[col]] in R loop Each ad$uns[[key]] call crosses reticulate Python→R bridge (~221ms). With 62 cols × 3 sheets = 186 calls = 16 seconds. Profiled as the #1 startup bottleneck Cache uns <- ad$uns as R list (0.2s single crossing), then index from R side. Applies to ALL reticulate accessors (ad$obs, ad$var, ad$obsm)
reticulate::py$adata <- ad in source(f, local=TRUE) py is a special active binding that fails with object 'reticulate' not found when sourced locally. Even requireNamespace("reticulate") doesn't help — py$ needs the package on the search path Avoid reticulate::py$ entirely for this use case. Cache R objects (ad$uns, ad$obs) instead of round-tripping through Python namespace
<<- for lazy-load flag across source() boundaries <<- assigns to the defining environment, which may not match the reading environment when files are sourced with different local args Use new.env(parent=emptyenv()) with $loaded flag for lazy state. Environment mutation is scope-independent
Defining slash commands in CLAUDE.md Claude Code only discovers skills from .claude/skills/<name>/SKILL.md files — CLAUDE.md is loaded as documentation context only, NOT for skill registration Always create skill files in .claude/skills/; git submodule CLAUDE.md files are not searched for skills
adata.obs.reset_index() when index name matches column islets_core_fixed.h5ad has islet_id as both the index name AND a column → pandas raises ValueError: cannot insert islet_id, already exists Check if idx_name in adata.obs.columns: reset_index(drop=True) before any merge that uses reset_index
fillna('') on categorical columns h5py can't write NaN in string columns, but .fillna('') on a Categorical raises TypeError: Cannot setitem on a Categorical with a new category Always .astype(str) BEFORE .fillna('') for categorical obs columns
Using CODEX_Pancreas_Donors.xlsx for donor metadata The Excel file has a different donor cohort (nPOD pilot: 6090, 6171...) than the CODEX data (6505, 6533...) — ALL merges produce NaN Derive donor metadata (status, age, gender, AAb flags) from the h5ad obs itself, not from external donor key files. Build donor_key_df from adata.obs.groupby('imageid')
Missing combined_islet_id in rebuilt h5ad islets_core_fixed.h5ad has islet_id (e.g., 6505_Islet_284) but trajectory module references combined_islet_id Add adata.obs['combined_islet_id'] = adata.obs['islet_id'].copy() before saving trajectory h5ad
Heatmap mids vs hm row count mismatch mids computed for all 25 bins but hm only has rows for bins with >=3 points → replacement has 25 rows, data has 24 Use match(as.character(hm$pt_bin), bin_levels) to index all_mids so skipped bins are excluded
Diverging colormap for raw expression on UMAP scale_color_gradient2(limits=c(-3,3)) assumed z-scored data but .X values are raw mean expression — colorbar misleading Use scale_color_viridis_c(option="inferno", limits=range(value)) for continuous min/max scaling
Reading full 3.7 GB single-cell H5AD into memory Python OOM on 16 GB machine. Even backed='r' loads .obs eagerly (~800 MB) Use backed='r' AND read .X in chunks of 50K cells. Pre-filter .obs to relevant cells before materializing expression matrix
Storing neighborhood metrics in .uns .uns column-by-column convention works for tabular data but is cumbersome for 62+ columns that are per-observation Use .obs for per-observation metrics (1 value per islet). Reserve .uns for tabular data that doesn't map 1:1 to observations (like groovy sheets with different row counts)
Column name sanitization: spaces and + in phenotype names peri_prop_CD8a+ T cell breaks R column access and H5AD string handling Sanitize during Python preprocessing: col.replace(' ', '_').replace('+', 'plus')peri_prop_CD8aplus_T_cell. Do NOT sanitize at load time in R
66 islets with zero peri-islet cells Some islets have no _exp20um cells in the single-cell H5AD. Setting total_cells_peri=0 caused division-by-zero in fraction metrics Set total_cells_peri=0 but use NaN for fraction/ratio metrics. In R, always use na.rm=TRUE and check !is.na() in conditional filters

Final Parameters

# Islet aggregation (fixed_islet_aggregation.py)
region: islet_only
min_cells: 20
require_paired: false  # Don't require peri-islet for trajectory

# Neighbor computation
n_neighbors: 15
use_rep: X_scVI_mean
metric: cosine

# UMAP
init_pos: paga
min_dist: 0.1
spread: 1.5

# DPT
n_comps: 10  # diffusion map components
root: ND islet with max INS expression

# scVI QC thresholds
silhouette_batch: scVI < PCA
silhouette_celltype: scVI >= 0.8 * PCA
batch_lisi: median(scVI) > median(PCA)

# Python environment
conda_env: scvi-env
python_packages: scanpy==1.11.5, anndata==0.12.4, scvi-tools==1.4.0, scib-metrics

Key Insights

  • Interface stability: Design H5AD loading to return identical structures as existing Excel loading. This means downstream code (prep_data, modules) needs zero changes. The fallback pattern (load_master_auto) is clean and testable.
  • .uns for tabular data: H5AD .uns is a flexible dict, but column-by-column storage with a naming convention (groovy_targets_colname) is more robust than trying to store entire DataFrames.
  • Subsample for QC metrics: scVI validation doesn't need all 2.6M cells. 50K cells (or 10K for LISI) gives reliable metrics in seconds vs minutes.
  • PAGA-initialized UMAP: init_pos='paga' gives more meaningful UMAP topology than random initialization, especially for trajectory data where group connectivity matters.
  • Validation gates: Define quantitative pass/fail criteria (r < -0.3 for INS, etc.) before running the pipeline. This prevents post-hoc rationalization of poor trajectories.
  • Claude Code skills: .claude/skills/<name>/SKILL.md is the ONLY mechanism for registering slash commands. CLAUDE.md, git submodules, and plugin.json files are NOT discovered as skills.
  • Donor metadata provenance: Never assume an external Excel "donor key" matches your data cohort. Build donor metadata from the canonical h5ad obs itself — it already has donor_status, age, gender, and AAb flags from the single-cell phenotyping.
  • Column name compatibility: When h5ad obs columns get renamed across pipeline steps (e.g., islet_idcombined_islet_id), add the expected alias before saving. Check what downstream consumers expect by grepping the app code.
  • H5AD write gotchas: Before adata.write_h5ad(), iterate all obs columns and convert categoricals → str then fillna(''). Also check that obs index name doesn't collide with a column name.
  • .obs vs .uns decision rule: Use .obs for metrics with 1 value per observation (neighborhood metrics, phenotype proportions). Use .uns for tabular data with different row counts (groovy exports with Case ID + islet_key as join keys).
  • Backed mode + chunking: For 3.7 GB single-cell H5ADs, backed='r' keeps .X on disk. But .obs is still loaded eagerly. Read .X in chunks of 50K cells to avoid OOM.
  • Column name sanitization pipeline: Sanitize in Python (space→_, +→plus) during preprocessing, NOT at R load time. This ensures H5AD column names are valid R identifiers. Pattern: col.replace(' ', '_').replace('+', 'plus').
  • Graceful NaN handling for missing peri data: 66/1,015 islets lack peri-islet cells. Use NaN for fraction metrics, 0 for count metrics. R side: na.rm=TRUE everywhere, conditional UI that checks !all(is.na(x)).

Execution Results (2026-02-17)

Metric Value
scVI batch LISI 2.73 (PCA) → 8.07 (scVI)
INS vs pseudotime r = -0.741
GCG vs pseudotime r = 0.379
Donor ordering ND=0.491 < Aab+=0.516 < T1D=0.743
Max donor quintile fraction 0.38 (threshold: <0.80)
Islets 1,015 (31 protein vars)
islet_explorer.h5ad 47 MB
H5AD vs Excel prep_data targets=48,438, markers=64,584, comp=5,382 (identical)
Total validation checks 33/33 passed

Performance Optimization Results (2026-03-05)

Metric Before After
reconstruct_groovy_df() × 3 16s (186 bridge crossings) 0.2s (1 crossing + R list)
annotations.tsv eager load 3s (72 MB TSV) 0s (deferred, never loaded)
File sourcing ~4s ~1.5s
anndata::read_h5ad() 1.3s 1.3s (unchanged)
prep_data() 0.6s 0.7s (unchanged)
Total to first plot ~22s ~3.5s

7. Merging Leiden clustering from external H5AD (Phase 9, 2026-02-20)

Merge islet-level Leiden cluster assignments and UMAP coords from a separate clustering H5AD:

# scripts/build_h5ad_for_app.py — step 4.7
clust = ad.read_h5ad("islet_analysis/islets_core_clustered.h5ad")
leiden_cols = [c for c in clust.obs.columns if c.startswith("leiden_")]
for col in leiden_cols:
    adata.obs[col] = clust.obs[col].astype(str).reindex(adata.obs.index)
# UMAP from .obsm['X_umap'] → .obs columns for easy R extraction
umap_df = pd.DataFrame(clust.obsm['X_umap'], index=clust.obs.index,
                        columns=['leiden_umap_1', 'leiden_umap_2'])
for col in umap_df.columns:
    adata.obs[col] = umap_df[col].reindex(adata.obs.index)

Key: both H5ADs indexed by islet_id, so reindex() aligns correctly. Categorical Leiden columns → .astype(str) for h5py compatibility.

8. Per-donor tissue extraction for Spatial tab (Phase 9, 2026-02-20)

Extract ALL cells per donor (core + peri + tissue background) for whole-tissue scatter visualization:

# scripts/extract_per_donor_tissue.py → data/donors/{imageid}.csv
# 15 files, 2.65M total cells, ~78 MB
# Only 5 columns: X_centroid, Y_centroid, phenotype, cell_region, islet_name
# cell_region: "core" (Islet_N), "peri" (Islet_N_exp20um), "tissue" (all others)
obs = sc.obs[["imageid", "Parent", "phenotype", "X_centroid", "Y_centroid"]].copy()
# Parse Parent → cell_region + islet_name
# Group by imageid, write per-donor CSVs

Design choices:

  • No expression data (31 markers) — reduces 450+ MB → 78 MB. App only needs spatial coords + metadata for scatter.
  • Not backed-mode since we only read .obs (no .X access needed) — fast enough.
  • Donor 6533 has 0 core/peri cells (205K tissue-only) — no islet annotations in that sample.

Execution Results — Phase 7-8 (2026-02-19)

Metric Value
Neighborhood metrics CSV 1,015 rows × 62 columns
Islets with peri data 949/1,015 (93.5%)
Biological signal (immune_frac_peri) T1D=0.155, Aab+=0.106, ND=0.069
Per-islet cell CSVs 949 files, 180,019 cells, 110.8 MB
Columns per cell CSV 37 (X/Y, phenotype, region, morphology, 31 markers)
Rebuilt islet_explorer.h5ad 47.6 MB (+60 .obs columns)
R syntax check All files pass parse()
App smoke test Loads without errors

Execution Results — Phase 9 (2026-02-20)

Metric Value
Per-donor tissue CSVs 15 files, 2,649,654 total cells, 77.8 MB
Avg cells/donor 176,644 (range: 131K-218K)
Leiden columns merged 4 (leiden_0.3: 4 clusters, leiden_0.5: 7, leiden_0.8: 12, leiden_1.0: 14)
UMAP coords added leiden_umap_1, leiden_umap_2
Rebuilt islet_explorer.h5ad 47.6 MB (+6 leiden-related .obs columns)
Donor 6533 (edge case) 205K cells but 0 core, 0 peri (tissue background only)
R syntax check All files parse successfully
App HTTP 200 Confirmed after worker restart

References

  • AnnData format.uns for arbitrary metadata, .obs for per-observation metrics
  • scVI-tools — Batch correction with biological covariates
  • scib-metrics — Batch integration benchmarking
  • Scanpy trajectory analysis — DPT, PAGA, diffusion maps
  • scipy.spatial.cKDTree — Efficient nearest-neighbor distance computation for immune cell distances
  • Islet Explorer: data/DATA_PROVENANCE.md for full lineage documentation
Install via CLI
npx skills add https://github.com/smith6jt-cop/Skills_Registry --skill h5ad-shiny-data-pipeline
Repository Details
star Stars 2
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator
smith6jt-cop
smith6jt-cop Explore all skills →