tooluniverse-single-cell

star 1.5k

Single-cell RNA-seq analysis with scanpy/anndata — h5ad data loading, scRNA-seq quality control and QC gating (n_genes_by_counts, total_counts, mitochondrial percent / pct_counts_mt, pct_counts_ribo, doublet detection with Scrublet/scDblFinder, ambient RNA / SoupX awareness, empty-droplet filtering, MAD-based thresholds), normalization, dimensionality reduction (PCA, UMAP, t-SNE), clustering (Leiden, Louvain), marker gene identification, cell-type annotation, pseudotime/trajectory analysis. Use for any scRNA-seq workflow, including deciding which cells to filter, flag, or investigate before downstream analysis.

mims-harvard By mims-harvard schedule Updated 6/12/2026

name: tooluniverse-single-cell description: Single-cell RNA-seq analysis with scanpy/anndata — h5ad data loading, scRNA-seq quality control and QC gating (n_genes_by_counts, total_counts, mitochondrial percent / pct_counts_mt, pct_counts_ribo, doublet detection with Scrublet/scDblFinder, ambient RNA / SoupX awareness, empty-droplet filtering, MAD-based thresholds), normalization, dimensionality reduction (PCA, UMAP, t-SNE), clustering (Leiden, Louvain), marker gene identification, cell-type annotation, pseudotime/trajectory analysis. Use for any scRNA-seq workflow, including deciding which cells to filter, flag, or investigate before downstream analysis. disable-model-invocation: true

Single-Cell Genomics and Expression Matrix Analysis

RULE ZERO — Check for pre-computed results FIRST

Before following any instruction below, scan the data folder for:

  • *_executed.ipynb → read with tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}' and cite its cell outputs as the authoritative answer
  • Pre-computed result files (CSV/TSV with names like *results*, *deseq*, *enrich*, *stats*, *_simplified.csv) → read directly and report the requested value
  • Canonical analysis scripts (analysis.R, run_*.py, find_*.R, *.Rmd) → execute as-is and read the output

Only follow this skill's re-analysis recipe below if none of the above exist. Re-running from raw data produces different numbers than the published answer and is much slower (often 5-10× turn count).


Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.


LOOK UP, DON'T GUESS

When uncertain about any scientific fact, SEARCH databases first (PubMed, UniProt, ChEMBL, ClinVar, etc.) rather than reasoning from memory. A database-verified answer is always more reliable than a guess.


When to Use This Skill

Apply when users:

  • Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
  • Need scRNA-seq quality control / QC gating: deciding cell filters by mito % (pct_counts_mt), gene/UMI counts, doublets, ambient RNA, empty droplets
  • Ask about cell type identification, clustering, or annotation
  • Need differential expression analysis by cell type or condition
  • Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
  • Ask about PCA, UMAP, t-SNE for expression data
  • Need Leiden/Louvain clustering on expression matrices
  • Want statistical comparisons between cell types (t-test, ANOVA, fold change)
  • Ask about marker genes, batch correction, trajectory, or cell-cell communication

NOT for (use other skills instead):

  • Bulk RNA-seq DESeq2 only -> tooluniverse-rnaseq-deseq2
  • Gene enrichment only -> tooluniverse-gene-enrichment
  • VCF/variant analysis -> tooluniverse-variant-analysis

Core Principles

  1. Data-first - Load, inspect, validate before analysis
  2. AnnData-centric - All data flows through anndata objects
  3. Cell type awareness - Per-cell-type subsetting when needed
  4. Statistical rigor - Normalization, multiple testing correction, effect sizes
  5. Question-driven - Parse what the user is actually asking

Required Packages

import scanpy as sc, anndata as ad, pandas as pd, numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from statsmodels.stats.multitest import multipletests
import gseapy as gp  # enrichment
import harmonypy     # batch correction (optional)

Install: pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels


Workflow Decision Tree

START: User question about scRNA-seq data
|
+-- FULL PIPELINE (raw counts -> annotated clusters)
|   Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE
|   See: references/scanpy_workflow.md
|
+-- DIFFERENTIAL EXPRESSION (per-cell-type comparison)
|   Most common pattern: per-cell-type DE
|   See: analysis_patterns.md "Pattern 1"
|
+-- CORRELATION ANALYSIS (gene property vs expression)
|   Pattern: Gene length vs expression correlation
|   See: analysis_patterns.md "Pattern 2"
|
+-- CLUSTERING & PCA (expression matrix analysis)
|   See: references/clustering_guide.md
|
+-- CELL COMMUNICATION (ligand-receptor interactions)
|   See: references/cell_communication.md
|
+-- TRAJECTORY ANALYSIS (pseudotime)
    See: references/trajectory_analysis.md

Data format handling:

  • h5ad -> sc.read_h5ad()
  • 10X -> sc.read_10x_mtx() or sc.read_10x_h5()
  • CSV/TSV -> pd.read_csv() -> Convert to AnnData (check orientation!)

Data Loading

AnnData expects: cells/samples as rows (obs), genes as columns (var)

adata = sc.read_h5ad("data.h5ad")  # h5ad already oriented

# CSV/TSV: check orientation
df = pd.read_csv("counts.csv", index_col=0)
if df.shape[0] > df.shape[1] * 5:  # genes > samples by 5x => transpose
    df = df.T
adata = ad.AnnData(df)

# Load metadata
meta = pd.read_csv("metadata.csv", index_col=0)
common = adata.obs_names.intersection(meta.index)
adata = adata[common].copy()
for col in meta.columns:
    adata.obs[col] = meta.loc[common, col]

Quality Control and QC Gating (do this BEFORE downstream analysis)

QC gating decides which cells and genes are real before normalization, clustering, or DE. Skipping or rushing it propagates silently: doublets become fake "intermediate" states, ambient RNA smears markers across clusters, and empty droplets inflate cell counts. Never report a filtered cell count without the gates applied, and never report cutoffs you did not actually run.

HONEST EXECUTION: QC runs scanpy/AnnData via Bash/Python. If scanpy is not installed, do NOT fabricate metrics — print the install plan and stop: python scripts/scrna_qc.py --install-plan (exits 0, lists what is missing, suggests pip install scanpy anndata scrublet).

Per-cell QC metrics (what to compute)

vn = adata.var_names.str.upper()
adata.var['mt']   = vn.str.startswith('MT-')            # mitochondrial
adata.var['ribo'] = vn.str.startswith(('RPS', 'RPL'))   # ribosomal protein
adata.var['hb']   = vn.str.contains(r'^HB[^P]', regex=True)  # hemoglobin (RBC)
sc.pp.calculate_qc_metrics(
    adata, qc_vars=['mt', 'ribo', 'hb'],
    percent_top=None,   # REQUIRED for small gene panels (<500) — else IndexError
    log1p=True, inplace=True)

Key metrics in adata.obs: n_genes_by_counts, total_counts, pct_counts_mt, pct_counts_ribo, pct_counts_hb.

WHY each metric flags a problem (reason from biology, not magic numbers)

  • High pct_counts_mt -> dying / stressed cell. A ruptured membrane lets cytoplasmic mRNA leak out while mito transcripts stay trapped, enriching the captured RNA for mito. Cutoff is tissue-dependent (cardiomyocytes/hepatocytes are mito-rich at baseline — a blanket 10% would discard healthy cells).
  • Low n_genes_by_counts / total_counts -> empty droplet or debris (only ambient RNA captured; few genes, low depth).
  • Very high counts/genes -> doublet (two transcriptomes ~double depth and diversity) — but high count alone is weak; use a doublet caller (below).
  • High pct_counts_hb -> RBC/blood contamination in solid tissue.

Choosing thresholds — distribution-aware (MAD), not hardcoded

Hardcoded cutoffs (mt<5%, n_genes<2500) are a starting point only; they break on mito-rich tissues and on shallow vs deep libraries. Prefer a MAD-based rule (robust to the very outliers you are removing): flag cells

nmads median-absolute-deviations from the median. Use nmads=5 on log1p counts/genes (both tails), nmads=3 upper-only on pct_counts_mt, and pair mito with a biological ceiling so a uniformly degraded sample doesn't pass. Always visualize distributions first (violin + total_counts vs pct_counts_mt scatter — dying cells sit in the low-count/high-mito corner).

Run the helper (computes metrics + MAD gating, reports per-step removals):

python scripts/scrna_qc.py data.h5ad --doublets        # or --install-plan first

Doublets, ambient RNA, empty droplets (per-cell metrics miss these)

  • Doublets: Scrublet (sc.pp.scrublet, scanpy >=1.10) or scDblFinder (R). Run per sample before merging; flag-cluster-drop (doublets form bridge clusters). expected_doublet_rate ~0.8%/1,000 cells recovered (10x).
  • Ambient RNA: cell-free "soup" mRNA in every droplet — a count-correction step (SoupX / DecontX, R), NOT a cell filter. Per-cell QC cannot detect it; suspect it when markers look implausibly ubiquitous.
  • Empty droplets: upstream of per-cell QC. CellRanger's filtered matrix already applies an EmptyDrops-style call; with only the raw matrix, run EmptyDrops (DropletUtils, R) or a barcode-rank knee before per-cell QC.

QC Interpretation Table (metric -> concern -> action)

QC metric Direction Typical concern Suggested action
pct_counts_mt high Dying / stressed cell (membrane rupture) Filter (MAD upper + tissue-aware ceiling; raise ceiling for mito-rich tissue)
n_genes_by_counts very low Empty droplet / debris Filter (min_genes ~200 + low-tail MAD)
total_counts very low Shallow / failed capture Filter (low-tail MAD; check barcode-rank knee)
n_genes_by_counts / total_counts very high Doublet (two cells in one droplet) Flag, run Scrublet/scDblFinder, then drop — don't hard-cap on counts alone
predicted_doublet (Scrublet) True Multiplet Filter per sample; cluster-then-drop if unsure
pct_counts_hb high RBC / blood contamination Filter in non-blood tissue; investigate in blood
pct_counts_ribo very high/low Low-complexity / stressed, or cell-type signal Flag / investigate (ribo is cell-type-specific; rarely a hard filter)
markers ubiquitous across clusters Ambient RNA contamination Investigate — run SoupX/DecontX, do not silently proceed
many cells, low median counts Empty droplets not removed Investigate — apply EmptyDrops / knee, re-filter

Full reasoning, MAD code, Scrublet/SoupX/EmptyDrops recipes, and order of operations: references/scrna_qc.md. Inline pipeline QC: references/scanpy_workflow.md Phase 2.


Complete Pipeline (Quick Reference)

import scanpy as sc

adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")

# QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalize + HVG + PCA
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata, n_comps=50)

# Cluster + UMAP
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)

# Find markers + Annotate + Per-cell-type DE
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')

Differential Expression Decision Tree

Single-Cell DE (many cells per condition):
  Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg
  Best for: Per-cell-type DE, marker gene finding

Pseudo-Bulk DE (aggregate counts by sample):
  Use: R DESeq2 via `tu run run_deseq2_analysis` or Rscript (NOT pydeseq2 — gives different DEG counts)
  Best for: Sample-level comparisons with replicates

Statistical Tests Only:
  Use: scipy.stats (ttest_ind, f_oneway, pearsonr)
  Best for: Correlation, ANOVA, t-tests on summaries

Statistical Tests (Quick Reference)

from scipy import stats
from statsmodels.stats.multitest import multipletests

# Pearson/Spearman correlation
r, p = stats.pearsonr(gene_lengths, mean_expression)

# Welch's t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)

# Multiple testing correction (BH)
reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')

Batch Correction (Harmony)

import harmonypy
sc.tl.pca(adata, n_comps=50)
ho = harmonypy.run_harmony(adata.obsm['X_pca'][:, :30], adata.obs, 'batch', random_state=0)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)

ToolUniverse Integration

Data Discovery (before analysis)

  • CxGDisc_search_datasets: Search CELLxGENE Discover for scRNA-seq datasets by disease, tissue, organism. Use broad disease terms (e.g., "breast cancer" not "triple-negative").
  • GEO_search_rnaseq_datasets / geo_search_datasets: Search GEO for scRNA-seq studies
  • NCBI_SRA_search_runs: Search SRA for sequencing runs (query="single cell RNA-seq [disease]")
  • OmicsDI_search_datasets: Cross-repository dataset search

Cell Type Markers

  • CellMarker_search_by_cell_type: Tissue-specific cell markers (use CellMarker_list_cell_types first — exact names required, e.g., "Regulatory T(Treg) cell" not "Regulatory T cell")
  • CellMarker_search_cancer_markers: Cancer-context markers with experimental evidence
  • CellMarker_search_by_gene: Reverse lookup — which cell types express a gene?
  • HPA_search_genes_by_query: Cell-type marker gene search

Gene Annotation

  • MyGene_query_genes / MyGene_batch_query: Gene ID conversion
  • ensembl_lookup_gene: Ensembl gene details
  • UniProt_get_function_by_accession: Protein function

Cell-Cell Communication

  • OmniPath_get_ligand_receptor_interactions: L-R pairs (CellPhoneDB, CellChatDB)
  • OmniPath_get_signaling_interactions: Downstream signaling
  • OmniPath_get_complexes: Multi-subunit receptors

Enrichment (Post-DE)

  • PANTHER_enrichment: GO enrichment (BP, MF, CC)
  • STRING_functional_enrichment: Network-based enrichment
  • ReactomeAnalysis_pathway_enrichment: Reactome pathways

Clinical Context (for tumor immunology)

  • DGIdb_get_drug_gene_interactions: Drug interactions for immune checkpoint targets (genes=["CD274"] for PD-L1)
  • civic_search_evidence_items: Clinical evidence for mutations/biomarkers
  • TIMER2_immune_estimation: TCGA immune infiltration correlation
  • search_clinical_trials: Clinical trial matching
  • GTEx_get_expression_summary: Normal tissue baseline expression
  • PubMed_search_articles: Literature context

Scanpy vs Seurat Equivalents

Operation Seurat (R) Scanpy (Python)
Load data Read10X() sc.read_10x_mtx()
Normalize NormalizeData() sc.pp.normalize_total() + sc.pp.log1p()
Find HVGs FindVariableFeatures() sc.pp.highly_variable_genes()
PCA RunPCA() sc.tl.pca()
Cluster FindClusters() sc.tl.leiden()
UMAP RunUMAP() sc.tl.umap()
Find markers FindMarkers() sc.tl.rank_genes_groups()
Batch correction RunHarmony() harmonypy.run_harmony()

Reasoning Framework for Result Interpretation

Evidence Grading

Grade Criteria Example
High confidence Marker padj < 0.01, log2FC > 1, expressed in > 25% of cluster cells CD3D as T-cell marker with padj = 1e-50, log2FC = 3.2, pct = 0.85
Moderate confidence padj < 0.05, log2FC > 0.5, or expressed in 10-25% of cluster FOXP3 in Treg cluster with padj = 0.001, pct = 0.18
Low confidence padj < 0.05 but log2FC < 0.5 or low pct_diff between clusters Ubiquitously expressed gene with marginal enrichment
Unreliable Fewer than 20 cells in cluster, or QC metrics suggest doublets Cluster with mean nGenes > 6000 and high doublet score

Interpretation Guidance

  • QC metric thresholds: Standard filters are nGenes > 200 (remove empty droplets), nGenes < 5000-6000 (remove doublets), pct_counts_mt < 20% (remove dying cells). These thresholds are tissue-dependent: immune cells tolerate stricter nGene filters; neurons may have higher mitochondrial content naturally. Always visualize distributions before setting cutoffs.
  • Cluster resolution guidance: Leiden resolution 0.3-0.5 yields broad cell types (T cells, B cells, myeloid). Resolution 0.8-1.2 resolves subtypes (CD4 naive, CD4 memory, Treg). Resolution > 2.0 risks over-clustering (splitting biologically homogeneous populations). Validate by checking that each cluster has distinct marker genes.
  • Marker gene confidence levels: A strong marker is highly specific (high pct_diff between cluster and rest) and highly expressed (high log2FC). Genes expressed in many clusters with small fold changes are poor markers. Cross-reference with known markers from CellMarker or HPA databases.
  • Pseudo-bulk vs single-cell DE: For comparing conditions (treatment vs control), pseudo-bulk DE (aggregate by sample, then DESeq2) is more statistically valid than single-cell DE, which inflates significance due to non-independence of cells from the same sample.
  • Batch effects: If samples cluster by batch rather than biology on UMAP, apply Harmony or other correction before biological interpretation.

Synthesis Questions

  1. Do the identified clusters correspond to known cell types based on canonical markers, or do some clusters lack clear biological identity (potentially doublets or low-quality cells)?
  2. At the chosen clustering resolution, are there clusters that merge when resolution is lowered, suggesting they may be a single cell type split by technical noise?
  3. For differential expression between conditions, are the results consistent between single-cell and pseudo-bulk approaches, and do the top DE genes have known biological relevance?
  4. Do QC-flagged cells (high mito, extreme gene counts) concentrate in specific clusters, and does removing them change the clustering structure?
  5. If batch correction was applied, do post-correction clusters still maintain expected cell-type-specific marker expression?

Troubleshooting

Issue Solution
ModuleNotFoundError: leidenalg pip install leidenalg
Sparse matrix errors .toarray(): X = adata.X.toarray() if issparse(adata.X) else adata.X
Wrong matrix orientation More genes than samples? Transpose
NaN in correlation Filter: valid = ~np.isnan(x) & ~np.isnan(y)
Too few cells for DE Need >= 3 cells per condition per cell type
Memory error Use sc.pp.highly_variable_genes() to reduce features

Reference Documentation

Detailed Analysis Patterns: analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication)

Core Workflows:

  • references/scrna_qc.md - scRNA-seq QC gating (mito%, doublets, ambient RNA, empty droplets, MAD thresholds)
  • references/scanpy_workflow.md - Complete scanpy pipeline
  • references/seurat_workflow.md - Seurat to Scanpy translation
  • references/clustering_guide.md - Clustering methods
  • references/marker_identification.md - Marker genes, annotation
  • references/trajectory_analysis.md - Pseudotime
  • references/cell_communication.md - OmniPath/CellPhoneDB workflow
  • references/troubleshooting.md - Detailed error solutions

Analysis Conventions

DESeq2 library choice: match the authoritative script

If the data folder contains an authoritative script (run_*.py, analysis.R), use whichever DESeq2 library it uses (pydeseq2 or R DESeq2). The two libraries give slightly different DEG counts (~2-10% at the same thresholds), so matching matters. If no script exists, prefer R DESeq2 via the run_deseq2_analysis tool or Rscript:

tu run run_deseq2_analysis '{"operation":"deseq2","counts_file":"pseudo_bulk_counts.csv","metadata_file":"sample_meta.csv","design":"~ sex","contrast":"sex, M, F","lfc_shrinkage":true}'
Install via CLI
npx skills add https://github.com/mims-harvard/ToolUniverse --skill tooluniverse-single-cell
Repository Details
star Stars 1,459
call_split Forks 224
navigation Branch main
article Path SKILL.md
More from Creator
mims-harvard
mims-harvard Explore all skills →