single-cell-foundation-model-stofm

name: single-cell-foundation-model-stofm
description: >
  Use this skill when a task involves the local SToFM project in /DATA/disk0/zhaosy/home/SToFM, especially preprocessing spatial transcriptomics data for SToFM, generating cell embeddings with the cell encoder plus SE(2) Transformer pipeline, handling spatial coordinates, or preparing SToFM embeddings for downstream region segmentation or cell type annotation.

Use this skill for the local SToFM repository at /DATA/disk0/zhaosy/home/SToFM. It is the right choice when the task involves:

- preprocessing spatial transcriptomics data into the format expected by SToFM
- converting mouse genes to the human Geneformer vocabulary when needed
- generating cell embeddings with the official get_embeddings.py pipeline
- working with spatial coordinates, sub-slice splitting, or hypernode construction
- using SToFM embeddings for downstream region segmentation or cell type annotation
- understanding the repo's two-stage architecture: cell encoder plus SE(2) Transformer

Do not use this skill for ordinary scRNA-seq analysis without spatial coordinates.

- Confirm the data has usable spatial coordinates.
- Check whether the input has already been preprocessed into both data.h5ad and hf.dataset.
- Check that the required checkpoints exist for both the cell encoder and the SE(2) Transformer.
- Prefer the official embedding pipeline before building downstream heads.
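
The file-level checks above can be scripted. The following is a minimal sketch, not part of the SToFM repo: the function name check_stofm_inputs, the directory layout, and the checkpoint labels and paths are all hypothetical and supplied by the caller. It only tests that paths exist; confirming that the AnnData object actually carries usable spatial coordinates still has to be done by inspecting the data.

```python
from pathlib import Path


def check_stofm_inputs(data_dir, checkpoint_paths):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper: the names and layout here are assumptions, not repo API.
    """
    data_dir = Path(data_dir)
    problems = []

    # Both preprocessing outputs are needed: hf.dataset feeds the cell encoder,
    # and data.h5ad is loaded later for the spatial stage.
    for name in ("data.h5ad", "hf.dataset"):
        if not (data_dir / name).exists():
            problems.append(f"missing preprocessing output: {data_dir / name}")

    # checkpoint_paths maps a free-form label to wherever that checkpoint
    # was downloaded; the repo does not ship checkpoints.
    for label, path in checkpoint_paths.items():
        if not Path(path).exists():
            problems.append(f"missing {label} checkpoint: {path}")

    return problems


if __name__ == "__main__":
    # Placeholder paths; replace with real locations.
    issues = check_stofm_inputs(
        "path/to/sample",
        {
            "cell encoder": "path/to/cell_encoder_checkpoint",
            "SE(2) Transformer": "path/to/se2_checkpoint",
        },
    )
    for issue in issues:
        print(issue)
```

If the returned list mentions data.h5ad or hf.dataset, run preprocessing first; if it mentions a checkpoint, download it before calling the embedding pipeline.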

Use preprocessing/preprocess.py first unless the dataset is already in the expected SToFM format.

The repo's preprocessing flow:

- starts from AnnData
- expects Geneformer-style transcriptome tokenization
- adds obs["n_counts"]
- uses var["ensembl_id"]
- maps mouse gene IDs to human IDs when needed
- saves both:
  - hf.dataset for the cell encoder
  - data.h5ad for later spatial loading
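
To make two of these steps concrete, here is a small illustration in plain Python of what per-cell n_counts and a mouse-to-human ID translation amount to. It is a sketch under stated assumptions, not the code in preprocessing/preprocess.py: the function names, the ENSMUSG prefix test, the decision to drop unmapped genes, and the placeholder IDs are all hypothetical, and the repo's actual mapping table is not reproduced here. In practice these values live in AnnData's obs and var rather than in lists.

```python
def map_mouse_to_human(gene_ids, mouse_to_human):
    """Translate mouse Ensembl IDs to human ones; other IDs pass through unchanged.

    Returns (mapped_ids, keep): keep[i] is False when mouse gene i has no entry
    in the table. Dropping such genes is an assumption of this sketch.
    """
    mapped, keep = [], []
    for gene_id in gene_ids:
        if gene_id.startswith("ENSMUSG"):
            human_id = mouse_to_human.get(gene_id)
            mapped.append(human_id)
            keep.append(human_id is not None)
        else:
            mapped.append(gene_id)
            keep.append(True)
    return mapped, keep


def total_counts(count_rows):
    """Per-cell sum of counts, i.e. the kind of value stored in obs["n_counts"].

    Summing over all genes (before any filtering) is an assumption of this sketch.
    """
    return [sum(row) for row in count_rows]


if __name__ == "__main__":
    # Placeholder IDs for illustration only; they are not real Ensembl IDs or orthologs.
    table = {"ENSMUSG_EXAMPLE_1": "ENSG_EXAMPLE_1"}
    ids = ["ENSMUSG_EXAMPLE_1", "ENSMUSG_EXAMPLE_2", "ENSG_EXAMPLE_3"]
    mapped, keep = map_mouse_to_human(ids, table)
    print(mapped, keep)

    counts = [[1, 2, 3], [0, 0, 5]]  # rows are cells, columns are genes
    print(total_counts(counts))
```

The keep mask makes it explicit which genes would be lost in translation, which is worth inspecting before trusting embeddings computed from mouse data.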

The main official workflow is get_embeddings.py.

This path:

The repo's recommended downstream pattern is simple:

The README specifically highlights:

- Do not skip spatial information; SToFM is not just a transcriptome encoder.
- Do not treat hf.dataset alone as sufficient input for the full model; the SE(2) stage also needs spatial structure.
- Do not assume mouse genes can be used directly; the repo explicitly maps them into the human vocabulary.
- Do not run downstream heads on raw expression if the intended workflow is SToFM; generate official embeddings first.
- Do not assume the repo ships checkpoints; they are external downloads.