single-cell-foundation-model-scgpt

name: single-cell-foundation-model-scgpt
description: >
  Use this skill when a task involves the local scGPT project in /DATA/disk0/zhaosy/home/scGPT, especially scGPT preprocessing and binning, checkpoint vocabulary matching, cell embedding extraction, reference mapping, fine-tuning scGPT for integration or annotation, or using scGPT tutorials for GRN, perturbation, multiomics, and reference mapping workflows.

Use this skill for the local scGPT repository at /DATA/disk0/zhaosy/home/scGPT. It is the right choice when the task involves:

preparing AnnData inputs with scGPT's own preprocessing pipeline
matching genes to a pretrained scGPT vocabulary
tokenizing binned expression inputs for transformer models
extracting cell embeddings with pretrained checkpoints
fine-tuning scGPT for integration or annotation-style downstream tasks
understanding how scGPT expects binned values, special tokens, and batch labels
working through scGPT tutorials such as integration, annotation, GRN, perturbation, or reference mapping

Do not use this skill for generic Scanpy work that does not depend on scGPT checkpoints or tokenization.

Confirm the checkpoint directory contains args.json, vocab.json, and best_model.pt.
Decide whether the task is fine-tuning, embedding extraction, or tutorial-guided experimentation.
Run preprocessing before tokenization or embedding unless the input has already been prepared for scGPT.
Check vocabulary overlap before spending time on training or inference.
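
The first and last checks above can be sketched with the standard library alone. The helper names `check_checkpoint_dir` and `vocab_overlap` are hypothetical (they are not part of scGPT), and the sketch assumes vocab.json is a flat JSON object mapping gene tokens to integer indices.

```python
import json
from pathlib import Path

# File names taken from the checklist above.
REQUIRED_FILES = ("args.json", "vocab.json", "best_model.pt")


def check_checkpoint_dir(model_dir):
    """Confirm the required files exist, then return the parsed args.json."""
    model_dir = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"{model_dir} is missing: {', '.join(missing)}")
    with open(model_dir / "args.json") as fh:
        return json.load(fh)


def vocab_overlap(genes, vocab_path):
    """Split gene identifiers into those present in vocab.json and those absent."""
    with open(vocab_path) as fh:
        vocab = json.load(fh)  # assumption: a {token: index} mapping
    matched = [g for g in genes if g in vocab]
    unmatched = [g for g in genes if g not in vocab]
    return matched, unmatched
```

Calling `vocab_overlap(adata.var_names, model_dir / "vocab.json")` before any training or inference run shows how many genes would be dropped. A very short matched list usually means the identifiers are of the wrong type, for example Ensembl IDs where the vocabulary holds gene symbols.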

The core preprocessing path in this repo is scgpt.preprocess.Preprocessor. Typical steps include:
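
As a hedged sketch, the upstream scGPT tutorials configure the preprocessor roughly as shown here. The keyword names follow those tutorials as recalled and should be verified against scgpt/preprocess.py in the local checkout. The numeric values are illustrative, and `preprocess_for_scgpt` is a hypothetical wrapper, not a function in the repo.

```python
# Assumed keyword arguments, following the upstream scGPT tutorials; verify them
# against scgpt/preprocess.py in the local checkout before relying on them.
PREPROCESS_KWARGS = dict(
    use_key="X",                   # where the input counts live
    filter_gene_by_counts=3,       # drop genes with very few counts
    filter_cell_by_counts=False,   # leave cell filtering off
    normalize_total=1e4,           # normalize each cell to this total
    result_normed_key="X_normed",  # layer that receives normalized values
    log1p=True,                    # log-transform the normalized values
    result_log1p_key="X_log1p",    # layer that receives logged values
    subset_hvg=False,              # optionally keep only highly variable genes
    binning=51,                    # number of bins (illustrative value)
    result_binned_key="X_binned",  # layer that receives binned values
)


def preprocess_for_scgpt(adata, batch_key=None, **overrides):
    """Run scGPT preprocessing in place on an AnnData object."""
    # Imported lazily so this sketch can be read and loaded without scGPT installed.
    from scgpt.preprocess import Preprocessor

    preprocessor = Preprocessor(**{**PREPROCESS_KWARGS, **overrides})
    preprocessor(adata, batch_key=batch_key)
    return adata
```

Keeping the result keys explicit makes it easier to track which layer (raw, normalized, logged, or binned) each later stage reads from.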

The clearest end-to-end example in the local repo is examples/finetune_integration.py. It demonstrates:

If the user asks "how should I use scGPT on my AnnData?", this example is often the best starting point.

Use scgpt.tasks.cell_emb.embed_data(...) or related functions when the goal is to generate cell embeddings from a pretrained model directory.

This path:

The local repo ships tutorials for integration, annotation, GRN, perturbation, multiomics, and reference mapping.

Use them when the user wants the project-supported path instead of building a custom pipeline from scratch.

Do not assume arbitrary gene identifiers will work. scGPT inference depends on the checkpoint vocabulary.
Do not skip binning if the target workflow expects X_binned.
Do not mix raw, normalized, and logged layers casually; keep track of which layer is used at each stage.
Do not assume all checkpoints use the same configuration; always read args.json.
When extracting embeddings, verify the gene_col used to map genes into the vocabulary.
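
For the last point, one way to verify gene_col is to score each candidate column of adata.var against the checkpoint vocabulary before embedding. This is a minimal sketch: `rank_gene_cols` and `embed_cells` are hypothetical helpers, and the `embed_data(adata, model_dir, gene_col=...)` call reflects the upstream signature as recalled, so check it against scgpt/tasks/cell_emb.py in the local repo.

```python
def rank_gene_cols(columns, vocab):
    """Rank candidate gene columns by the fraction of entries found in vocab.

    `columns` maps a column name to its gene identifiers; `vocab` is the
    parsed vocab.json (assumed to be a {token: index} mapping).
    """
    scores = {}
    for name, values in columns.items():
        values = list(values)
        hits = sum(1 for v in values if v in vocab)
        scores[name] = hits / len(values) if values else 0.0
    # Highest overlap first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def embed_cells(adata, model_dir, gene_col):
    """Thin wrapper around scGPT's embedding helper (requires scGPT installed)."""
    from scgpt.tasks.cell_emb import embed_data  # lazy import

    return embed_data(adata, model_dir, gene_col=gene_col)
```

In practice the candidates might be built as `{c: adata.var[c] for c in ("feature_name", "gene_symbols")}`, where those column names are only examples. The top-ranked column is then passed to `embed_cells`.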