name: hvantk:resource-cptac-expression
description: Build an AnnData (samples x genes) from long-format CPTAC protein-expression matrices plus sample metadata.
status: provisional
backend: anndata
domain: proteomics
CPTAC protein expression resource skill
Read hvantk/skills/_conventions/SKILL.md first. This skill assumes every convention there.
1. Status & scope
- Status: provisional. The builder and a drift-probe placeholder live under the plugin folder. Builder round-trip snapshots are NOT yet seeded; the fixture directory exists but is empty. Plugin tests (drift-probe sanity) live in hvantk/skills/cptac/expression/tests/.
- In scope: one long-format CPTAC protein-expression TSV/CSV at a time, paired with a sample metadata file. Output is an AnnData (.h5ad) keyed samples x genes with float32 X. The user is responsible for staging the inputs -- there is no automated downloader for the expression matrix today (the cptac Python package fetches into memory for the phospho path; mirroring that for expression is a follow-up).
- Out of scope: phosphoproteomics (sibling dataset cptac:phospho); per-cancer-type batch builds (callers loop themselves); any Hail-Table or wide-format representation.
2. Source identity
- Provider: Clinical Proteomic Tumor Analysis Consortium (CPTAC) (https://proteomics.cancer.gov/programs/cptac).
- Catalog entry: TODO. No CPTAC entry exists yet in any plugin's catalog/datasets.json (verify with hvantk catalog search cptac). The cancer-type enum + class map live in hvantk/skills/cptac/shared/constants.py (CPTAC_CANCER_TYPES, CPTAC_CANCER_CLASS_MAP). When a dedicated catalog entry lands under hvantk/skills/cptac/catalog/datasets.json, point this plugin's source.catalog_ref at it and remove this TODO.
3. Backend choice + reasoning
backend: anndata, domain: proteomics. Per _conventions § 3, AnnData is the right fit for sparse-ish sample x feature matrices keyed by sample identifier. CPTAC protein expression is samples x genes, fits in memory for any single cancer cohort, and pairs naturally with sample-level clinical metadata in obs. A Hail Table would force a Spark session for a sub-100 MB matrix and lose the obs ergonomics.
4. Raw format & gotchas
The builder accepts a long-format table (one row per (sample, gene) observation) and pivots it to wide:
- Required columns (defaults): GeneID, SampleID, Expression. Override via gene_id_col, sample_id_col, expression_col.
- Optional: Gene Name (column name configurable via gene_name_col) -- if present, it is propagated to var.
- Delimiter is auto-detected by pd.read_csv(sep=None, engine="python").
The metadata file is read with the same auto-detect logic and indexed by SampleID (override via sample_id_col). Any extra columns become obs columns via a left join on wide.index. As a result, samples that appear in the metadata but not in the expression matrix are silently dropped, while expression samples with no metadata row are kept with missing values in those columns.
X is coerced to np.float32. Missing values from the pivot become NaN and are preserved in X.
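The pivot, float32 coercion, and left join described above can be sketched on a toy table. This is an illustration in plain pandas, not the code of create_anndata_from_cptac_long. The toy values and the tumor_stage column are made up, and whether the real helper uses pivot or pivot_table (which differ on duplicate (sample, gene) rows) is not stated in this document.

```python
import numpy as np
import pandas as pd

# Toy long-format table: one row per (sample, gene) observation.
# Column names follow the documented defaults; the values are made up.
long_df = pd.DataFrame(
    {
        "GeneID": ["G1", "G2", "G1"],
        "SampleID": ["S1", "S1", "S2"],
        "Expression": [1.5, 2.0, 0.5],
    }
)

# Toy metadata, indexed by sample id. S3 has no expression rows.
meta_df = pd.DataFrame(
    {
        "SampleID": ["S1", "S2", "S3"],
        "tumor_stage": ["I", "II", "III"],  # hypothetical metadata column
    }
).set_index("SampleID")

# Pivot to samples x genes. (S2, G2) was never observed, so it becomes NaN.
# Assumption: DataFrame.pivot is used here; it raises on duplicate pairs.
wide = long_df.pivot(index="SampleID", columns="GeneID", values="Expression")

# Coerce to float32; NaN survives the cast.
X = wide.to_numpy(dtype=np.float32)

# Left join on wide.index: every expression sample is kept,
# and the metadata-only sample S3 is dropped.
obs = pd.DataFrame(index=wide.index).join(meta_df, how="left")
```

Here X has shape (2, 2) with a single NaN at (S2, G2), and obs contains only S1 and S2.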
5. Output contract
- Artifact: an ExpressionMatrix (AnnData-backed) returned by the builder; the platform persists it to .h5ad at the --output path the caller passes to hvantk reprocess.
- Shape: (n_samples, n_genes) -- samples in obs, genes in var.
- X: float32, missing values as NaN.
- obs: indexed by sample id; columns mirror the metadata file (after set_index).
- var: indexed by gene id (name = gene_id_col); contains gene_name_col if it was present in the input.
- uns["column_summary"]: added by annotate_column_summary_ad.
- Provenance: stamped on the returned ExpressionMatrix via ctx.provenance(schema_id="cptac-expression-v1"); persisted by the platform as a sidecar .provenance.json next to the .h5ad.
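A quick way to spot-check a built artifact against this contract is a small duck-typed checker. The sketch below is hypothetical and not part of hvantk: contract_violations is an illustrative name, and it assumes the object passed in (the underlying AnnData, or anything exposing X, obs, var, and uns) behaves like AnnData for those four attributes. The demo uses a SimpleNamespace stand-in so the example runs without anndata installed.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def contract_violations(adata, gene_name_col=None):
    """Return human-readable violations of the output contract (hypothetical helper)."""
    problems = []
    n_samples, n_genes = adata.X.shape
    if adata.X.dtype != np.float32:
        problems.append(f"X dtype is {adata.X.dtype}, expected float32")
    if len(adata.obs) != n_samples:
        problems.append("obs row count does not match X rows (samples)")
    if len(adata.var) != n_genes:
        problems.append("var row count does not match X columns (genes)")
    if gene_name_col is not None and gene_name_col not in adata.var.columns:
        problems.append(f"var is missing column {gene_name_col!r}")
    if "column_summary" not in adata.uns:
        problems.append('uns["column_summary"] is missing')
    return problems


# Stand-in with AnnData-like attribute names; all values are made up.
toy = SimpleNamespace(
    X=np.array([[1.5, np.nan], [0.5, 2.0]], dtype=np.float32),
    obs=pd.DataFrame({"tumor_stage": ["I", "II"]}, index=["S1", "S2"]),
    var=pd.DataFrame({"Gene Name": ["TP53", "EGFR"]}, index=["G1", "G2"]),
    uns={},  # column_summary deliberately absent
)
violations = contract_violations(toy, gene_name_col="Gene Name")
```

The checker reports only the missing uns entry for this toy object; NaN values in X are allowed by the contract and are not flagged.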
6. hvantk integration points
- Builder: build_cptac_expression(parsed_input, ctx, **params) in hvantk/skills/cptac/expression/builder.py. parsed_input is a {"expression": <tsv>, "metadata": <tsv>} dict; params are gene_id_col, gene_name_col, sample_id_col, expression_col. Returns an ExpressionMatrix.
- Shared helpers: create_anndata_from_cptac_long, _parse_site_id in hvantk/skills/cptac/shared/cptac.py (shared with the sibling phospho builder).
- Drift probe: fetch_fingerprint in hvantk/skills/cptac/expression/drift_probe.py (fingerprints the installed cptac Python package version).
- Plugin manifest: hvantk/skills/cptac/plugin.yaml (compound dataset key cptac:expression). The loader auto-resolves the dataset via get_registry().get_dataset("cptac:expression"); top-level builds run through run_builder_for_spec (hvantk/core/plugin/run_builder.py).
- Tests: drift-probe sanity in hvantk/skills/cptac/expression/tests/test_drift_probe.py.
- CLI: end-to-end hvantk reprocess cptac:expression is not yet wired -- the manifest does not declare a lifecycle.parse stage to produce the {"expression": <tsv>, "metadata": <tsv>} parsed_input dict. Until that lands, build via the Python API (build_cptac_expression(parsed_input, ctx, gene_id_col=…, sample_id_col=…, expression_col=…)).
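Since the CLI path is not wired, callers have to assemble the builder arguments themselves. The sketch below shows one way to do that. prepare_builder_call is a hypothetical helper, not part of hvantk; the file paths are placeholders; it assumes that the parsed_input values are file paths and that the builder's defaults match the column names listed in § 4. How ctx is constructed is not covered by this document, so the actual builder call is shown only as a comment.

```python
# Column defaults as listed in § 4. Assumption: the builder's own
# defaults match these; check builder.py before relying on them.
DEFAULT_COLUMNS = {
    "gene_id_col": "GeneID",
    "gene_name_col": "Gene Name",
    "sample_id_col": "SampleID",
    "expression_col": "Expression",
}


def prepare_builder_call(expression_path, metadata_path, **column_overrides):
    """Return (parsed_input, params) for build_cptac_expression (hypothetical helper)."""
    unknown = set(column_overrides) - set(DEFAULT_COLUMNS)
    if unknown:
        raise ValueError(f"unknown column parameters: {sorted(unknown)}")
    # Assumption: the builder accepts file paths for both entries.
    parsed_input = {
        "expression": str(expression_path),
        "metadata": str(metadata_path),
    }
    params = {**DEFAULT_COLUMNS, **column_overrides}
    return parsed_input, params


# Placeholder paths and a made-up override, for illustration only.
parsed_input, params = prepare_builder_call(
    "staged/expression.tsv", "staged/metadata.tsv", gene_id_col="ensembl_id"
)

# With hvantk installed and a platform-provided ctx, the call would then be:
#   from hvantk.skills.cptac.expression.builder import build_cptac_expression
#   matrix = build_cptac_expression(parsed_input, ctx, **params)
```

Rejecting unknown keyword names up front keeps a typo such as sample_col from being silently ignored or surfacing later as a confusing error.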
7. Workflow steps
When invoked to build a CPTAC protein-expression AnnData:
- Stage inputs. Long-format expression TSV/CSV + sample metadata TSV/CSV. (Today the user produces these manually; an automated downloader for expression is a follow-up -- see cptac:phospho for the in-package fetcher pattern.)
- Build. Import and call build_cptac_expression(parsed_input, ctx, **params) directly -- see § 6 on why hvantk reprocess cptac:expression is not wired yet.
- Validate. pytest hvantk/skills/cptac/expression/tests (drift-probe sanity).
8. Update playbook
CPTAC ships new cohorts and re-processed runs via the cptac Python package. When a new package release lands:
- Run the drift probe: python -c "from hvantk.skills.cptac.expression.drift_probe import fetch_fingerprint; print(fetch_fingerprint())". A change in source_version (the installed cptac package version) is the trigger.
- Re-stage / re-pull the expression matrix for any cohort of interest using the package's get_proteomics API (or whatever the new release exposes).
- Re-build and spot-check sample / gene counts against the previous run.
- Regenerate tests/drift_fingerprint.json once the package version is updated.
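The trigger check and the count spot-check in this playbook can be sketched with two small helpers. Both are hypothetical and not part of hvantk. The sketch assumes the fingerprint is a JSON object (a dict once loaded) with a source_version key; the rest of the fingerprint shape is a placeholder per § 9 and is ignored here. The version strings and shapes in the demo are made up.

```python
def source_version_changed(stored, current):
    """True when the stored and fresh fingerprints disagree on source_version."""
    return stored.get("source_version") != current.get("source_version")


def count_delta(previous_shape, new_shape):
    """Return (sample_delta, gene_delta) between two (n_samples, n_genes) shapes."""
    return (
        new_shape[0] - previous_shape[0],
        new_shape[1] - previous_shape[1],
    )


# Made-up fingerprints. In practice `stored` would come from
# json.load() on tests/drift_fingerprint.json and `current` from
# fetch_fingerprint().
stored = {"source_version": "1.0.0"}
current = {"source_version": "1.1.0"}
needs_rebuild = source_version_changed(stored, current)

# Made-up shapes from a previous and a new build of the same cohort.
sample_delta, gene_delta = count_delta((100, 9000), (110, 9050))
```

A nonzero delta is not an error in itself, since new releases can add samples or genes. It is a prompt to look at what changed before regenerating the fingerprint.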
9. Validation contract
Per _conventions § 9:
- fixture: hvantk/skills/cptac/expression/tests/testdata/raw/cptac-expression/ (declared in plugin.yaml; not yet seeded).
- schema_snapshot: hvantk/skills/cptac/expression/tests/snapshots/schema.json (TODO -- created on first --regenerate-snapshots run).
- row_snapshot: hvantk/skills/cptac/expression/tests/snapshots/sample_rows.json (TODO -- same).
- test_command: pytest hvantk/skills/cptac/expression/tests.
- drift_fingerprint: hvantk/skills/cptac/expression/tests/drift_fingerprint.json (placeholder shape; refresh via the update playbook).
The plugin manifest declares these paths so the loader contract holds. Drift-probe sanity test passes today; the builder round-trip snapshot is the gap to close in a follow-up.
Snapshot status: schema.json and sample_rows.json have NOT yet been seeded for this plugin. On first round-trip run, use pytest hvantk/skills/cptac/expression/tests --regenerate-snapshots to bootstrap them, then commit. Until seeded, the round-trip test cannot verify output against a fixed schema.