name: hvantk:resource-msigdb
description: Build a Hail Table from an MSigDB GMT gene-set file (e.g., C2 Canonical Pathways) for enrichment / burden / overlap analyses.
status: provisional
backend: hail
domain: mapping
MSigDB (Molecular Signatures Database)
Read hvantk/skills/_conventions/SKILL.md first. This skill assumes every convention there.
1. Status & scope
- Status: provisional. The builder, fixture, snapshots, and round-trip test exist; this skill is the design contract and update reference.
- In scope: any GMT file from MSigDB → single Hail Table keyed by set_name. Verified against the C2 Canonical Pathways human gene-symbols collection (c2.cp.v2026.1.Hs.symbols.gmt, 4,115 sets, 1.6 MB).
- Out of scope: downloader (MSigDB requires login + license click-through, so per _conventions § 11 acquisition is manual); other MSigDB collections (H, C1, C3-C8): the builder is collection-agnostic, but each collection should be tracked in the catalog separately if onboarded; gene-symbol normalization / alias resolution (use the HGNCGeneCatalogStreamer in hvantk/skills/hgnc/streamers.py downstream); cross-format variants beyond GMT (GMX, XML).
2. Source identity
- Provider: Broad Institute MSigDB. Variant pinned by this skill's catalog entry: C2 / CP / human / gene symbols / v2026.1.
- Catalog entry: present. hvantk/resources/registry/genomics/datasets.json contains MSigDB_C2_CP_v2026.1.Hs.symbols (surfaced via hvantk catalog show MSigDB_C2_CP_v2026.1.Hs.symbols). URLs / cadence / license / citation live in the registry entry, not here.
Stable note (not in catalog): MSigDB ships per-collection GMT files. The GMT format is the same across collections, so this builder works for any MSigDB GMT, but each onboarded collection needs its own catalog entry to record license / version / file path.
3. Backend choice + reasoning
backend: hail, domain: mapping. A C2 CP GMT is ~4k rows × variable-width gene columns. _conventions § 3 "Lookup / mapping" allows either a Hail Table or a pandas DataFrame. Hail wins here because downstream consumers (enrichment / burden / overlap, e.g., hvantk/enrichex/) join against Hail Tables keyed on gene symbols, so producing a Hail Table avoids re-materialization at every join site, mirroring the HGNC decision. Key by set_name (string, unique-in-file).
Catalog placement note: this skill's catalog entry lives in registry/genomics/datasets.json, not a dedicated mapping/ registry directory (no such directory exists today). The decision is consistent with GWAS_Catalog_v1.0_* (also gene-symbol / variant-adjacent annotation curation) and avoids invasive changes to hvantk/resources/unified_registry.py. Revisit if a mapping/ domain is later introduced.
4. Raw format & gotchas
GMT is tab-separated with variable-width rows:
- Column 1: set_name (string, unique within a single GMT, e.g., KEGG_APOPTOSIS).
- Column 2: description text. For MSigDB-issued GMTs this is always a gsea-msigdb URL like https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KEGG_APOPTOSIS. The spec allows arbitrary text, so the builder preserves it as-is in source_url without parsing. Confirmed against the local 4,115-row source: 100% of rows had a https://www.gsea-msigdb.org/ URL in column 2.
- Columns 3..N: gene members. For human gene-symbol GMTs (.Hs.symbols.gmt), these are HGNC-approved gene symbols. For other variants (.entrez.gmt, .Mm.symbols.gmt) the contents differ; the builder is symbol-agnostic and simply carries strings.
Variable-width-row gotcha (this is the key deviation): hl.import_table rejects rows with inconsistent column counts. Instead, the builder uses hl.import_lines (one row per line, single text: str field) and splits on \t inside the transform. Gene members are sliced as parts[2:] into an array<str>. Set lengths in the source range from 5 (MSigDB minimum) to 1,497 genes; Hail arrays handle this without issue.
Empty-line gotcha: hl.import_lines does yield rows for blank lines (with text == ""). As a defensive guard, the transform drops rows where set_name == "" after the split (i.e., it keeps only rows with set_name != "").
No type coercions needed — every field is string or array<str> by design.
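The line-oriented parsing described above can be illustrated without Hail. The following is a minimal pure-Python sketch, not the builder's code: the names GeneSet and parse_gmt_lines are hypothetical, and raising on a row with fewer than two columns is an assumption of the sketch (the source does not say how the builder treats such rows).

```python
from typing import Iterable, Iterator, List, NamedTuple


class GeneSet(NamedTuple):
    """One GMT row (hypothetical helper type for this sketch)."""
    set_name: str      # GMT column 1
    source_url: str    # GMT column 2, kept verbatim
    genes: List[str]   # GMT columns 3..N, source order preserved


def parse_gmt_lines(lines: Iterable[str]) -> Iterator[GeneSet]:
    """Split each line on tabs and slice the variable-width tail."""
    for raw in lines:
        parts = raw.rstrip("\r\n").split("\t")
        set_name = parts[0]
        if set_name == "":
            # Blank lines split to [""], so set_name is empty: skip them.
            continue
        if len(parts) < 2:
            # Assumption: treat a row without column 2 as malformed.
            raise ValueError(f"GMT row {set_name!r} has no description column")
        yield GeneSet(set_name, parts[1], parts[2:])


# Placeholder rows (not real MSigDB content) with different widths.
example = [
    "SET_A\thttps://example.org/SET_A\tGENE1\tGENE2\tGENE3\n",
    "\n",
    "SET_B\thttps://example.org/SET_B\tGENE2\n",
]
parsed = list(parse_gmt_lines(example))
```

Because every row is split independently, rows of different widths coexist without any fixed column schema, which is the same property the builder gets from importing lines rather than a table.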
5. Output contract
- Object: an AnnotationTable (hvantk/core/models) wrapping a Hail Table, built via AnnotationTable.from_hail(...). The reprocess runner checkpoints it to --output (a .ht directory).
- Key: [set_name] (string, unique-in-table).
- Provenance: stamped via ctx.provenance(schema_id="msigdb-genesets-v1"); persisted as a sidecar .provenance.json.
- Fields:
  - set_name: str. Gene-set identifier (e.g., KEGG_APOPTOSIS).
  - source_url: str. GMT column 2, verbatim. For MSigDB-issued GMTs this is a https://www.gsea-msigdb.org/... URL.
  - genes: array<str>. Gene members. Order is preserved from the source file.
- Reference genome: N/A. Gene-set membership is genome-independent; the catalog entry records GRCh38 only because the schema requires it.
Per _conventions § 9, set names are unique-in-table for a single GMT, so no sample_keys.json is maintained — the snapshot test reads keys directly from a small inline list. The post-#101 conventions §9 "unique key, no workaround" rule applies cleanly here.
6. hvantk integration points
- Builder: build_msigdb_genesets in hvantk/skills/msigdb/builder.py. Signature: (parsed_input, ctx) -> AnnotationTable. The table is built inline (hl.import_lines + split/select/key_by); there is no _create_table_base helper and no output_path / overwrite kwargs. The shared temp helper, if needed, is cleanup_temp_file in hvantk/core/utils/hail_helpers.py.
- Registry: declared via the plugin manifest at hvantk/skills/msigdb/plugin.yaml (dataset genesets). The plugin loader (hvantk/core/plugin/loader.py) auto-resolves it from the manifest via get_registry().get_dataset("msigdb:genesets"); there is no TABLE_BUILDERS / MATRIX_BUILDERS registry or adapter. Top-level builds run through run_builder_for_spec (hvantk/core/plugin/run_builder.py).
- CLI: hvantk reprocess msigdb:genesets --raw-dir <dir> --output <path>.ht --skip-download (msigdb declares no lifecycle.download, so --skip-download is always required; <dir> must contain the unzipped .gmt). The builder takes no --plugin-arg params and no reference-genome arg, since gene-set membership is genome-independent.
- Catalog wiring: see § 2. Downloader: out of scope (manual acquisition).
7. Workflow steps
- Resolve raw path. parsed_input is the unzipped .gmt path (resolved by the reprocess runner from --raw-dir). Acquire from https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp (login required).
- Import. hl.import_lines(paths=str(parsed_input), min_partitions=4) yields one row per line with text: str.
- Transform (inline in build_msigdb_genesets):
  - parts = ht.text.split("\t").
  - set_name = parts[0], source_url = parts[1], genes = parts[2:] (Hail array slice).
  - ht.filter(ht.set_name != ""), defensive against blank lines.
  - ht.key_by("set_name").
- Wrap + return. AnnotationTable.from_hail(ht, provenance=ctx.provenance(schema_id="msigdb-genesets-v1")). Checkpointing to --output and provenance sidecar are handled by the reprocess runner.
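The import and transform steps above might look roughly like the following sketch. It is not the contents of builder.py: the function name gmt_to_hail_table is hypothetical, the AnnotationTable wrapping and provenance stamping are omitted, and dropping blank lines before indexing parts[1] is an assumption added here so that a one-element split result is never indexed out of bounds.

```python
def gmt_to_hail_table(gmt_path: str, min_partitions: int = 4):
    """Sketch: GMT file -> Hail Table keyed by set_name (hypothetical helper)."""
    # Imported lazily so this module loads without a Hail/JVM install.
    import hail as hl

    # One row per line; the line content is in the `text` field.
    ht = hl.import_lines(gmt_path, min_partitions=min_partitions)

    # Assumption of this sketch: remove blank lines first, so that
    # parts[1] below is never evaluated on a one-element array.
    ht = ht.filter(ht.text != "")

    parts = ht.text.split("\t")
    ht = ht.select(
        set_name=parts[0],    # GMT column 1
        source_url=parts[1],  # GMT column 2, verbatim
        genes=parts[2:],      # variable-width tail as array<str>
    )

    # Defensive guard from the workflow steps: no empty set names.
    ht = ht.filter(ht.set_name != "")
    return ht.key_by("set_name")
```

In the real builder the resulting table would then be passed to AnnotationTable.from_hail with the provenance from ctx, as described in the final step.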
8. Update playbook
MSigDB releases ~annually (versioned v<year>.<n>, e.g., v2026.1, v2025.1). Per release:
- Acquire the new GMT (manual download). Update the file path / version in the corresponding registry/genomics/datasets.json entry; bump accession (MSigDB_C2_CP_v2026.1.Hs.symbols → MSigDB_C2_CP_v2027.1.Hs.symbols).
- Re-run the round-trip test (§ 9). If it passes, no builder change.
- The GMT format has been stable for ~15 years; column 1 / column 2 / variable-tail shape has not changed. If MSigDB ever changes the description column (column 2) away from a URL, the source_url field name becomes misleading: rename to description and update this skill.
- To onboard a different collection (e.g., C5 GO, H Hallmark): add a new catalog entry with the new accession; the same build_msigdb_genesets builder works without modification. Add a parallel fixture and snapshot directory if the new collection has structural quirks (e.g., GMTs with embedded null bytes).
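One way to notice the column-2 change mentioned above is a quick pre-flight check on a newly downloaded GMT. The sketch below is a suggestion rather than part of the skill: the function name rows_without_msigdb_url is hypothetical, and the prefix constant simply restates the URL prefix observed in the v2026.1 source.

```python
from typing import Iterable, List

# URL prefix observed in column 2 of the v2026.1 C2 CP source.
MSIGDB_URL_PREFIX = "https://www.gsea-msigdb.org/"


def rows_without_msigdb_url(lines: Iterable[str]) -> List[str]:
    """Return the set names whose GMT column 2 is not a gsea-msigdb URL."""
    offenders = []
    for raw in lines:
        parts = raw.rstrip("\r\n").split("\t")
        if parts[0] == "":
            continue  # blank line
        description = parts[1] if len(parts) > 1 else ""
        if not description.startswith(MSIGDB_URL_PREFIX):
            offenders.append(parts[0])
    return offenders


# Placeholder rows (not real MSigDB content).
sample = [
    "SET_A\thttps://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SET_A\tGENE1\n",
    "SET_B\tfree-text description\tGENE2\tGENE3\n",
]
flagged = rows_without_msigdb_url(sample)
```

A non-empty result on a new release would be the signal to revisit the source_url field name.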
9. Validation contract
Per _conventions § 9:
- fixture: hvantk/skills/msigdb/tests/testdata/raw/msigdb/c2.cp-sample.gmt. 20 gene sets, ~24 KB, sampled from the v2026.1 C2 CP source by picking representative rows by line index (the GMT format is line-oriented, so a deterministic line subset is a valid sub-GMT). Exercises the short edge (size 5: BIOCARTA, SA), medium sets (60-330 genes), a long set (REACTOME_CELL_CYCLE, 688 genes), and the extra-long tail (REACTOME_POST_TRANSLATIONAL_PROTEIN_MODIFICATION, 1,497 genes). All 20 fixture rows have a https://www.gsea-msigdb.org/ URL in column 2 (matches the live-file invariant).
- schema_snapshot: hvantk/skills/msigdb/tests/snapshots/schema.json.
- row_snapshot: hvantk/skills/msigdb/tests/snapshots/sample_rows.json. set_name keys are unique-in-table, so no sample_keys.json is maintained per _conventions § 9 (post-#101). The round-trip test inlines the small key list.
- test_command: pytest hvantk/skills/msigdb/tests -m hail.
Round-trip test asserts: the built AnnotationTable schema matches schema.json; deterministic sorted row slice matches sample_rows.json. Regenerate snapshots when the schema changes (rare — see § 8).
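The two assertions can be pictured with a small Hail-free sketch. This is not the actual test: the helper names are hypothetical, and the JSON layout of schema.json and sample_rows.json (a field-to-type mapping and a list of row objects) is assumed for illustration only.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, List


def sorted_row_slice(rows: Iterable[Dict[str, Any]],
                     keys: Iterable[str]) -> List[Dict[str, Any]]:
    """Pick the rows for the inlined key list, sorted by set_name."""
    wanted = set(keys)
    picked = [row for row in rows if row["set_name"] in wanted]
    return sorted(picked, key=lambda row: row["set_name"])


def assert_matches_snapshots(schema: Dict[str, str],
                             rows: Iterable[Dict[str, Any]],
                             keys: Iterable[str],
                             snapshot_dir: Path) -> None:
    """Compare a built table's schema and row slice to the stored snapshots."""
    # Assumed layout: schema.json maps field name -> type string.
    expected_schema = json.loads((snapshot_dir / "schema.json").read_text())
    # Assumed layout: sample_rows.json is a list of rows sorted by set_name.
    expected_rows = json.loads((snapshot_dir / "sample_rows.json").read_text())
    assert schema == expected_schema, "schema differs from schema.json"
    assert sorted_row_slice(rows, keys) == expected_rows, \
        "row slice differs from sample_rows.json"
```

Sorting by set_name before comparing keeps the row check independent of partition order, which is what makes the slice deterministic.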