name: hvantk:resource-clingen
description: Onboard, build, or update the ClinGen Gene-Disease Validity resource for hvantk
status: provisional
backend: hail
domain: genomics
ClinGen Gene-Disease Validity resource skill
Read hvantk/skills/_conventions/SKILL.md first. This skill assumes its repository map, helpers, keying conventions, builder pattern, and validation contract.
1. Status & scope
This skill covers BUILD and UPDATE of the ClinGen Gene-Disease Validity Hail Table. It does NOT cover download (see hvantk/skills/clingen/cli.py) or downstream streamer logic. The per-plugin streamer subclass lives in hvantk/skills/clingen/streamers.py (ClinGenGeneDiseaseTableStreamer), built on the shared base GeneDiseaseTableStreamer in hvantk/core/streamers/gene_disease_table.py.
2. Source identity
Provider metadata (URL, license, citation) lives in the plugin's catalog/datasets.json (or query it with hvantk catalog show <accession> once a ClinGen catalog entry exists). The URL/version constants live in hvantk/skills/clingen/shared/constants.py (CLINGEN_BASE_URL, CLINGEN_DOWNLOADS_URL, CLINGEN_FILE_PREFIX, CLINGEN_HEADER_SKIP_LINES). Read those files; do not restate.
Stable provider notes the catalog will not capture:
- ClinGen serves a single rolling Gene-Disease Validity CSV; there are no dated archives. Freshness is determined by the file's HTTP Last-Modified header (when present) and by the values in the leading metadata block.
- The CSV is generated on demand from the live curation database, so two downloads minutes apart may differ.
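The freshness rule above can be sketched as a small header check. This is a minimal sketch, not the plugin's downloader: the function names `snapshot_timestamp` and `is_newer` are hypothetical, and `headers` is assumed to be any mapping of response headers (a plain dict is case-sensitive, so the key must be spelled exactly `Last-Modified`).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def snapshot_timestamp(headers):
    """Return the Last-Modified header as a datetime, or None if absent or unparseable."""
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


def is_newer(remote_headers, local_timestamp):
    """Treat the remote file as newer whenever freshness cannot be established."""
    remote = snapshot_timestamp(remote_headers)
    if remote is None or local_timestamp is None:
        return True
    return remote > local_timestamp


# Illustrative header value, not a real ClinGen response.
example_headers = {"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}
example_ts = snapshot_timestamp(example_headers)
previous = datetime(2015, 10, 20, tzinfo=timezone.utc)
print(example_ts, is_newer(example_headers, previous))
```

Because the header is optional, the sketch falls back to "assume newer" rather than skipping a download; the metadata block remains the second source of truth.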
3. Backend choice + reasoning
hail. ClinGen is small (~5k rows) and _conventions § 3 allows pandas for mapping tables, but every current consumer (ClinGenStreamer, gene-disease join points in PSROC, recipe-driven multi-table joins) reads it as a Hail Table — producing one avoids redundant materialization at every join site.
4. Raw format & gotchas
- File: comma-separated, double-quoted fields, 6-line metadata header before the column-header line (CLINGEN_HEADER_SKIP_LINES = 6). The column header begins with "GENE SYMBOL". Separator rows containing ++++++ are interleaved with the metadata block.
- Preprocessing: the builder streams the file via hl.hadoop_open and writes a cleaned temp CSV containing only the column header + data rows. This is required because hl.import_table cannot skip arbitrary leading metadata.
- Import: hl.import_table(delimiter=",", quote='"', impute=False, min_partitions=10). All fields stay as strings; no type inference.
- Field renaming is driven by CLINGEN_GENE_DISEASE_FIELDS (hvantk/skills/clingen/shared/constants.py). Notable renames: GENE SYMBOL → gene_symbol, GENE ID (HGNC) → hgnc_id, DISEASE LABEL → disease_label, DISEASE ID (MONDO) → mondo_id, MOI → mode_of_inheritance, CLASSIFICATION → classification, GCEP → gene_curation_expert_panel.
- ID prefix stripping: the builder strips the HGNC: and MONDO: prefixes from hgnc_id and mondo_id. This is asymmetric with the HGNC table (which keeps the prefix); downstream joins (e.g., clingen_streamer) account for this.
- Classification levels (CLINGEN_CLASSIFICATION_LEVELS, strongest first): Definitive, Strong, Moderate, Limited, Disputed, Refuted. The builder annotates classification_level as the numeric position (lower is stronger). Unknown values get len(CLINGEN_CLASSIFICATION_LEVELS).
- Keying: keyed by (hgnc_id, mondo_id).
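The cleaning and normalisation rules above can be illustrated without Hail. The following is a minimal pure-Python sketch, not the builder itself: `clean_lines`, `normalise`, and `parse` are hypothetical names, `LEVELS` and `RENAMES` are local copies of the values listed above, and the sample rows are made up (only the BRCA1 / HGNC:1100 pairing comes from this document). It assumes the column header is the first line starting with GENE SYMBOL and that separator rows may also appear after it.

```python
import csv

# Local copy of the levels listed above (strongest first).
LEVELS = ["Definitive", "Strong", "Moderate", "Limited", "Disputed", "Refuted"]

# Local copy of the notable renames listed above.
RENAMES = {
    "GENE SYMBOL": "gene_symbol",
    "GENE ID (HGNC)": "hgnc_id",
    "DISEASE LABEL": "disease_label",
    "DISEASE ID (MONDO)": "mondo_id",
    "MOI": "mode_of_inheritance",
    "CLASSIFICATION": "classification",
    "GCEP": "gene_curation_expert_panel",
}


def clean_lines(lines):
    """Yield the column header and data rows; drop metadata and separator rows."""
    in_table = False
    for line in lines:
        if not in_table:
            # Assumption: the header is the first line starting with GENE SYMBOL.
            if line.lstrip('"').startswith("GENE SYMBOL"):
                in_table = True
                yield line
            continue
        if line.strip() and "++++++" not in line:
            yield line


def strip_prefix(value, prefix):
    """Remove a leading prefix such as 'HGNC:' if present."""
    return value[len(prefix):] if value.startswith(prefix) else value


def normalise(row):
    """Rename columns, strip ID prefixes, and annotate classification_level."""
    out = {new: row[old] for old, new in RENAMES.items() if old in row}
    out["hgnc_id"] = strip_prefix(out["hgnc_id"], "HGNC:")
    out["mondo_id"] = strip_prefix(out["mondo_id"], "MONDO:")
    label = out["classification"]
    # Unknown labels sort after every known level.
    out["classification_level"] = LEVELS.index(label) if label in LEVELS else len(LEVELS)
    return out


def parse(lines):
    return [normalise(r) for r in csv.DictReader(clean_lines(lines))]


# Made-up sample; the MONDO IDs and the second row are placeholders.
SAMPLE = [
    "illustrative metadata line\n",
    '"++++++","++++++","++++++","++++++"\n',
    '"GENE SYMBOL","GENE ID (HGNC)","DISEASE ID (MONDO)","CLASSIFICATION"\n',
    '"++++++","++++++","++++++","++++++"\n',
    '"BRCA1","HGNC:1100","MONDO:0000000","Definitive"\n',
    '"GENEX","HGNC:0","MONDO:0000001","Unrecognised"\n',
]
rows = parse(SAMPLE)
print(rows)
```

All values stay strings except the derived `classification_level`, which matches the no-type-inference import described above.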
5. Output contract
Hail Table at <output_path>.ht, keyed by (hgnc_id, mondo_id).
Row schema includes: hgnc_id, gene_symbol, disease_label, mondo_id, mode_of_inheritance, sop, classification, classification_level, classification_date, gene_curation_expert_panel, report_url.
6. hvantk integration points
- Plugin manifest: hvantk/skills/clingen/plugin.yaml (the loader resolves this dataset via get_registry().get_dataset("clingen:gene-disease"); top-level builds run through run_builder_for_spec in hvantk/core/plugin/run_builder.py).
- Builder: build_clingen_gene_disease in hvantk/skills/clingen/builder.py. Signature (parsed_input, ctx, *, min_classification=None, fields=None) -> AnnotationTable. The table is built inline (hl.import_table + renames/transforms); the only shared helper is cleanup_temp_file from hvantk/core/utils/hail_helpers.py.
- Downloader CLI: download_cmd (Click clingen-download) in hvantk/skills/clingen/cli.py; lifecycle entry-point download_dataset(raw_dir=...) (declared under lifecycle.download in plugin.yaml).
- Dataset class: ClinGenGeneDiseaseDataset in hvantk/skills/clingen/shared/datasets.py.
- Build CLI: hvantk reprocess clingen:gene-disease --raw-dir <dir> --output <path>.ht (pass builder kwargs via --plugin-arg key=value, e.g. --plugin-arg min_classification=Strong).
- Streamer (per-plugin): ClinGenGeneDiseaseTableStreamer in hvantk/skills/clingen/streamers.py, subclass of GeneDiseaseTableStreamer (hvantk/core/streamers/gene_disease_table.py).
- Constants: CLINGEN_BASE_URL, CLINGEN_DOWNLOADS_URL, CLINGEN_FILE_PREFIX, CLINGEN_HEADER_SKIP_LINES, CLINGEN_GENE_DISEASE_FIELDS, CLINGEN_CLASSIFICATION_LEVELS in hvantk/skills/clingen/shared/constants.py.
- Tests: hvantk/skills/clingen/tests/test_downloader.py, test_drift_probe.py, test_streamer.py.
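To make the two builder keyword arguments concrete, here is a sketch of how min_classification and fields could behave on plain dictionaries. It is an illustration under assumptions, not the code in builder.py: the function name `apply_builder_options` is hypothetical, the threshold is assumed to be inclusive, and the key fields are assumed to be retained even when fields omits them.

```python
# Local copy of the classification order (strongest first).
LEVELS = ["Definitive", "Strong", "Moderate", "Limited", "Disputed", "Refuted"]
KEY_FIELDS = ("hgnc_id", "mondo_id")


def apply_builder_options(rows, min_classification=None, fields=None):
    """Filter rows by classification strength and optionally select columns."""
    rows = list(rows)
    if min_classification is not None:
        if min_classification not in LEVELS:
            raise ValueError(f"unknown classification: {min_classification!r}")
        cutoff = LEVELS.index(min_classification)
        # Lower level means stronger evidence; the cutoff itself is kept (assumption).
        rows = [r for r in rows if r["classification_level"] <= cutoff]
    if fields is not None:
        # Assumption: key fields always survive a column selection.
        keep = list(KEY_FIELDS) + [f for f in fields if f not in KEY_FIELDS]
        rows = [{f: r[f] for f in keep} for r in rows]
    return rows


# Made-up rows for illustration only.
table = [
    {"hgnc_id": "1", "mondo_id": "10", "gene_symbol": "A", "classification_level": 0},
    {"hgnc_id": "2", "mondo_id": "20", "gene_symbol": "B", "classification_level": 1},
    {"hgnc_id": "3", "mondo_id": "30", "gene_symbol": "C", "classification_level": 3},
]
strong = apply_builder_options(table, min_classification="Strong", fields=["gene_symbol"])
print(strong)
```

On the CLI, the same option would be passed as --plugin-arg min_classification=Strong, as shown in the Build CLI entry above.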
7. Workflow steps
When invoked to build or update:
- Verify Hail is available (defer to the SessionStart hook).
- Confirm the raw CSV is present at <raw_dir>/Clingen-Gene-Disease-Summary-<YYYY-MM-DD>.csv. If absent, run the plugin downloader (Click clingen-download, e.g. hvantk download clingen --output-dir <raw_dir>).
- The builder keys the output by (hgnc_id, mondo_id). Optionally pass min_classification to drop weaker associations, and fields to select a column subset.
- Build via CLI: hvantk reprocess clingen:gene-disease --raw-dir <dir> --output <path>.ht [--plugin-arg min_classification=Strong].
- Sanity-check the output: row count plausible (~5k associations live, 11 in fixture); key fields present; HGNC/MONDO prefixes stripped on at least one known row (e.g., BRCA1 → hgnc_id == "1100", not "HGNC:1100").
- Run validation: pytest hvantk/skills/clingen/tests -m hail.
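The sanity check in the steps above can be sketched as a function over rows collected into plain dictionaries. This is a hedged illustration: `sanity_check` and `min_rows` are hypothetical names, and how the rows are pulled out of the Hail Table is left to the caller.

```python
KEY_PREFIXES = {"hgnc_id": "HGNC:", "mondo_id": "MONDO:"}


def sanity_check(rows, min_rows=1):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows, expected at least {min_rows}")
    for i, row in enumerate(rows):
        for field, prefix in KEY_PREFIXES.items():
            value = row.get(field)
            if not value:
                problems.append(f"row {i}: key field {field} is missing or empty")
            elif value.startswith(prefix):
                problems.append(f"row {i}: {field} still carries the {prefix} prefix")
        # Known-row check taken from the workflow step above.
        if row.get("gene_symbol") == "BRCA1" and row.get("hgnc_id") != "1100":
            problems.append(f"row {i}: BRCA1 should have hgnc_id '1100'")
    return problems


# Made-up rows; the MONDO value is a placeholder.
good = [{"gene_symbol": "BRCA1", "hgnc_id": "1100", "mondo_id": "0000000"}]
bad = [{"gene_symbol": "BRCA1", "hgnc_id": "HGNC:1100", "mondo_id": ""}]
print(sanity_check(good))
print(sanity_check(bad))
```

For the fixture, `min_rows=11` would match the row count quoted above; for a live build, a threshold in the low thousands is a reasonable starting point.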
8. Update playbook
When ClinGen publishes an updated snapshot (any download is effectively a new snapshot):
- Re-download: hvantk download clingen --output-dir <raw_dir> --overwrite. Capture the Last-Modified header (or the snapshot date label) in the PR description.
- Diff the new CSV header against the previous fixture header. New columns alone are non-breaking: they will not appear in the built table unless added to CLINGEN_GENE_DISEASE_FIELDS. Removed/renamed columns require updating CLINGEN_GENE_DISEASE_FIELDS.
- Drift-probe diff: regenerate hvantk/skills/clingen/tests/drift_fingerprint.json and inspect for column-list changes or hash changes.
- If the fixture (hvantk/skills/clingen/tests/testdata/raw/clingen/clingen_test_sample.csv) is no longer representative (new classification value, new GCEP under test), refresh it from a curated sub-sample.
- Re-run the plugin tests (pytest hvantk/skills/clingen/tests -m hail); expected diffs: new optional columns, refreshed classification_date values. Unexpected diffs (changed key semantics, stripped prefixes leaking back in) require investigation before commit.
- Open PR; reviewer checks the diff narrative.
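The header-diff step in the playbook above can be sketched as follows. This is a minimal illustration, not the drift probe: `diff_headers` is a hypothetical helper, and `mapped` stands in for the source column names used as keys in CLINGEN_GENE_DISEASE_FIELDS. The column named NEW COLUMN is invented for the example.

```python
def diff_headers(old_columns, new_columns, mapped):
    """Classify header changes between the previous fixture and a fresh download."""
    added = [c for c in new_columns if c not in old_columns]
    removed = [c for c in old_columns if c not in new_columns]
    # A removed (or renamed) column only breaks the build if the field map uses it.
    breaking = [c for c in removed if c in mapped]
    return {"added": added, "removed": removed, "breaking": breaking}


# Illustrative inputs: MOI is renamed upstream and one column is added.
old = ["GENE SYMBOL", "GENE ID (HGNC)", "MOI", "CLASSIFICATION"]
new = ["GENE SYMBOL", "GENE ID (HGNC)", "MODE OF INHERITANCE", "CLASSIFICATION", "NEW COLUMN"]
mapped = {"GENE SYMBOL", "GENE ID (HGNC)", "MOI", "CLASSIFICATION"}

report = diff_headers(old, new, mapped)
print(report)
```

A rename shows up as one removed and one added column, so any non-empty `breaking` list is the signal to update the field map before rebuilding.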
9. Validation contract
- fixture: hvantk/skills/clingen/tests/testdata/raw/clingen/clingen_test_sample.csv
- drift_fingerprint: hvantk/skills/clingen/tests/drift_fingerprint.json
- test_command: pytest hvantk/skills/clingen/tests -m hail
Snapshot status: the tests/snapshots/schema.json and tests/snapshots/sample_rows.json paths declared in plugin.yaml have NOT yet been seeded for this plugin (the tests/snapshots/ directory does not exist yet). Until seeded, there is no fixed-schema round-trip check; the live tests are test_downloader.py, test_drift_probe.py, and test_streamer.py.