name: hvantk:resource-clinvar
description: Onboard, build, or update the ClinVar resource for hvantk
status: provisional
backend: hail
domain: variants
ClinVar resource skill
1. Status & scope
This skill covers BUILD and UPDATE of the ClinVar Hail Table. It does NOT cover download (see hvantk/skills/clinvar/cli.py) or downstream analysis.
The builder is build_clinvar in hvantk/skills/clinvar/builder.py. It has the platform builder signature build_clinvar(parsed_input, ctx, *, reference_genome="GRCh38") -> AnnotationTable. There is no input_path / output_path / overwrite / export_tsv signature; the platform's run_builder_for_spec calls artifact.save() to materialize the table.
2. Source identity
Provider metadata (URL, version cadence, license, citation) lives in the ClinVar entry inside hvantk/resources/registry/genomics/datasets.json (find by accession: "ClinVar_latest"), surfaced via hvantk catalog show ClinVar_latest. Read that file; do not restate.
Stable note (not in catalog): ClinVar releases monthly. New INFO fields are uncommon but do occur (e.g., the addition of ONCOGENICITY in the oncogenicity supplement).
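Because new INFO fields can appear between releases, it can help to compare the INFO IDs declared in two VCF headers before rebuilding. The following is a minimal sketch using only the standard library; the function names and the sample header lines are hypothetical and are not part of hvantk.

```python
import re

# Matches the ID of a VCF meta-information line such as
# ##INFO=<ID=CLNSIG,Number=.,Type=String,Description="...">
_INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def info_ids(header_lines):
    """Return the set of INFO field IDs declared in VCF header lines."""
    ids = set()
    for line in header_lines:
        match = _INFO_ID.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def new_info_fields(old_header, new_header):
    """Return INFO IDs present in the new header but not the old one, sorted."""
    return sorted(info_ids(new_header) - info_ids(old_header))


# Illustrative header lines only; NEWFIELD is a made-up ID, not a ClinVar field.
old = ['##INFO=<ID=CLNSIG,Number=.,Type=String,Description="sig">']
new = old + ['##INFO=<ID=NEWFIELD,Number=1,Type=String,Description="x">']
added = new_info_fields(old, new)
```

A non-empty result is a hint that the schema snapshot will change on the next build.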
3. Backend choice + reasoning
Hail Table keyed by (locus, alleles). Reasoning: ClinVar is variant-keyed and joins with other variant tables (gnomAD, dbNSFP) downstream. A single Hail Table is the natural fit; nothing about ClinVar requires anndata or pandas.
4. Raw format & gotchas
- File format: bgzipped VCF (.vcf.bgz with .tbi index).
- Reference genome: GRCh38 by default. The contig_recoding() helper from hvantk/core/utils/genome.py maps numeric contigs to chr* names.
- Import: use hl.import_vcf(path, force=True, reference_genome=..., contig_recoding=..., skip_invalid_loci=True). The builder passes force=True. In Hail this flag controls how gzip-compressed (.vcf.gz) input is read (serially, without parallelism); it does not relax header parsing, so it should not be relied on to handle non-standard headers.
- Keying: after .rows(), repartition (typical: 100) and key by (locus, alleles).
- INFO fields commonly used: CLNSIG, CLNREVSTAT, CLNDN, CLNDISDB, RS, MC, GENEINFO. The full list comes from the VCF header; read it, do not assume.
- Encoding gotchas:
  - Spaces in INFO values are encoded as _ (e.g., Likely_pathogenic).
  - Multi-value INFO fields use , or | depending on the field.
  - CLNDISDB uses , between the database:ID pairs of one condition and | between conditions.
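The encoding rules can be illustrated on a raw INFO value as it appears in the VCF text. This is a minimal sketch, not hvantk code: both helper names are hypothetical, and the example assumes that CLNDN separates condition names with | (check the header description of each field before choosing a separator).

```python
def decode_label(value):
    """Turn an underscore-encoded INFO value back into readable text."""
    return value.replace("_", " ")


def split_values(raw, sep):
    """Split a raw multi-value INFO string on `sep` and decode each part.

    Empty parts and the VCF missing marker "." are dropped.
    """
    return [decode_label(part) for part in raw.split(sep) if part not in ("", ".")]


significance = decode_label("Likely_pathogenic")
# Assumed CLNDN-style value with two condition names separated by "|".
conditions = split_values("Hereditary_cancer-predisposing_syndrome|not_provided", "|")
```

Passing the separator explicitly keeps the helper honest about the fact that the delimiter differs from field to field.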
5. Output contract
AnnotationTable wrapping a Hail Table keyed by (locus, alleles), with provenance schema_id="clinvar-variants-v1". The platform writes it to the build --output path. Schema is defined by hvantk/skills/clinvar/tests/snapshots/schema.json (canonical). Human summary: row contains rsid, qual, filters, and an info struct with the ClinVar-specific INFO fields parsed by hl.import_vcf.
6. hvantk integration points
- Builder: build_clinvar in hvantk/skills/clinvar/builder.py
- Streamer: ClinVarVariantTableStreamer in hvantk/skills/clinvar/streamers.py (subclass of VariantTableStreamer in hvantk/core/streamers/)
- Downloader CLI: clinvar_downloader in hvantk/skills/clinvar/cli.py (registered as hvantk clinvar-download; lifecycle download via download_dataset in the same module)
- Dataset class: ClinVarDataset in hvantk/skills/clinvar/shared/datasets.py
- Build CLI: hvantk reprocess clinvar:variants --raw-dir <dir> --output <path>.ht (pass builder kwargs via --plugin-arg KEY=VALUE, e.g. --plugin-arg reference_genome=GRCh38). The plugin loader (hvantk/core/plugin/loader.py) resolves the dataset from plugin.yaml via get_registry().get_dataset("clinvar:variants"); the build runs through run_builder_for_spec (hvantk/core/plugin/run_builder.py).
- Plugin manifest: hvantk/skills/clinvar/plugin.yaml (drives loader registration; compound dataset key clinvar:variants)
- Test: hvantk/skills/clinvar/tests/test_builder.py
Read the existing files at these paths as ground truth for shape. This skill does not restate code.
7. Workflow steps
When invoked to build or update:
- Verify Hail is available (defer to the SessionStart hook).
- Confirm input is a ClinVar VCF: read the #CHROM header line of the input file.
- Build the table inline in build_clinvar (no shared base helper):
  - hl.import_vcf(str(parsed_input), force=True, reference_genome=reference_genome, contig_recoding=contig_recoding(), skip_invalid_loci=True).rows().repartition(100).key_by("locus", "alleles")
  - Wrap the result with AnnotationTable.from_hail(ht, provenance=ctx.provenance(schema_id="clinvar-variants-v1")). The platform calls artifact.save() to write to disk.
- Apply the parsing rules from § 4 (force, contig_recoding, skip_invalid_loci).
- Run validation: pytest hvantk/skills/clinvar/tests -m hail.
- Report: schema diff, sample-row diff, test pass/fail.
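The input check in the second step can be sketched with the standard library alone, since bgzip output is readable by Python's gzip module. The function names below are hypothetical and not part of hvantk, and the check only confirms that the file has the fixed VCF columns; it does not prove the file came from ClinVar.

```python
import gzip

# The eight fixed columns every VCF #CHROM line starts with.
REQUIRED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def read_chrom_header(path):
    """Return the #CHROM column names of a (b)gzipped VCF, or None if absent."""
    # bgzip files are a series of gzip members, which gzip.open reads transparently.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return line.rstrip("\n").split("\t")
            if not line.startswith("#"):
                break  # header section ended without a #CHROM line
    return None


def looks_like_vcf(path):
    """True if the file has a #CHROM line starting with the fixed VCF columns."""
    columns = read_chrom_header(path)
    return columns is not None and columns[:8] == REQUIRED_COLUMNS
```

Calling looks_like_vcf on the raw file before the Hail import fails fast on a wrong or truncated download, without starting a Hail session.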
8. Update playbook
When ClinVar releases a new monthly version:
- Fetch the new release: hvantk clinvar-download --output-dir /tmp/clinvar.
- Regenerate the fixture: extract a small representative slice (mix of CLNSIG values, multi-allelic site, multi-CLNDN row). The current pilot fixture is a single-chromosome slice (hvantk/skills/clinvar/tests/testdata/raw/clinvar/clinvar_20220403_chr20.vcf.bgz); this is adequate because ClinVar parsing is INFO-field driven, not chromosome-dependent. Replace the file (re-bgzip if needed). A .tbi index is optional; hl.import_vcf reads .bgz directly without it.
- Run snapshot regeneration: pytest hvantk/skills/clinvar/tests -m hail --regenerate-snapshots.
- Inspect the snapshot diff:
  - Expected diff (new INFO field, additional CLNSIG value): commit the regenerated snapshots with explanation.
  - Unexpected diff (schema regression, missing field): STOP. Investigate before committing.
- If a parsing gotcha was introduced (e.g., new encoding), add it to § 4.
- Open PR; reviewer checks the snapshot diff narrative.
9. Validation contract
fixture: hvantk/skills/clinvar/tests/testdata/raw/clinvar/clinvar_20220403_chr20.vcf.bgz
schema_snapshot: hvantk/skills/clinvar/tests/snapshots/schema.json
row_snapshot: hvantk/skills/clinvar/tests/snapshots/sample_rows.json
drift_fingerprint: hvantk/skills/clinvar/tests/drift_fingerprint.json
test_command: pytest hvantk/skills/clinvar/tests -m hail