name: kg-release-diff
description: Produce a standardized semantic-modeling diff report between two KG-Microbe merged-KG releases. Use when comparing KG versions to quantify changes in biolink/METPO categories, predicates, relations, CURIE prefixes, knowledge sources, and predicate × category signatures.
KG-Microbe Release Semantic Diff
Purpose
Compare two merged-KG releases (e.g. data/merged/20260120 vs data/merged/20260422_nometatraits) along the axes that matter for semantic modeling, not just row counts:
- Schema — which columns appear in nodes/edges in each release
- Biolink / METPO categories — added, removed, count deltas; flags deprecated biolink categories still in use
- Predicates — added, removed, count deltas; biolink vs METPO vocabulary shift
- Relation column — added, removed, count deltas (RO / domain terms)
- CURIE prefix coverage — on node ids, edge subjects, edge objects
- Knowledge-source / provided_by distribution — which transforms contribute nodes/edges
- Predicate × (subject category, object category) signatures — the strongest structural fingerprint of modeling choices
The report is a single Markdown file suitable for PRs, release notes, or hand-off to ontology reviewers.
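Most of the axes above reduce to the same operation: count values in each release, then split the two count tables into added keys, removed keys, and count deltas. The following is a minimal sketch of that idea for CURIE prefixes, not the script's actual code. The helper names `curie_prefix` and `diff_counts` are hypothetical, and the sample ids are made up for illustration.

```python
from collections import Counter


def curie_prefix(curie: str) -> str:
    """Return the text before the first colon, or '' when there is no colon."""
    prefix, sep, _ = curie.partition(":")
    return prefix if sep else ""


def diff_counts(old: Counter, new: Counter):
    """Split two count tables into added keys, removed keys, and count deltas."""
    added = {k: new[k] for k in new if k not in old}
    removed = {k: old[k] for k in old if k not in new}
    # Deltas only cover keys present in both releases whose count changed.
    delta = {k: new[k] - old[k] for k in old if k in new and new[k] != old[k]}
    return added, removed, delta


# Illustrative node ids only; these are not taken from a real release.
old_ids = ["NCBITaxon:562", "NCBITaxon:1280", "CHEBI:15377"]
new_ids = ["NCBITaxon:562", "CHEBI:15377", "CHEBI:16236", "METPO:1000059"]

old_prefixes = Counter(curie_prefix(i) for i in old_ids)
new_prefixes = Counter(curie_prefix(i) for i in new_ids)
added, removed, delta = diff_counts(old_prefixes, new_prefixes)

print(added)    # {'METPO': 1}
print(removed)  # {}
print(delta)    # {'NCBITaxon': -1, 'CHEBI': 1}
```

The same `diff_counts` shape would apply to categories, predicates, relations, and knowledge sources, since each is a column reduced to a count table.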
Usage
# Two release directories under data/merged/
poetry run python .claude/skills/kg-release-diff/kg_release_diff.py \
--old data/merged/20260120 \
--new data/merged/20260422_nometatraits \
--out data/merged/release_diff_20260120_vs_20260422_nometatraits.md
# Or provide explicit TSV paths (either can be gzipped)
poetry run python .claude/skills/kg-release-diff/kg_release_diff.py \
--old-nodes data/merged/old/merged-kg_nodes.tsv \
--old-edges data/merged/old/merged-kg_edges.tsv \
--new-nodes data/merged/new/merged-kg_nodes.tsv \
--new-edges data/merged/new/merged-kg_edges.tsv \
--old-label 20260120 \
--new-label 20260422_nometatraits \
--out release_diff.md
Flags
| Flag | Default | Purpose |
|---|---|---|
| --old PATH / --new PATH | none | Directory containing merged-kg_nodes.tsv + merged-kg_edges.tsv (accepts .gz) |
| --old-nodes / --old-edges / --new-nodes / --new-edges | none | Explicit file paths (overrides directory lookup) |
| --old-label STR / --new-label STR | directory name | Human-friendly labels that appear in the report header |
| --out PATH | stdout | Write the Markdown report here |
| --max-rows N | 0 (no cap) | Row cap per file; useful for smoke tests |
| --no-signatures | false | Skip the predicate × category signature pass (saves ~150 MB RAM on large releases) |
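The directory lookup, the .gz tolerance, and the row cap described in the table can be pictured with a short sketch. This is an assumption about how such behavior could be implemented, not a copy of kg_release_diff.py. The function names `resolve_release_file`, `open_tsv`, and `iter_rows` are hypothetical.

```python
import csv
import gzip
import itertools
from pathlib import Path


def resolve_release_file(release_dir, filename):
    """Return the plain or gzipped file under release_dir, preferring the plain one."""
    base = Path(release_dir)
    for candidate in (base / filename, base / (filename + ".gz")):
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"{filename}(.gz) not found in {base}")


def open_tsv(path):
    """Open a TSV in text mode, transparently handling a .gz suffix."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def iter_rows(path, max_rows=0):
    """Yield rows as dicts keyed by column name; max_rows=0 means no cap."""
    with open_tsv(path) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        yield from (itertools.islice(reader, max_rows) if max_rows else reader)
```

With helpers like these, a smoke test would pass a small `max_rows` value to scan only the head of each file, while a full run leaves it at 0.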
When to invoke
- Preparing release notes for a new merged KG
- Validating that a transform change produced the intended modeling shift (e.g. biolink → METPO predicate preservation)
- Auditing deprecated biolink terms across releases
- Diagnosing which knowledge source contributes which edge types
How to report findings
When run interactively, after the script finishes:
- Read back the Summary table and the top 3–5 rows from each change list
- Call out any deprecated-biolink categories or unregistered-prefix warnings
- Surface the predicate vocabulary shift (biolink vs METPO %) as a one-sentence takeaway — this is often the headline
- Point the user at the full report path
The script emits progress lines to stderr ([scan] old nodes: …) so long runs don't look hung.
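The vocabulary-shift takeaway mentioned above is a percentage of edges per predicate prefix in each release. A rough sketch of that calculation follows, assuming predicates are CURIEs such as biolink:... or METPO:... and that counts are already aggregated. The function name `vocabulary_share`, the predicate names, and the counts are hypothetical.

```python
from collections import Counter


def vocabulary_share(predicate_counts):
    """Return the percentage of edges per predicate prefix (e.g. biolink, METPO)."""
    by_prefix = Counter()
    for predicate, count in predicate_counts.items():
        prefix = predicate.split(":", 1)[0] if ":" in predicate else "(no prefix)"
        by_prefix[prefix] += count
    total = sum(by_prefix.values())
    if not total:
        return {}
    return {p: round(100.0 * c / total, 1) for p, c in by_prefix.items()}


# Made-up edge counts per predicate, for illustration only.
old_counts = Counter({"biolink:has_phenotype": 80, "METPO:example_predicate": 20})
new_counts = Counter({"biolink:has_phenotype": 25, "METPO:example_predicate": 75})

old_share = vocabulary_share(old_counts)
new_share = vocabulary_share(new_counts)
print(f"METPO predicates went from {old_share['METPO']}% to {new_share['METPO']}% of edges")
# METPO predicates went from 20.0% to 75.0% of edges
```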
Implementation
- kg_release_diff.py streams both TSVs with csv.DictReader (reads by column name, tolerating column order / count differences).
- Counter-based aggregation keeps memory proportional to the number of distinct values rather than to file size, except for the node id → category map used for predicate × category signatures (~150 MB for a 2 M-node release; disable with --no-signatures if memory-constrained).
- Deprecated biolink categories are listed in BIOLINK_DEPRECATED_CATEGORIES near the top of the script; update the list as the Biolink Model evolves.
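The signature pass described above can be sketched as two streaming steps: build the node id → category map, then stream edges and count each (predicate, subject category, object category) triple. This is a simplified illustration under the assumption of KGX-style column names (id, category, subject, predicate, object); the function names are hypothetical and the real script may differ.

```python
import csv
import io
from collections import Counter


def build_category_map(node_handle):
    """Map each node id to its category; this is the one structure that grows with node count."""
    reader = csv.DictReader(node_handle, delimiter="\t")
    return {row["id"]: (row.get("category") or "") for row in reader}


def count_signatures(edge_handle, category_of):
    """Count (predicate, subject category, object category) triples while streaming edges."""
    signatures = Counter()
    for row in csv.DictReader(edge_handle, delimiter="\t"):
        key = (
            row.get("predicate") or "",
            category_of.get(row.get("subject") or "", "(unknown)"),
            category_of.get(row.get("object") or "", "(unknown)"),
        )
        signatures[key] += 1
    return signatures


# Tiny in-memory stand-ins for the nodes and edges TSVs (illustrative data).
nodes_tsv = io.StringIO(
    "id\tcategory\n"
    "NCBITaxon:562\tbiolink:OrganismTaxon\n"
    "CHEBI:15377\tbiolink:ChemicalEntity\n"
)
edges_tsv = io.StringIO(
    "subject\tpredicate\tobject\n"
    "NCBITaxon:562\tbiolink:consumes\tCHEBI:15377\n"
    "NCBITaxon:562\tbiolink:consumes\tCHEBI:99999\n"
)

category_of = build_category_map(nodes_tsv)
signatures = count_signatures(edges_tsv, category_of)
for signature, count in signatures.items():
    print(signature, count)
```

Reading by column name through csv.DictReader is what lets the same code run on releases whose columns differ in order or number; an edge endpoint missing from the node map falls into an "(unknown)" bucket in this sketch.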
Output structure
The generated Markdown has exactly these sections:
- Summary (one numbers table)
- Schema (columns) — nodes and edges subsections
- Biolink / METPO category usage — added/removed/delta tables + deprecated list
- Predicate usage — added/removed/delta tables + biolink-vs-METPO shift table
- Relation column usage
- CURIE prefix coverage — node id, subject, object subsections
- primary_knowledge_source distribution
- provided_by distribution on nodes
- Predicate × (subject category, object category) signatures (omitted when --no-signatures is set)
Each added/removed/delta table is capped at 40 rows with a …and N more footer.
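The 40-row cap with its footer could be rendered along the following lines. This is a hedged sketch: the function name `render_capped_table` is hypothetical, and whether the real script places the footer inside or below the table is not stated here.

```python
def render_capped_table(rows, headers, cap=40):
    """Render rows as a Markdown table, truncated to `cap` rows with a footer line."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
    ]
    for row in rows[:cap]:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    hidden = len(rows) - cap
    if hidden > 0:
        # Footer reports how many rows were left out of the table.
        lines.append(f"…and {hidden} more")
    return "\n".join(lines)


# 45 made-up rows: 40 are shown and the footer reports the remaining 5.
sample_rows = [(f"example:predicate_{i}", i) for i in range(45)]
table = render_capped_table(sample_rows, ["Predicate", "Count"])
print(table.splitlines()[-1])  # …and 5 more
```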