dismech-compliance

Skill for analyzing and improving compliance in the dismech knowledge base. Use this skill when checking disorder file completeness, identifying missing fields (ontology terms, evidence, descriptions), understanding weighted priority scoring, and systematically improving knowledge base coverage.

By monarch-initiative. Updated 12/15/2025.

DisMech Compliance Analysis Skill

Overview

Analyze and improve the completeness of disorder YAML files in the dismech knowledge base. The compliance system checks for recommended fields (ontology terms, evidence items, descriptions) and generates scores to identify priority curation targets.

When to Use

  • Running compliance checks on disorder files
  • Identifying missing recommended fields
  • Understanding which files need the most curation work
  • Improving overall knowledge base quality
  • Generating compliance dashboards and reports
  • Understanding field priority (weighted scoring)

Key Commands

Analyze Single File

just compliance kb/disorders/Asthma.yaml

Output includes:

  • Global Compliance: Percentage of recommended fields populated
  • Weighted Compliance: Score adjusted by field importance
  • Summary by Slot: Compliance grouped by field type (term, evidence, description)
  • Aggregated Scores by List Path: Breakdown for nested structures
  • Detailed Path Scores: Individual field status (OK/MISSING)

Analyze All Files

just compliance-all

Multi-file report showing:

  • Overall knowledge base compliance
  • Per-path compliance across all files
  • Quick identification of systematically missing fields

Weighted Analysis with Thresholds

just compliance-weighted

Uses conf/qc_config.yaml to:

  • Apply importance weights to different fields
  • Flag violations where compliance falls below minimum thresholds
  • Prioritize critical fields (phenotype terms, cell types, disease terms)
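The threshold check described above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's implementation: the function name `find_violations`, the input dictionaries, the sample scores, and the printed message format are all assumptions made for the example.

```python
from typing import Optional


def find_violations(
    scores: dict[str, float],
    thresholds: dict[str, Optional[float]],
) -> list[tuple[str, float, float]]:
    """Return (path, compliance, minimum) for every path below its minimum."""
    violations = []
    for path, compliance in scores.items():
        minimum = thresholds.get(path)
        # A missing or null threshold means the path is never flagged.
        if minimum is not None and compliance < minimum:
            violations.append((path, compliance, minimum))
    return violations


# Made-up compliance percentages, for illustration only.
scores = {
    "disease_term.term": 100.0,
    "phenotypes[].phenotype_term.term": 72.5,
    "description": 40.0,
}
thresholds = {
    "disease_term.term": 95.0,
    "phenotypes[].phenotype_term.term": 90.0,
    "description": None,  # no minimum configured
}

for path, compliance, minimum in find_violations(scores, thresholds):
    print(f"VIOLATION {path}: {compliance:.1f}% < {minimum:.1f}%")
```

Only the phenotype term path is reported here: the description path scores lowest but has no minimum, so it cannot produce a violation.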

Generate Reports

# CSV format for spreadsheet analysis
just compliance-csv

# JSON format for programmatic processing
just compliance-report

Generate Visual Dashboard

just gen-dashboard

Creates dashboard/index.html with:

  • Interactive charts showing compliance distribution
  • Priority curation targets (10 lowest-scoring files)
  • Trend analysis across the knowledge base

Graph-Derived Metrics (Connectivity)

Recommended-slot compliance only measures whether fields are populated on an object. It cannot express cross-object graph properties — most importantly, whether a phenotype is actually wired into the causal pathograph. A phenotype can have a perfect HPO term, evidence, and description (full compliance credit) yet still float as a disconnected node, because the edge that connects it lives on a different object's downstream list.

dismech.qc_plugins fills this gap with a graph-derived QC metric computed from build_causal_graph() and emitted as an AggregatedPathScore (path phenotypes[].causal_inlink), so it composes with weighted compliance and conf/qc_config.yaml weights/thresholds like any other field. It is graded coverage, not a binary gate: a file with 9/12 phenotypes wired in scores 75%.

# Per-file connectivity coverage across the KB (lists files with gaps)
just compliance-connectivity

# Show the floating phenotype node names
just compliance-connectivity --list-unconnected

# Fail CI if aggregate coverage drops below a percentage
just compliance-connectivity --fail-under 30

A phenotype counts as connected when at least one causal edge (causes / leads_to predicate) targets it. To fix a floating phenotype, add a downstream entry that names it, downstream: [{target: <phenotype name>}], to the upstream pathophysiology node. The QCMetricPlugin protocol in src/dismech/qc_plugins.py is the generic seam for further graph-derived metrics (orphan-target rate, gene-to-mechanism wiring, dead-end nodes).
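To make the coverage idea concrete, here is a simplified sketch that works directly on a loaded disorder dictionary. It does not call build_causal_graph() or the QCMetricPlugin protocol; the function name `connectivity_coverage` is hypothetical, it assumes edges live in pathophysiology[].downstream[].target, and it treats every downstream entry as a causal edge without inspecting predicates.

```python
def connectivity_coverage(disorder: dict) -> tuple[float, list[str]]:
    """Return (percent of phenotypes with an incoming edge, unconnected names)."""
    phenotype_names = [p["name"] for p in disorder.get("phenotypes") or []]

    # Collect every downstream target declared on a pathophysiology node.
    targets = set()
    for node in disorder.get("pathophysiology") or []:
        for edge in node.get("downstream") or []:
            targets.add(edge["target"])

    unconnected = [name for name in phenotype_names if name not in targets]
    if not phenotype_names:
        # Arbitrary choice for this sketch: no phenotypes means nothing is floating.
        return 100.0, []
    connected = len(phenotype_names) - len(unconnected)
    return 100.0 * connected / len(phenotype_names), unconnected


# Toy disorder: "Cough" has no upstream node pointing at it.
disorder = {
    "pathophysiology": [
        {"name": "Airway inflammation", "downstream": [{"target": "Wheezing"}]},
    ],
    "phenotypes": [{"name": "Wheezing"}, {"name": "Cough"}],
}

coverage, floating = connectivity_coverage(disorder)
print(f"{coverage:.0f}% connected; floating: {floating}")
```

In this toy file, adding `{"target": "Cough"}` to the downstream list of the upstream node would raise coverage from 50% to 100%.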

Understanding Compliance Scores

Global vs Weighted Compliance

| Metric | Description |
|---|---|
| Global Compliance | Simple percentage: populated fields / total recommended fields |
| Weighted Compliance | Adjusted by field importance from conf/qc_config.yaml |
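The difference between the two metrics is easiest to see with numbers. The sketch below assumes weighted compliance is a weighted average (weight earned by populated fields divided by total weight); the source only says the score is "adjusted by field importance", so the exact formula, the function names, and the sample fields are assumptions.

```python
# Each entry: (path, weight, populated?). Sample data, not real output.
fields = [
    ("disease_term.term", 5.0, False),
    ("phenotypes[0].description", 0.5, True),
    ("phenotypes[1].description", 0.5, True),
]


def global_compliance(entries) -> float:
    """Unweighted: populated fields / total recommended fields."""
    populated = sum(1 for _, _, ok in entries if ok)
    return 100.0 * populated / len(entries)


def weighted_compliance(entries) -> float:
    """Assumed formula: weight of populated fields / total weight."""
    total = sum(weight for _, weight, _ in entries)
    earned = sum(weight for _, weight, ok in entries if ok)
    return 100.0 * earned / total


print(f"global:   {global_compliance(fields):.1f}%")
print(f"weighted: {weighted_compliance(fields):.1f}%")
```

Two of three fields are populated, so the global score is about 66.7%, but the single missing field carries most of the weight, which pulls the weighted score down to about 16.7%.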

Field Weights (from qc_config.yaml)

| Field | Weight | Min Threshold | Why |
|---|---|---|---|
| disease_term.term | 5.0 | 95% | Root disease identity - always required |
| phenotypes[].phenotype_term.term | 3.0 | 90% | Core clinical data |
| pathophysiology[].cell_types[].term | 3.0 | 85% | Mechanistic understanding |
| treatments[].treatment_term.term | 2.5 | 80% | Clinical relevance |
| term (general) | 2.0 | 80% | All ontology bindings |
| pathophysiology[].evidence | 2.0 | 80% | Scientific backing |
| evidence (general) | 1.5 | - | Valuable but not always required |
| description | 0.5 | - | Nice-to-have context |

Compliance Status Values

| Status | Meaning |
|---|---|
| OK | Field is populated |
| MISSING | Recommended field is empty/absent |
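As a rough illustration of how a list path such as phenotypes[].phenotype_term.term maps to per-item OK/MISSING statuses, the sketch below walks a loaded YAML dictionary. The helpers `resolve` and `status` are hypothetical and only approximate the idea; in particular, what the real tool counts as "empty" is an assumption here.

```python
def resolve(data, path: str) -> list:
    """Return every value reached by a dotted path; None marks a missing value."""
    values = [data]
    for part in path.split("."):
        is_list = part.endswith("[]")
        key = part[:-2] if is_list else part
        next_values = []
        for value in values:
            child = value.get(key) if isinstance(value, dict) else None
            if is_list:
                # Fan out over list items; an absent list contributes nothing.
                next_values.extend(child or [])
            else:
                next_values.append(child)
        values = next_values
    return values


def status(value) -> str:
    """Assumed rule: None and empty containers/strings count as MISSING."""
    return "MISSING" if value in (None, "", [], {}) else "OK"


disorder = {
    "phenotypes": [
        {
            "name": "Wheezing",
            "phenotype_term": {"term": {"id": "HP:0030828", "label": "Wheezing"}},
        },
        {"name": "Cough", "phenotype_term": {"preferred_term": "Cough"}},
    ]
}

path = "phenotypes[].phenotype_term.term"
statuses = [status(v) for v in resolve(disorder, path)]
print(path, statuses)
```

The first phenotype has a term binding and the second does not, so the path resolves to one OK and one MISSING entry.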

Improving Compliance

Priority Order

Address fields in this priority order based on weights:

  1. disease_term.term (weight 5.0) - Add MONDO term for the disease
  2. phenotypes[].phenotype_term.term (weight 3.0) - Add HPO terms to phenotypes
  3. pathophysiology[].cell_types[].term (weight 3.0) - Add CL terms to cell types
  4. treatments[].treatment_term.term (weight 2.5) - Add MAXO terms to treatments
  5. pathophysiology[].evidence (weight 2.0) - Add PMID-backed evidence
  6. General descriptions (weight 0.5) - Add explanatory text

Common Fixes

Missing disease_term.term

disease_term:
  preferred_term: Asthma
  term:
    id: MONDO:0004979
    label: asthma

Look up: uv run runoak -i sqlite:obo:mondo info "asthma"

Missing phenotype_term.term

phenotypes:
- name: Wheezing
  phenotype_term:
    preferred_term: Wheezing
    term:
      id: HP:0030828
      label: Wheezing

Look up: uv run runoak -i sqlite:obo:hp info "l~wheezing"

Missing cell_types[].term

cell_types:
- preferred_term: Mast cells
  term:
    id: CL:0000097
    label: mast cell

Look up: uv run runoak -i sqlite:obo:cl info "l~mast cell"

Missing treatment_term.term

treatments:
- name: Inhaled corticosteroids
  treatment_term:
    preferred_term: corticosteroid therapy
    term:
      id: MAXO:0000653
      label: corticosteroid therapy

Look up: uv run runoak -i sqlite:obo:maxo search "corticosteroid"

Missing evidence

evidence:
  - reference: PMID:12345678
    supports: SUPPORT
    snippet: "Exact quote from abstract"
    explanation: "Why this supports the claim"

Batch Improvement Workflow

1. Identify Lowest-Scoring Files

just gen-dashboard
# Check dashboard/index.html for "Priority Curation Targets"

Or use compliance-report and sort:

just compliance-report | jq -r '.files | sort_by(.weighted_compliance) | .[:10] | .[].file'
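The same selection can be done in Python when further processing is needed. This sketch assumes only what the jq expression above implies about the report: a top-level files list whose entries carry file and weighted_compliance keys. The function name `lowest_scoring` and the sample report are made up for the example.

```python
import json


def lowest_scoring(report: dict, n: int = 10) -> list[str]:
    """Return the n file names with the lowest weighted compliance."""
    ranked = sorted(report["files"], key=lambda entry: entry["weighted_compliance"])
    return [entry["file"] for entry in ranked[:n]]


# Stand-in for the JSON report; real reports would be read from the command output.
report = json.loads(
    """
    {"files": [
        {"file": "kb/disorders/A.yaml", "weighted_compliance": 82.0},
        {"file": "kb/disorders/B.yaml", "weighted_compliance": 41.5},
        {"file": "kb/disorders/C.yaml", "weighted_compliance": 67.3}
    ]}
    """
)

for name in lowest_scoring(report, n=2):
    print(name)
```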

2. Check Threshold Violations

just compliance-weighted 2>&1 | grep "VIOLATION"

3. Systematic Field Addition

For systematically missing fields across many files:

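import glob

import yaml

# Example: find files missing disease_term.term
for path in sorted(glob.glob("kb/disorders/*.yaml")):
    with open(path, encoding="utf-8") as handle:
        data = yaml.safe_load(handle) or {}  # an empty file loads as None
    # "disease_term:" with no value also loads as None, so fall back to {}
    disease_term = data.get("disease_term") or {}
    if not disease_term.get("term"):
        print(f"{path}: missing disease_term.term")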

4. Validate After Changes

# Schema validation
just validate kb/disorders/MyDisease.yaml

# Term validation (labels match ontology)
just validate-terms-file kb/disorders/MyDisease.yaml

# Re-check compliance
just compliance kb/disorders/MyDisease.yaml

Configuration

qc_config.yaml Structure

# Default for unconfigured fields
default_weight: 1.0
default_min_compliance: null

# Per-slot config (applies everywhere that slot appears)
slots:
  term:
    weight: 2.0
    min_compliance: 80.0

# Per-path config (overrides slot config for specific locations)
paths:
  "phenotypes[].phenotype_term.term":
    weight: 3.0
    min_compliance: 90.0
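The lookup order implied by the comments above (per-path overrides per-slot, which overrides the defaults) can be sketched as follows. The function `resolve_config` is hypothetical, the key-by-key merge is an assumption about how overrides combine, and `example_list[].term` is an invented path used only to show the slot fallback.

```python
from typing import Optional


def resolve_config(config: dict, path: str) -> tuple[float, Optional[float]]:
    """Return (weight, min_compliance) for a path: paths > slots > defaults."""
    defaults = {
        "weight": config.get("default_weight", 1.0),
        "min_compliance": config.get("default_min_compliance"),
    }
    # The slot is the last path segment, without any list marker.
    slot = path.split(".")[-1].removesuffix("[]")
    slot_conf = (config.get("slots") or {}).get(slot, {})
    path_conf = (config.get("paths") or {}).get(path, {})
    # Later dictionaries win, so per-path settings override per-slot ones.
    merged = {**defaults, **slot_conf, **path_conf}
    return merged["weight"], merged["min_compliance"]


# Dictionary form of the YAML structure shown above.
config = {
    "default_weight": 1.0,
    "default_min_compliance": None,
    "slots": {"term": {"weight": 2.0, "min_compliance": 80.0}},
    "paths": {
        "phenotypes[].phenotype_term.term": {"weight": 3.0, "min_compliance": 90.0}
    },
}

for p in (
    "phenotypes[].phenotype_term.term",  # matched by a per-path entry
    "example_list[].term",               # hypothetical path, falls back to the slot
    "phenotypes[].description",          # unconfigured, falls back to defaults
):
    print(p, resolve_config(config, p))
```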

Customizing Weights

Edit conf/qc_config.yaml to:

  • Increase weight for critical fields in your workflow
  • Set min_compliance thresholds to enforce standards
  • Add new paths for specific validation requirements

Integration with Other Skills

  • Use dismech-terms skill when adding ontology term bindings
  • Use dismech-references skill when adding evidence items
  • Run just qc after improvements for full validation

Troubleshooting

"Weighted Compliance" differs significantly from "Global Compliance"

This indicates that high-weight fields have different coverage than low-weight fields. A weighted score well below the global score means the missing fields are mostly the important ones, so focus on improving high-weight fields first.

Many MISSING descriptions

Descriptions have low weight (0.5) and no minimum threshold. Address these last, or not at all if not needed.

Threshold violations blocking CI

Check conf/qc_config.yaml for min_compliance settings. Either:

  1. Improve the field coverage to meet the threshold
  2. Lower the threshold if it's too aggressive

Dashboard not generating

Ensure the dashboard directory exists and you have write permissions:

mkdir -p dashboard
just gen-dashboard