name: audit-schema-gaps
description: Systematic gap-and-inconsistency audit for the TraitMech LinkML schema, the trait YAML corpus, and the scripts that generate them; produces a re-runnable validation harness, schema + writer audits, and a prioritized fix backlog. Invoke when you suspect schema drift, when records are silently failing validation, after a new seeder/migration pass, or whenever the user asks to "audit the schema", "find data quality issues", or similar.
version: 1.0.0
tags: [validation, linkml, schema, data-quality, audit, qc]
author: TraitMech Team
created: 2026-05-19
Audit schema gaps (schema · instances · pipeline)
Why this skill exists
TraitMech's default just validate-all target runs linkml-validate per-file in open-schema mode and swallows non-zero exits — so unknown fields (the most common form of seeder drift) silently pass. Use this skill when you need a strict, all-files-in-one-pass health check, plus structural audits over the schema and the writers that emit YAML.
This skill is the deep counterpart to schema-gap-analysis (the lightweight quick-check). Use schema-gap-analysis for a ~5-min smoke test; use this one when you suspect systemic drift or before a release.
The skill is built around three orthogonal lenses:
- Instance-record validation: every YAML under `data/traits/**` is validated with `linkml-validate` in closed mode (unknown fields rejected). Errors are categorized into a TSV.
- Schema audit: programmatic probes over `src/traitmech/schema/traitmech.yaml` for identifier policy, untyped `string` slots, divergent term-field naming, inconsistent `required:`, orphan enums, and range references to undefined types.
- Pipeline / writer audit: every Python module that writes a YAML is checked on four questions. Does it append to `curation_history`? Does it have a write-safeguard (`--dry-run` opt-out or `--apply`/`--write` opt-in)? Does it validate before writing? Is it wired into a `just` target?
Output is reports under reports/ plus a re-runnable validation harness at scripts/validate_strict.py, all version-controlled.
When to use
| Trigger | Why |
|---|---|
| User asks to "audit the schema / records / pipeline" | Direct ask. |
| User reports records "silently fail" or "validation passes but the data is wrong" | The signature symptom of open-schema validation. |
| A new seeder/migration is added | Run after to confirm it didn't introduce drift. |
| Schema changes (`src/traitmech/schema/traitmech.yaml` modified) | Refresh the audits against the new schema. |
| You're picking up a TraitMech repo cold and want to know its actual health | The harness gives a single-number answer (files with ERROR). |
Don't invoke this skill for:
- Single-file validation: use `just validate FILE` directly (open-mode, fast).
- A 5-minute quick check: use the `schema-gap-analysis` skill instead.
- Tweaking a renderer / page template: out of scope.
Required tooling
All already present in the repo; no new dependencies needed:
- `uv` (Python runner)
- `linkml-validate` CLI and `linkml.validator.Validator` Python API (already a dev dep)
- `just` (existing target wrapper)
- Standard `pyyaml`
If scripts/validate_strict.py is missing, this skill recreates it from spec (see "Step 1" below).
Workflow
The skill is a five-step pipeline. Each step produces an artifact; later steps reuse earlier outputs. Re-run independently as needed.
Step 1 — Strict validation harness
File: scripts/validate_strict.py
Just target: just validate-strict
Critical implementation requirements (these are what just validate-all gets wrong):
- Use `linkml.validator.Validator` in-process (much faster than a subprocess per file), not the CLI.
- Configure with `JsonschemaValidationPlugin(closed=True)` so unknown fields are flagged. This is the central correctness requirement. Without `closed=True`, unexpected-field errors hide.
- Parallelize via `ProcessPoolExecutor` with `ncpu - 1` workers; keep a per-worker singleton Validator (init once, validate many).
- Classify each message into a category via narrow regexes: `unexpected_field`, `missing_required`, `enum_mismatch`, `type_mismatch`, `pattern_mismatch`, `format_mismatch`, `range_violation`, and a catch-all `other`.
- TraitMech has a single root class (`TraitRecord`); validate every record against it. (No class-routing needed, unlike CultureMech, which routes between MediaRecipe and SolutionRecipe.)
- Output TSV with columns `file`, `category`, `detail`, `path`, `message`. Use `lineterminator="\n"` to avoid CRLF line endings (Python's `csv` writer defaults to `\r\n`, including on macOS).
- Exit code 1 if any ERROR rows; 0 if clean. Never exit 0 on errors unless the caller explicitly passes `--fail-on=never`.
- Flags: `--sample N`, `--out PATH`, `--workers N`, `--quiet`, `--fail-on=error|never`.
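The classification requirement above can be sketched with an ordered list of narrow regexes where the first match wins. This is a minimal illustration, not the harness's actual code: the patterns assume jsonschema-style message wording (for example `Additional properties are not allowed ('x' was unexpected)`), and the names `CATEGORY_PATTERNS`, `UNEXPECTED_NAMES`, and `classify` are hypothetical. Check the patterns against real validator output before relying on them.

```python
import re

# Ordered (category, pattern) pairs; the first match wins.
# ASSUMPTION: message wording follows jsonschema's conventions.
CATEGORY_PATTERNS = [
    ("unexpected_field", re.compile(r"Additional properties are not allowed")),
    ("missing_required", re.compile(r"is a required property")),
    ("enum_mismatch", re.compile(r"is not one of \[")),
    ("type_mismatch", re.compile(r"is not of type ")),
    ("pattern_mismatch", re.compile(r"does not match ")),
    ("format_mismatch", re.compile(r"is not a '[^']+'")),
    ("range_violation", re.compile(r"is (?:less|greater) than the (?:minimum|maximum)")),
]

# Extracts the offending field names from an unexpected-field message.
UNEXPECTED_NAMES = re.compile(r"\(('.*') (?:was|were) unexpected\)")


def classify(message: str) -> tuple:
    """Return (category, detail) for one validator message."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(message):
            detail = ""
            if category == "unexpected_field":
                found = UNEXPECTED_NAMES.search(message)
                if found:
                    # "'a', 'b'" becomes "a, b" for the TSV detail column.
                    detail = found.group(1).replace("'", "")
            return category, detail
    # Anything unrecognized lands in the catch-all bucket.
    return "other", ""
```

Keeping the patterns narrow means new or unfamiliar messages fall into `other`, where they are visible, instead of being misfiled under a broad pattern.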
Smoke-test on --sample 5 before any full-corpus run.
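A rough sketch of how the listed flags and the exit-code rule could fit together with `argparse`. The function names, the default output path, and the choice to have `--sample` take the first N files in sorted order (rather than a random sample) are assumptions made for illustration.

```python
import argparse
import os
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the flags listed for the harness (hypothetical defaults)."""
    parser = argparse.ArgumentParser(description="Closed-schema validation harness (sketch)")
    parser.add_argument("--sample", type=int, default=None, metavar="N",
                        help="validate only N files (smoke test)")
    parser.add_argument("--out", metavar="PATH",
                        default="reports/instance_validation_failures.tsv")
    # ncpu - 1 workers, but never fewer than one.
    parser.add_argument("--workers", type=int, metavar="N",
                        default=max(1, (os.cpu_count() or 2) - 1))
    parser.add_argument("--quiet", action="store_true")
    parser.add_argument("--fail-on", choices=["error", "never"], default="error")
    return parser


def select_files(paths: List[str], sample: Optional[int]) -> List[str]:
    """ASSUMPTION: --sample takes the first N paths in sorted order."""
    ordered = sorted(paths)
    return ordered if sample is None else ordered[:sample]


def exit_code(error_rows: int, fail_on: str) -> int:
    """Return 1 when ERROR rows exist, unless the caller opted out explicitly."""
    if fail_on == "never":
        return 0
    return 1 if error_rows > 0 else 0
```

A deterministic sample makes a `--sample 5` smoke test reproducible between runs; a random sample would cover more of the corpus over time but is harder to compare.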
Step 2 — Full-corpus validation
just validate-strict
Walks every data/traits/**/*.yaml. Currently ~357 files; runs in ~10 s on 9 workers.
Outputs:
- `reports/instance_validation_failures.tsv`: one row per ERROR.
- Console summary by category.
If the previous run was clean and this one isn't, the recent commits did it. git log -- data/traits/ src/traitmech/schema/ is your starting point.
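The two outputs can be illustrated with the standard `csv` module: one function writes the failure rows with LF line endings, another derives the per-category console summary and the "files with ERROR" count. This is a sketch under the column layout described in Step 1; the function names are hypothetical.

```python
import csv
import io
from collections import Counter

COLUMNS = ["file", "category", "detail", "path", "message"]


def write_rows(rows, handle):
    """Write failure rows as TSV; lineterminator='\\n' avoids csv's default CRLF."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)


def summarize(handle):
    """Return (rows per category, number of distinct files with at least one row)."""
    by_category = Counter()
    files = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        by_category[row["category"]] += 1
        files.add(row["file"])
    return by_category, len(files)


# Illustrative rows only; the field names here are made up.
demo = io.StringIO()
write_rows(
    [
        ["data/traits/a.yaml", "unexpected_field", "foo", "/", "msg 1"],
        ["data/traits/a.yaml", "missing_required", "", "/", "msg 2"],
        ["data/traits/b.yaml", "unexpected_field", "bar", "/", "msg 3"],
    ],
    demo,
)
demo.seek(0)
counts, failing_files = summarize(demo)
for category, count in counts.most_common():
    print(f"{category}\t{count}")
print(f"files with ERROR: {failing_files}")
```

Comparing the per-category counts between two runs narrows down which kind of drift the recent commits introduced.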
Step 3 — Schema probes
File: scripts/audit_schema.py
Run: uv run python scripts/audit_schema.py > /tmp/schema_probes.md
Probes (all programmatic, no LLM):
| Probe | What it finds |
|---|---|
| Classes without `identifier: true` slot | Descriptors with no stable cross-reference handle. (Sub-objects like CausalEdge, EvidenceItem are expected to lack one; the value is in flagging root-level classes that drop it.) |
| Slots with `range: string` whose name suggests enum/term | Drift opportunities, e.g. `*_mode`, `*_type`, `*_kind`. |
| Term/ontology slot naming divergence | `term_kind` vs `node_id` vs `predicate_id` vs `graph_id`, when they should be uniform. |
| `required: true` inconsistency for analogous attributes | E.g. `evidence` required on CausalEdge but optional on TraitRecord. |
| Orphan enums (declared but never used as `range:`) | Dead schema. |
| `range:` references to undefined classes/types/enums | Broken schema. |
| Enum casing audit | Mixed UPPER/lower/mixed values within a single enum. |
Hand-compose reports/schema_gap_audit.md from the probe output. The composition is the value-add — explaining why each finding matters and citing instance counts from Step 2.
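Three of the probes (orphan enums, undefined ranges, and string slots with enum-like names) can be sketched over the schema once it is loaded into a plain dict, for example with `yaml.safe_load`. This is a simplified illustration, not the contents of `scripts/audit_schema.py`: it ignores `slot_usage`, `any_of`, and imported schemas, and the `BUILTIN_TYPES` set is an assumed subset of the standard LinkML types.

```python
# ASSUMPTION: a subset of the built-in LinkML type names; extend as needed.
BUILTIN_TYPES = {
    "string", "integer", "boolean", "float", "double", "decimal",
    "date", "datetime", "time", "uri", "uriorcurie", "curie", "ncname",
}
SUSPECT_SUFFIXES = ("_mode", "_type", "_kind")


def iter_slot_defs(schema):
    """Yield (owner, slot_name, definition) for top-level slots and class attributes."""
    for name, definition in (schema.get("slots") or {}).items():
        yield "<slots>", name, definition or {}
    for class_name, class_def in (schema.get("classes") or {}).items():
        for name, definition in ((class_def or {}).get("attributes") or {}).items():
            yield class_name, name, definition or {}


def probe_ranges(schema):
    """Return (orphan enums, slots whose range names nothing that is defined)."""
    enums = set(schema.get("enums") or {})
    classes = set(schema.get("classes") or {})
    types = set(schema.get("types") or {}) | BUILTIN_TYPES
    known = enums | classes | types
    default_range = schema.get("default_range", "string")
    used, undefined = set(), []
    for owner, name, definition in iter_slot_defs(schema):
        rng = definition.get("range", default_range)
        used.add(rng)
        if rng not in known:
            undefined.append((owner, name, rng))
    return sorted(enums - used), undefined


def probe_stringly_typed(schema):
    """Return (owner, slot) pairs with range string and an enum-like name suffix."""
    default_range = schema.get("default_range", "string")
    return [
        (owner, name)
        for owner, name, definition in iter_slot_defs(schema)
        if definition.get("range", default_range) == "string"
        and name.endswith(SUSPECT_SUFFIXES)
    ]
```

The probes only list candidates; deciding whether a flagged string slot should really become an enum is part of the hand-composed report.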
Step 4 — Pipeline / writer audit
File: scripts/audit_writers.py
Run: uv run python scripts/audit_writers.py --out reports/pipeline_writers_audit.tsv
Walks scripts/ and src/traitmech/. For each module that writes YAML (heuristic: yaml.safe_dump / yaml.dump / .write_text( with a .yaml path hint), records:
- `appends_curation_history`: regex match on `curation_history.*append` or `record_curation_event` or `'curator':`
- `has_write_safeguard`: regex match on `--dry-run` / `dry_run\s*[:=]` OR the safer opt-in conventions `--apply` / `args.apply` / `--write` / `args.write`
- `validates_before_write`: regex match on `linkml-validate` or `TraitValidator` or `validator.validate(`
- `wired_into_just`: filename appears in `justfile`
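The four columns can be illustrated as regex probes over a module's source text. This sketch reuses the patterns named above; the function name, the exact writer heuristic, and the `yes`/`no` cell values are assumptions rather than a copy of `scripts/audit_writers.py`.

```python
import re

DUMPS_YAML = re.compile(r"yaml\.safe_dump|yaml\.dump")
PROBES = {
    "appends_curation_history": re.compile(
        r"curation_history.*append|record_curation_event|'curator':"
    ),
    "has_write_safeguard": re.compile(
        r"--dry-run|dry_run\s*[:=]|--apply|args\.apply|--write|args\.write"
    ),
    "validates_before_write": re.compile(
        r"linkml-validate|TraitValidator|validator\.validate\("
    ),
}


def looks_like_yaml_writer(source: str) -> bool:
    """Heuristic: a YAML dump call, or write_text plus a .yaml path hint."""
    if DUMPS_YAML.search(source):
        return True
    return ".write_text(" in source and ".yaml" in source


def audit_source(filename: str, source: str, justfile_text: str):
    """Return one audit row for a writer module, or None for non-writers."""
    if not looks_like_yaml_writer(source):
        return None
    row = {"file": filename}
    for column, pattern in PROBES.items():
        row[column] = "yes" if pattern.search(source) else "no"
    row["wired_into_just"] = "yes" if filename in justfile_text else "no"
    return row
```

Because these are text matches, they over-report: a module that merely mentions `--dry-run` in a docstring will pass the safeguard probe, so treat a `yes` as a prompt to read the code, not as proof.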
Hand-compose reports/pipeline_gap_audit.md from the TSV. Highlight writers missing safeguards, especially the seeder (scripts/seed_from_metpo.py) since it's the only path that creates new trait YAMLs.
Heads-up false positives: audit_writers.py itself and render_trait_pages.py will match the writer heuristic but aren't trait-YAML writers (the auditor reads, the renderer writes HTML). Note in the report rather than re-tune the regex.
Step 5 — Prioritized fix backlog
Compose reports/gap_fix_backlog.tsv (machine-readable) + reports/gap_fix_backlog.md (narrative). One row per actionable gap:
| column | example |
|---|---|
| id | G01 |
| category | pipeline / schema / instance |
| title | Add write-safeguard to a writer flagged has_write_safeguard=no |
| impact | Prevents accidental over-write of curator edits |
| effort | S / M / L |
| suggested_fix_path | scripts/<writer>.py |
| blocking | comma-separated upstream G-ids |
Rank by impact × (1/effort). Group narrative by tier so an implementer can pick the easiest big-wins first. Always lead with G01: enable the CI gate — without it, every other fix can regress on the next merge.
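One way to make the ranking rule concrete, as a sketch: map effort sizes to numeric costs, score each gap as impact divided by cost, and pin G01 to the top. The numeric scale (S=1, M=2, L=3) and a numeric `impact` score are assumptions; the backlog columns above carry impact as prose, so a score would have to be assigned by the auditor.

```python
# ASSUMPTION: effort sizes map to these relative costs.
EFFORT_COST = {"S": 1, "M": 2, "L": 3}


def priority(gap: dict) -> float:
    """impact x (1/effort), with impact as an auditor-assigned number."""
    return gap["impact"] / EFFORT_COST[gap["effort"]]


def rank(gaps: list) -> list:
    """G01 (the CI gate) always leads; the rest sort by descending priority."""
    pinned = [gap for gap in gaps if gap["id"] == "G01"]
    rest = sorted(
        (gap for gap in gaps if gap["id"] != "G01"),
        key=lambda gap: (-priority(gap), gap["id"]),  # id breaks ties stably
    )
    return pinned + rest


backlog = [
    {"id": "G02", "impact": 3, "effort": "L"},
    {"id": "G03", "impact": 4, "effort": "S"},
    {"id": "G01", "impact": 2, "effort": "M"},
    {"id": "G04", "impact": 4, "effort": "M"},
]
print([gap["id"] for gap in rank(backlog)])
```

With these illustrative numbers, G03 (high impact, small effort) ranks directly after the pinned G01, and G02 (large effort) comes last.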
Outputs (the deliverable surface)
scripts/
  validate_strict.py                  # harness (in-process closed-schema validator)
  audit_schema.py                     # schema probes
  audit_writers.py                    # writer audit
reports/
  instance_validation_failures.tsv    # one row per ERROR (regenerable)
  instance_validation_summary.md      # human report; counts + drivers
  schema_gap_audit.md                 # human report on schema findings
  pipeline_writers_audit.tsv          # writer/script audit (regenerable)
  pipeline_gap_audit.md               # human report on pipeline gaps
  gap_fix_backlog.tsv                 # backlog rows
  gap_fix_backlog.md                  # narrative, ranked
justfile
  validate-strict                     # new target wrapping the harness
  audit-schema                        # convenience target around audit_schema.py
  audit-writers                       # convenience target around audit_writers.py
.github/workflows/
  validate-strict.yaml                # CI gate on PRs touching schema or YAMLs (optional)
How to follow up an audit with cleanup
The audit finds drift; cleanup fixes it. Typical post-audit work, in dependency order:
- CI gate first (G01). Without it, the rest can regress. Add `.github/workflows/validate-strict.yaml` calling `just validate-strict`.
- Bulk field renames (if instance-axis errors appear). Each is a one-file Python script that mirrors the pattern in the schema-gap-analysis skill: idempotent, appends a `CurationEvent`, supports `--dry-run`.
- Schema-shape fixes. If a finding is "schema is too narrow for real data", broaden the schema (e.g. extend `pattern:` to admit additional CURIE prefixes) rather than migrate the data.
- Regenerate dataclasses after schema changes: `just gen-schema`. Always commit the regenerated file in the same commit as the schema change.
- Revalidate. Re-run `just validate-strict`, confirm the failing count drops, update `reports/instance_validation_summary.md` with before/after numbers.
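The bulk-rename pattern from the list above (idempotent, appends a curation event, supports a dry run) might look like the following sketch. It works on in-memory dicts so the logic is visible without file I/O. The event keys mirror the `record_curation_event` parameters mentioned in this document, but the stored layout and the field names `evidence_kind`/`evidence_type` are hypothetical.

```python
import copy


def rename_field(doc: dict, old: str, new: str, curator: str) -> bool:
    """Rename a top-level key in place; return False when there is nothing to do."""
    if old not in doc:
        return False  # already migrated, so a re-run is a no-op
    if new in doc:
        raise ValueError(f"both {old!r} and {new!r} present; resolve by hand")
    # Rebuild the mapping so the renamed key keeps its original position.
    renamed = {(new if key == old else key): value for key, value in doc.items()}
    doc.clear()
    doc.update(renamed)
    # ASSUMPTION: event layout; the real CurationEvent shape may differ.
    doc.setdefault("curation_history", []).append(
        {
            "curator": curator,
            "action": "rename_field",
            "changes": f"{old} -> {new}",
            "llm_assisted": False,
        }
    )
    return True


def run_migration(docs: dict, old: str, new: str, curator: str, dry_run: bool = True) -> list:
    """Return the paths that change; only replace documents when dry_run is False."""
    changed = []
    for path in sorted(docs):
        candidate = copy.deepcopy(docs[path])  # never touch the original in a dry run
        if rename_field(candidate, old, new, curator):
            changed.append(path)
            if not dry_run:
                docs[path] = candidate
    return changed
```

Running the migration a second time returns an empty list and appends no further events, which is what makes it safe to re-run after a partial failure.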
Anti-patterns (don't do these)
- Don't use the existing `just validate-all` to decide if the corpus is clean. It runs the CLI in open mode and ignores per-file exit codes; it will report success even when records have unknown fields. Use `just validate-strict`.
- Don't add `# noqa` / try/except blocks around validator results. If a record fails, the right move is to fix the schema or migrate the data, not silence the failure.
- Don't pile findings into a single dump. The split (instance / schema / pipeline / backlog) is so a reader can scan one section at a time.
- Don't run a full corpus pass when investigating a single regression. Use `just validate-strict --sample 20` or pass the specific file path.
- Don't broaden a regex pattern in the schema to "make tests pass" without understanding what the new values mean. Bake the explanation into the schema description.
- Don't forget to commit the regenerated `traitmech_dataclasses.py` alongside any schema change.
Write-time helpers (use in new and existing writer scripts)
Two shared modules close the audit loop by gating writes through the same closed-schema check the harness uses:
- `src/traitmech/validation/write_validated.py`: `write_validated_trait(doc, path)` refuses to dump if the doc fails closed-schema validation. Pair it with the harness: the harness catches existing drift on disk; the helper prevents new drift from being written.
- `src/traitmech/curate/curation_event.py`: `record_curation_event(doc, curator=..., action=..., changes=..., llm_assisted=...)` is the standard way to append a `CurationEvent` to `doc['curation_history']`. The five writer scripts under `scripts/` (`ground_causal_nodes`, `ground_causal_predicates`, `retype_causal_nodes`, `rename_predicate_labels`, `seed_from_metpo`) already route through both helpers; copy their pattern when adding a new writer.
If a writer audit row shows validates_before_write=no or appends_curation_history=no, the fix is to import and use these two helpers, not to add ad-hoc validation inline.
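To illustrate the idea behind a validate-then-write gate, here is a self-contained sketch. It is not the repo's `write_validated_trait`: the validator is a stand-in that only checks key names, serialization uses `json.dumps` in place of `yaml.safe_dump` so the example needs nothing beyond the standard library, and every name in it is hypothetical.

```python
import json
import tempfile
from pathlib import Path


class ValidationFailed(Exception):
    """Raised instead of writing when the document does not validate."""


def closed_check(doc: dict, allowed: set, required: set) -> list:
    """Stand-in for closed-schema validation: unknown and missing keys are errors."""
    errors = [f"unexpected field: {key}" for key in sorted(set(doc) - allowed)]
    errors += [f"missing required field: {key}" for key in sorted(required - set(doc))]
    return errors


def write_gated(doc: dict, path, validate, serialize=json.dumps) -> None:
    """Validate first; touch the filesystem only when there are no errors."""
    errors = validate(doc)
    if errors:
        raise ValidationFailed("; ".join(errors))
    Path(path).write_text(serialize(doc) + "\n", encoding="utf-8")


def check(doc: dict) -> list:
    # Hypothetical schema: only id and name are known, id is required.
    return closed_check(doc, allowed={"id", "name"}, required={"id"})


with tempfile.TemporaryDirectory() as tmp:
    write_gated({"id": "T1", "name": "example"}, Path(tmp) / "ok.yaml", check)
    try:
        # The misspelled key "nmae" is exactly the drift a closed check catches.
        write_gated({"id": "T2", "nmae": "typo"}, Path(tmp) / "bad.yaml", check)
    except ValidationFailed as exc:
        print(f"refused: {exc}")
```

Raising before any file is opened means a failed document never leaves a partial or invalid YAML on disk for the harness to find later.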
Cross-references
- `reports/instance_validation_summary.md`: the most recent run's full numbers and what's left.
- `reports/gap_fix_backlog.md`: the open items.
- `reports/schema_gap_audit.md`: schema findings with file:line refs.
- `reports/pipeline_gap_audit.md`: writer audit with safeguard table.
- Existing complementary skills:
  - `schema-gap-analysis`: the lightweight version of this skill; ~5 min total.
  - Cross-Mech framework + new-Mech bootstrap template: `claw/.claude/skills/schema-gap-analysis`
One-liner runbook
# Generate everything from a clean repo
just validate-strict # Step 2
uv run python scripts/audit_schema.py > /tmp/probes.md # Step 3 probes
uv run python scripts/audit_writers.py \
--out reports/pipeline_writers_audit.tsv # Step 4 probes
# Then hand-compose schema_gap_audit.md, pipeline_gap_audit.md,
# instance_validation_summary.md, gap_fix_backlog.{tsv,md}
# using the TSVs as evidence.