consolidate - SKILL.md Agent Skill

name: consolidate description: "Stage N3. Deep structural analysis of skill against real usage. Verifies gaps, prunes cruft, measures baseline health. Produces structural audit that feeds REM. Front-loaded in early cycles." context: fork user-invocable: false allowed-tools: - Read - Grep - Glob

Consolidate — Stage N3

Deep slow-wave analysis. ~25% of cycle time, concentrated in early cycles. Verifies that the skill's documented facts match reality, identifies structural drift, and optionally prunes accumulated cruft.

Also establishes the baseline measurement for REM's eval loop. Without N3's audit, REM can't measure whether patches improve anything.

Inputs

From orchestrator: sorted_manifest (from N2), target_skill (path), prune_mode (bool, default false), cycle_number.

Procedure

1. Deep Read

Read the entire SKILL.md. Build a structural inventory:

Sections with line ranges and types (context/procedure/reference/config)
Documented assumptions
Explicit limitations
Dependencies (tools, libraries)
Complexity score (0-1)

2. Verify Fragments

For each surviving fragment and theme:

Coverage verification (GAP fragments): Confirm the gap is real. Sometimes coverage exists in a non-obvious section. If confirmed, document precisely where coverage should exist.

Accuracy verification (FRIC fragments): Are the skill's instructions actually correct? Friction sometimes means wrong instructions, not just incomplete ones.

Assumption verification (EDGE fragments): Which documented assumptions break in this scenario?

Output per fragment:

verification:
  fragment_id: f001
  result: confirmed_gap | false_gap | partial_coverage | inaccurate
  evidence: "Sections 3.1-3.7 cover formatting but never mention RTL"
  severity: minor | moderate | major

3. Prune (if prune_mode)

Identify accumulated complexity that no longer serves a purpose:

Dead branches (no fragments or usage touch this area)
Redundancy (multiple sections saying the same thing)
Over-specification (exhaustive detail where a general rule suffices)
Defensive bloat (excessive caveats without failure evidence)

Only suggest removal when confident. A false prune is worse than no prune — you're removing institutional knowledge.

4. Measure Baseline

Produce health scores that REM uses as the "before" measurement:

health_baseline:
  coverage_score: 0.75    # Fraction of real usage covered
  complexity_score: 0.65  # How complex the skill is
  drift_score: 0.3        # Usage vs documented scope divergence
  cruft_score: 0.2        # Unnecessary complexity
  gap_count: 2            # Verified gaps
  friction_count: 1       # Verified friction points

These scores let REM measure whether a patch actually improved things.

5. Output Structural Audit

structural_audit:
  target_skill: /path/to/SKILL.md
  cycle_number: 1
  health_baseline: {coverage_score, complexity_score, drift_score, cruft_score}
  verified_gaps: [{fragment_id, gap, severity, affected_sections, recommended_location}]
  verified_friction: [{fragment_id, issue, severity}]
  assumption_failures: [{fragment_id, assumption, failure_scenario, severity}]
  prune_suggestions: []
  drift_analysis: {scope_expansions, scope_contractions, priority_mismatches}
  fragments_for_rem: [{id, reason}]

N3 by Cycle

Cycle	Depth	Prune	Focus
1	Full	Yes (deep sleep)	Complete structural audit
2	Moderate	No	Verify new fragments only
3	Light	No	Quick delta check
4+	Skip	N/A	N3 absent in late cycles

Behavioral Rules

Be thorough. This is the slow stage. Read every relevant section.
Be conservative with pruning. Only with positive evidence of irrelevance.
Distinguish gaps from limitations. Gaps are improvement targets. Limitations are design decisions. Don't "fix" intentional limitations.
Measure everything. REM needs quantitative baselines to evaluate patches.