data-normalization

name: data-normalization description: > Project-agnostic data normalization patterns for the 28-file Project Brain. Defines 8 repeatable patterns (N1-N8) that fix common data quality issues: status standardization, counter reconciliation, cross-reference linking, field backfill, schema compliance, computed totals, date/format normalization, and deduplication. Safe to re-run (idempotent). Outputs a repair plan before executing. version: 1.0.0

Data Normalization

Overview

This skill defines 8 normalization patterns that can be applied to any ForemanOS Project Brain to fix common data quality issues. Each pattern is idempotent (safe to re-run) and produces a repair plan showing what will change before executing.

The patterns complement the detection layer (data-integrity-watchdog agent, conflict-detection-agent) by providing the fix side of the data health stack.

Prerequisites

Before applying any pattern, the AI must:

Read the target JSON file(s) in full
Read ${CLAUDE_PLUGIN_ROOT}/skills/project-data/SKILL.md for canonical key names
Generate a repair plan listing every proposed change (field path, before value, after value)
Present the repair plan to the user for confirmation
Apply changes via a single atomic Python script that reads → modifies → writes all affected files
Validate the output (re-read and confirm changes applied)

Repair Plan Format

Before any changes, output a table:

| # | File | Path | Before | After | Pattern |
|---|------|------|--------|-------|---------|
| 1 | submittal-log.json | submittal_log[1].status | "submitted" | "SUBMITTED" | N1 |
| 2 | cost-data.json | summary.contingency_remaining | (missing) | 43225 | N6 |

Only proceed after user confirmation.

Pattern N1: Status Standardization

Problem: Mixed case and wording in status fields across different JSON files.

Logic: Each file type has canonical status enums. Normalize all status fields to match.

Canonical Status Values

File	Field	Valid Values
`submittal-log.json`	`submittal_log[].status`	`APPROVED`, `SUBMITTED`, `REVIEW_PENDING`, `REJECTED`, `RESUBMIT`, `APPROVED_AS_NOTED`
`submittal-log.json`	`submittal_log[].compliance_matrix.compliance_status`	`COMPLIANT`, `NON_COMPLIANT`, `REVIEW_PENDING`, `PARTIAL`
`inspection-log.json`	`inspection_log[].result`	`pass`, `fail`, `conditional`, `pending`
`risk-register.json`	`risks[].status`	`open`, `closed`, `monitoring`, `mitigated`
`procurement-log.json`	`procurement_log[].status`	`quoted`, `executed`, `executed_with_co`, `delivered`, `cancelled`
`rfi-log.json`	`rfi_log[].status`	`Draft`, `Submitted`, `Responded`, `Closed`
`change-order-log.json`	`change_order_log[].status`	`pending`, `approved`, `rejected`, `executed`
`punch-list.json`	`punch_list[].status`	`open`, `in_progress`, `completed`, `verified`
`action-items.json`	`items[].status`	`open`, `in_progress`, `resolved`, `deferred`
`project-config.json`	`documents_loaded[].status`	`current`, `extracted`, `received`, `finalized`, `filed`, `template`, `submitted`, `executed`, `superseded`, `superseded_by_conformance`, `logged`

Normalization Rules

Case-insensitive match to canonical value (e.g., "submitted" -> "SUBMITTED" for submittals)
Synonym mapping:
- "approved" / "APPROVED" / "Approved" -> APPROVED
- "review_pending" / "Review Pending" / "REVIEW PENDING" / "pending_review" -> REVIEW_PENDING
- "assumed_passed" -> pass (for inspections)
- "in progress" / "In Progress" / "IN_PROGRESS" -> in_progress
If no match found, flag for manual review (do not auto-fix)

Application

# Pseudocode
for entry in submittal_log:
    normalized = normalize_status(entry["status"], SUBMITTAL_STATUSES)
    if normalized != entry["status"]:
        plan.append(change(entry, "status", entry["status"], normalized, "N1"))

Pattern N2: Counter Reconciliation

Problem: Stored scalar counts drift from actual array lengths after edits.

Logic: Recount from source arrays, compare to stored scalar, fix if mismatched.

Known Counter Fields

File	Scalar Field	Source Array	Expected
`project-config.json`	`documents_loaded_count`	`documents_loaded[]`	`len(documents_loaded)`
`project-config.json`	`total_documents`	`documents_loaded[]`	`len(documents_loaded)`
`plans-spatial.json`	`room_count` (if present)	`room_schedule[]`	`len(room_schedule)`
`delay-log.json`	`total_weather_days` (if present)	`delay_events[]` where type=Weather	Sum of `days_impact`
`safety-log.json`	`osha_300_log.total_hours`	`labor-tracking.json` labor_entries	Sum of `hours`
`cost-data.json`	`summary.subtotal_direct`	`budget_by_division[]`	Sum of `total`
`pay-app-log.json`	`schedule_of_values.total_contract`	`schedule_of_values.line_items[]`	Sum of `scheduled_value`

Application

actual = len(data["documents_loaded"])
stored = data.get("documents_loaded_count", None)
if stored != actual:
    plan.append(change("documents_loaded_count", stored, actual, "N2"))

Pattern N3: Cross-Reference Linking

Problem: Missing bidirectional references between related records across files.

Logic: Match by vendor name, spec section, activity ID, or other shared identifiers. Add link fields.

Link Definitions

Source File	Source Field	Target File	Target Field	Match On
`submittal-log.json`	`submittal_log[].procurement_link`	`procurement-log.json`	`procurement_log[].po_number`	Vendor name match
`procurement-log.json`	`procurement_log[].submittal_ids[]`	`submittal-log.json`	`submittal_log[].submittal_id`	Vendor name match
`rfi-log.json`	`rfi_log[].related_submittals[]`	`submittal-log.json`	`submittal_log[].submittal_id`	Spec section match
`schedule.json`	`activities[].linked_submittals[]`	`submittal-log.json`	`submittal_log[].submittal_id`	Activity ID match
`change-order-log.json`	`change_order_log[].procurement_link`	`procurement-log.json`	`procurement_log[].po_number`	Description/vendor match
`inspection-log.json`	`inspection_log[].linked_activity`	`schedule.json`	`activities[].id`	Activity match
`quality-data.json`	`inspections[].submittal_ref`	`submittal-log.json`	`submittal_log[].submittal_id`	Spec section match

Matching Rules

Vendor match: Case-insensitive substring match on supplier/vendor name. "Schiller" matches "Schiller Architectural Hardware".
Spec section match: Normalize to format "XX XX XX" (spaces, no dashes). "08 11 13" matches "081113".
Activity ID match: Exact match on linked_schedule_activity_id or activity_id.
Only add links — never remove existing links. If a link field already exists and has a value, skip.

Application

# Match submittals to procurement by vendor
for sub in submittal_log:
    if "procurement_link" not in sub or not sub["procurement_link"]:
        for po in procurement_log:
            if vendor_match(sub.get("vendor",""), po.get("supplier","")):
                sub["procurement_link"] = po["po_number"]
                po.setdefault("submittal_ids", [])
                if sub["submittal_id"] not in po["submittal_ids"]:
                    po["submittal_ids"].append(sub["submittal_id"])
                break

Pattern N4: Field Backfill

Problem: Array entries missing required fields (status, id, type, dates).

Logic: Apply defaults based on existing data context.

Required Fields by File

File	Array	Required Fields	Default Logic
`project-config.json`	`documents_loaded[]`	`id`, `status`	`id`: "DOC-{index+1:03d}". `status`: "extracted" if has extraction_data, "received" if has filename only
`submittal-log.json`	`submittal_log[]`	`submittal_id`, `status`, `spec_section`	`submittal_id`: "SUB-{index+1:03d}". `status`: "SUBMITTED" if date exists
`procurement-log.json`	`procurement_log[]`	`po_number`, `status`, `supplier`	`status`: "quoted" if has quote_date, "executed" if has executed_date
`rfi-log.json`	`rfi_log[]`	`id`, `status`, `date_created`	`id`: "RFI-{index+1:03d}". `status`: "Draft" if no response
`inspection-log.json`	`inspection_log[]`	`id`, `result`, `date`	`result`: "pending" if no result recorded
`labor-tracking.json`	`labor_entries[]`	`id`, `date`, `company`	`id`: "LBR-{index+1:03d}"
`risk-register.json`	`risks[]`	`id`, `status`, `probability`, `impact`	`status`: "open" if no status

Application

for i, doc in enumerate(documents_loaded):
    if "id" not in doc:
        doc["id"] = f"DOC-{i+1:03d}"
        plan.append(change(f"documents_loaded[{i}].id", "(missing)", doc["id"], "N4"))
    if "status" not in doc:
        status = infer_status(doc)
        doc["status"] = status
        plan.append(change(f"documents_loaded[{i}].status", "(missing)", status, "N4"))

Pattern N5: Key Schema Compliance

Problem: Wrong key names or missing required top-level keys.

Logic: Check against canonical keys from the project-data skill. Rename or add missing keys.

Canonical Top-Level Keys

Reference: foremanos-core/skills/project-data/references/json-schema-reference.md

File	Required Top-Level Keys
`schedule.json`	`activities`, `milestones`, `critical_path`, `weather_sensitive_activities`
`delay-log.json`	`delay_events`
`rfi-log.json`	`rfi_log`
`meeting-log.json`	`meeting_log`
`inspection-log.json`	`inspection_log`, `permit_log`

Common Renames

Wrong Key	Correct Key	File
`construction_activities`	`activities`	schedule.json
`delays`	`delay_events`	delay-log.json
`rfis`	`rfi_log`	rfi-log.json
`meetings`	`meeting_log`	meeting-log.json

Application

if "construction_activities" in schedule and "activities" not in schedule:
    schedule["activities"] = schedule.pop("construction_activities")
    plan.append(rename("construction_activities", "activities", "N5"))

Pattern N6: Computed Totals

Problem: Missing aggregated values that should be computed from child records.

Logic: Sum from child records, write to parent field.

Computed Fields

File	Computed Field	Formula
`cost-data.json`	`summary.subtotal_direct`	`sum(budget_by_division[].total)`
`cost-data.json`	`summary.contingency_used`	`sum(change_orders where source=contingency)` or from CO log
`cost-data.json`	`summary.contingency_remaining`	`contingency - contingency_used`
`cost-data.json`	`summary.total_committed`	`sum(budget_by_division[].committed_costs)`
`cost-data.json`	`budget_by_division[].current_amount`	`total + sum(applied_cos[].amount)`
`risk-register.json`	`total_exposure`	`sum(risks[].exposure)`
`pay-app-log.json`	`current_retainage`	Computed from pay app percentages
`labor-tracking.json`	`total_hours`	`sum(labor_entries[].hours)`
`delay-log.json`	`total_delay_days`	`sum(delay_events[].days_impact)`

Application

used = sum(co["amount"] for co in change_orders if co.get("source") == "contingency")
# Or from known COs:
used = 6775  # Nucor CO#1
remaining = summary["contingency"] - used
if "contingency_used" not in summary:
    summary["contingency_used"] = used
    plan.append(change("summary.contingency_used", "(missing)", used, "N6"))
if "contingency_remaining" not in summary:
    summary["contingency_remaining"] = remaining
    plan.append(change("summary.contingency_remaining", "(missing)", remaining, "N6"))

Pattern N7: Date/Format Normalization

Problem: Inconsistent date formats and number types across files.

Logic: Standardize all dates to ISO 8601, all numeric values as numbers (not strings).

Rules

Dates: All date fields must be ISO 8601 format: YYYY-MM-DD
- "01/22/2026" -> "2026-01-22"
- "Jan 22, 2026" -> "2026-01-22"
- "1/22/26" -> "2026-01-22"
- If ambiguous (e.g., "02/03/26"), prefer US format (MM/DD/YY)
Numbers: Numeric fields must be actual numbers, not strings
- "184500" -> 184500
- "$184,500" -> 184500
- "15.5%" -> 0.155 (for rate fields) or 15.5 (for display fields — context-dependent)
Booleans: Boolean fields must be actual booleans
- "true" / "yes" / "Y" -> true
- "false" / "no" / "N" -> false

Date Fields to Check

File	Date Fields
`project-config.json`	`documents_loaded[].date`, `version_history[].date`
`submittal-log.json`	`submittal_log[].date`, `[].review_date`, `[].must_submit_by_date`
`procurement-log.json`	`procurement_log[].quote_date`, `[].executed_date`, `[].delivery_schedule.*`
`schedule.json`	`milestones[].date`, `activities[].start`, `[].finish`
`rfi-log.json`	`rfi_log[].date_created`, `[].date_responded`
`change-order-log.json`	`change_order_log[].date`, `[].approved_date`
`inspection-log.json`	`inspection_log[].date`, `[].next_date`

Pattern N8: Deduplication

Problem: Duplicate entries in arrays from multiple extraction passes or manual entry.

Logic: Detect by ID field or fuzzy match on description + date. Merge or flag.

Detection Rules

Exact ID match: Two entries with same id / submittal_id / po_number -> merge (keep the one with more fields populated)
Fuzzy match: Same vendor + same spec_section + dates within 7 days -> flag for review
Content match: Same description (case-insensitive, trimmed) + same date -> merge

Merge Strategy

When merging duplicates:

Keep the entry with more populated fields
Copy any non-null fields from the less-complete entry that are missing in the keeper
Record the merge in a _merge_note field: "Merged from duplicate entry on {date}"
Never delete — mark the duplicate with _duplicate_of: "{keeper_id}" and remove from the main array

Application

seen = {}
for entry in submittal_log:
    key = entry.get("submittal_id")
    if key in seen:
        # Merge: keep entry with more fields
        keeper = seen[key] if len(seen[key]) >= len(entry) else entry
        donor = entry if keeper is seen[key] else seen[key]
        for k, v in donor.items():
            if k not in keeper or keeper[k] is None:
                keeper[k] = v
        keeper["_merge_note"] = f"Merged duplicate {donor.get('submittal_id')}"
        plan.append(merge(key, "N8"))
    else:
        seen[key] = entry

Execution Order

When running all 8 patterns on a Project Brain:

N5 (Schema Compliance) — Fix key names first so other patterns find data
N7 (Date/Format) — Normalize types before comparisons
N8 (Deduplication) — Remove duplicates before counting
N1 (Status Standardization) — Normalize status values
N4 (Field Backfill) — Fill missing required fields
N2 (Counter Reconciliation) — Recount after dedup and backfill
N3 (Cross-Reference Linking) — Link after all entries are clean
N6 (Computed Totals) — Compute aggregates last

Integration with `/data-health` Command

The /data-health command in the foremanos-compliance plugin orchestrates these patterns:

/data-health scan — Runs detection only (finds issues that N1-N8 would fix)
/data-health fix — Applies patterns with confirmation
/data-health report — Generates health score including normalization coverage

Constraints

Never delete data — Only add, rename, or modify fields
Always confirm — Present repair plan before any changes
Atomic writes — All changes to a file happen in one write operation
Preserve formatting — Use json.dumps(data, indent=2, ensure_ascii=False) for output

Log changes — After applying, add a _normalization_log entry to project-config.json version_history:

{
  "date": "2026-02-26",
  "action": "data_normalization",
  "patterns_applied": ["N1", "N3", "N6"],
  "changes_count": 12,
  "files_modified": ["submittal-log.json", "procurement-log.json", "cost-data.json"]
}