zbeam-content-auditor - SKILL.md Agent Skill

name: zbeam-content-auditor
description: "Scans all frontmatter pages for thinness and similarity, producing a prioritized remediation map. Run before any batch improvement work."

Z-Beam Content Auditor

Manual-only skill — run before any batch improvement work, or when pipeline-auditor flags a category as needing remediation. Not automated. Trigger: "audit content quality", "scan for thin pages", "what needs improvement", or before running zbeam-category-enricher. Output feeds: zbeam-category-enricher → zbeam-page-updater.

You are scanning all existing frontmatter pages to understand the actual state of content quality — not theoretically, but from the files themselves. The findings feed the category enricher and pipeline auditor.

Output: data/audit/content-audit-[YYYY-MM-DD].json and data/audit/content-audit-[YYYY-MM-DD].md

Important: two distinct page schemas

Material pages and application pages have fundamentally different structures. Score them separately — do not apply application-page standards to material pages.

Material page schema

Key fields: dataCard.machineSettings, dataCard.metrics, properties.materialCharacteristics, properties.laserMaterialInteraction, pageDescription (top-level), faq.items, relationships

Application page schema

Key fields: content.introduction, content.observations, content.context, content.processNotes, content.limitations, pageDescription (top-level), enhancements.summary, faq.items

Step 1: Collect all pages

Incremental mode — skip pages that haven't changed since the last audit:

# Find the most recent audit file (date-stamped names sort chronologically)
LAST_AUDIT=$(ls data/audit/content-audit-*.json 2>/dev/null | sort | tail -1)

if [ -n "$LAST_AUDIT" ]; then
  echo "Last audit: $LAST_AUDIT (only re-scoring modified files)"
  # Files whose modification time is newer than the last audit file
  find frontmatter/materials frontmatter/applications frontmatter/locations \
    -name "*.yaml" -newer "$LAST_AUDIT" 2>/dev/null
else
  echo "No prior audit found; running a full scan"
  ls frontmatter/materials/*.yaml frontmatter/applications/*.yaml frontmatter/locations/*.yaml 2>/dev/null
fi

For a forced full rescan (e.g., after schema changes), pass the --full flag or delete the last audit file.

Load the previous audit JSON to carry forward scores for unchanged pages — merge new scores with existing entries rather than rebuilding from scratch.
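
A minimal sketch of the carry-forward merge described above. It assumes the audit JSON keeps per-page entries in a dict keyed by slug under a `pages` key; that layout and the name `merge_audit` are hypothetical, and the real schema is defined in references/output-schema.md.

```python
import copy

def merge_audit(previous: dict, rescored: dict) -> dict:
    """Carry forward old entries and overwrite only the re-scored pages.

    Assumes a hypothetical layout: {"pages": {slug: entry, ...}}.
    """
    merged = copy.deepcopy(previous) if previous else {"pages": {}}
    merged.setdefault("pages", {})
    for slug, entry in rescored.items():
        merged["pages"][slug] = entry  # the new score wins for changed files
    return merged

previous = {"pages": {"aluminum": {"thinness": 4}, "copper": {"thinness": 2}}}
rescored = {"copper": {"thinness": 5}}
merged = merge_audit(previous, rescored)
```

The deep copy keeps the previous audit untouched, so the old file can still be compared against the merged result.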

Token-efficient scoring: Use skills/shared/yaml-select.py to extract only the fields needed for scoring rather than loading each full YAML (400–600 lines). This cuts token cost ~80% when auditing all 153+ pages.

# Extract only the fields needed for thinness scoring — no full YAML load
python3 skills/shared/yaml-select.py frontmatter/materials/[slug].yaml \
    pageDescription faq.items dataCard.metrics properties.laserMaterialInteraction.description \
    --compact

Write a Python script to load the changed files and extract scoring data. Run it via bash.
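
One way the extraction script might resolve dotted field paths such as faq.items once a page has been parsed into a dict. The helper name `get_path` is hypothetical; this is an illustration only, not the implementation of skills/shared/yaml-select.py.

```python
from typing import Any

def get_path(data: Any, dotted: str, default: Any = None) -> Any:
    """Follow a dotted path through nested dicts; return default if any hop is missing."""
    node = data
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

# Stand-in for a parsed material page (illustrative values only)
page = {
    "pageDescription": "text",
    "faq": {"items": [{"title": "Q1"}]},
    "dataCard": {"metrics": {}},
}
fields = [
    "pageDescription",
    "faq.items",
    "dataCard.metrics",
    "properties.laserMaterialInteraction.description",
]
scoring_data = {field: get_path(page, field) for field in fields}
```

Missing paths come back as None rather than raising, which keeps a batch run going when a page lacks a section.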

Step 2: Score each page

For material pages — thinness score

dataCard attribution (most critical issue): Check whether dataCard.metrics values have a source field. Current pages have confidence: high but no source URL or citation. Each unattributed metric value counts as a thinness flag.
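
A small sketch of the attribution check, assuming each entry under dataCard.metrics is a mapping that may carry a `source` key. The function name and the sample values are illustrative.

```python
def unattributed_metrics(page: dict) -> list[str]:
    """Return metric keys under dataCard.metrics that carry no non-empty source."""
    metrics = (page.get("dataCard") or {}).get("metrics") or {}
    flagged = []
    for key, metric in metrics.items():
        # A bare scalar has nowhere to hold a citation, so it is flagged too
        source = metric.get("source") if isinstance(metric, dict) else None
        if not source:
            flagged.append(key)
    return flagged

page = {
    "dataCard": {
        "metrics": {
            "ablationThreshold": {"value": 1.2, "confidence": "high"},
            "thermalDiffusivity": {"value": 97, "confidence": "high", "source": "handbook citation"},
        }
    }
}
flags = unattributed_metrics(page)
```

Each key in the returned list counts as one thinness flag.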

dataCard vs properties conflict detection: Check for the same metric key in BOTH dataCard.metrics AND properties.laserMaterialInteraction. When both exist, compare values — mismatches indicate one is unattributed generic fill. Flag as a conflict with both values listed. Common conflict fields: ablationThreshold, laserDamageThreshold, thermalDiffusivity. The properties.laserMaterialInteraction value is typically better sourced and should be treated as authoritative when it has a citation.
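
A sketch of the conflict check, assuming metric entries are either bare numbers or mappings with a `value` key, and that both locations use the same units. The 1% relative tolerance is an arbitrary assumption, not a project rule.

```python
import math

def metric_value(entry):
    """Accept either a bare number or a mapping with a 'value' key (assumed shapes)."""
    return entry.get("value") if isinstance(entry, dict) else entry

def is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def find_conflicts(page: dict, rel_tol: float = 0.01) -> list[dict]:
    """List keys present in both locations whose values disagree."""
    metrics = (page.get("dataCard") or {}).get("metrics") or {}
    lmi = (page.get("properties") or {}).get("laserMaterialInteraction") or {}
    conflicts = []
    for key in sorted(set(metrics) & set(lmi)):
        a, b = metric_value(metrics[key]), metric_value(lmi[key])
        if is_number(a) and is_number(b):
            same = math.isclose(a, b, rel_tol=rel_tol)
        else:
            same = a == b
        if not same:
            conflicts.append({"field": key, "dataCard": a, "properties": b})
    return conflicts

page = {
    "dataCard": {"metrics": {
        "ablationThreshold": {"value": 1.2},
        "thermalDiffusivity": {"value": 97.0},
    }},
    "properties": {"laserMaterialInteraction": {
        "description": "prose",
        "ablationThreshold": {"value": 2.5, "source": "cited"},
        "thermalDiffusivity": {"value": 97.0},
    }},
}
conflicts = find_conflicts(page)
```

Both values are kept in each conflict record so the remediation map can list them side by side.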

Parameter range plausibility: Check dataCard.machineSettings for implausibly wide ranges that suggest generic fill rather than material-specific research. A min: 0.1, max: 20 energy density range applies to everything — it is not material-specific data.
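
A sketch of the range check, assuming each machine setting is a mapping with numeric `min` and `max` keys. The `max_ratio` cutoff of 50 is a hypothetical threshold chosen for illustration; the 0.1 to 20 example above has a ratio of 200.

```python
def is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def implausibly_wide(settings: dict, max_ratio: float = 50.0) -> list[str]:
    """Return setting names whose max/min ratio exceeds max_ratio."""
    flagged = []
    for name, setting in settings.items():
        if not isinstance(setting, dict):
            continue
        low, high = setting.get("min"), setting.get("max")
        if not (is_number(low) and is_number(high)):
            continue
        # A non-positive min has no finite ratio; flag it for manual review
        if low <= 0 or high / low > max_ratio:
            flagged.append(name)
    return flagged

settings = {
    "energyDensity": {"min": 0.1, "max": 20},
    "wavelength": {"min": 1064, "max": 1064},
}
wide = implausibly_wide(settings)
```

A ratio test is only one possible heuristic. Settings that are naturally wide may need their own cutoffs.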

sectionDescription vs description duplication: Check whether properties.laserMaterialInteraction.section.sectionDescription and properties.laserMaterialInteraction.description say essentially the same thing. The component renders both: section.sectionDescription as the section subtitle and description as a prose paragraph above PropertyBars. If both cover the same topic, users read the same content twice. Flag the page when the first 8 words of the two fields overlap significantly. The fix: sectionDescription should be neutral framing, while description carries the insight.
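
A sketch of the opening-word comparison. The source does not define "significantly", so the overlap ratio and the 0.6 cutoff used in the usage lines are assumptions.

```python
import re

def opening_words(text: str, n: int = 8) -> list[str]:
    """Lowercase word tokens from the start of the text."""
    return re.findall(r"[a-z0-9']+", text.lower())[:n]

def opening_overlap(section_description: str, description: str, n: int = 8) -> float:
    """Share of opening words the two fields have in common (0.0 to 1.0)."""
    a = set(opening_words(section_description, n))
    b = set(opening_words(description, n))
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

duplicate = opening_overlap(
    "Aluminum reflects most near-infrared laser energy at 1064 nm wavelengths",
    "Aluminum reflects most near-infrared laser energy, which limits absorption",
)
distinct = opening_overlap(
    "How laser light interacts with this material",
    "Aluminum reflects most near-infrared laser energy, which limits absorption",
)
is_duplicate = duplicate > 0.6  # hypothetical cutoff
```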

Missing structured properties with sibling available: Check whether properties.materialCharacteristics and properties.laserMaterialInteraction have ONLY section, label, description, percentage (no numeric fields with value keys). If so, check whether a more specific sibling page exists (e.g., stainless-steel-304 for general stainless-steel). If a richer sibling exists, flag this as a structured-properties gap — the page-updater can port the structured fields from the sibling.

Properties description specificity: Check properties.laserMaterialInteraction.description for generic language. Flag sentences containing: "bounces back", "challenging", "precise parameter control", "effective", "cleans components", "without damaging", "preserves material integrity" as template language likely shared across multiple pages.
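
A sketch of the phrase scan using the list above. It matches on word boundaries, so "effective" does not match "effectively"; switch to plain substring matching if that variant should be caught too. The sentence splitter is deliberately naive.

```python
import re

TEMPLATE_PHRASES = [
    "bounces back", "challenging", "precise parameter control", "effective",
    "cleans components", "without damaging", "preserves material integrity",
]

def template_sentences(description: str) -> list[tuple[str, list[str]]]:
    """Return (sentence, matched phrases) for each sentence with template language."""
    sentences = re.split(r"(?<=[.!?])\s+", description.strip())
    flagged = []
    for sentence in sentences:
        lowered = sentence.lower()
        hits = [
            phrase for phrase in TEMPLATE_PHRASES
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
        ]
        if hits:
            flagged.append((sentence, hits))
    return flagged

text = (
    "Laser energy bounces back from polished copper. "
    "The ablation threshold is 1.2 J/cm2. "
    "This is an effective method."
)
flagged = template_sentences(text)
```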

FAQ formula detection: Check faq.items for questions matching the pattern "How is [material] laser cleaner used on [material]?" — this formula appears across 142 of 153 pages and indicates template generation, not real user questions.
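
A sketch of the formula match as a case-insensitive regular expression. It takes question strings that have already been extracted from faq.items; the function name is illustrative.

```python
import re

# Mirrors the pattern "How is [material] laser cleaner used on [material]?"
FAQ_FORMULA = re.compile(
    r"^\s*how is .+? laser cleaner used on .+?\?\s*$", re.IGNORECASE
)

def formula_questions(questions: list[str]) -> list[str]:
    """Return the questions that follow the template formula."""
    return [q for q in questions if FAQ_FORMULA.match(q)]

questions = [
    "How is aluminum laser cleaner used on aluminum?",
    "What fluence removes anodizing without pitting?",
    "how is Copper laser cleaner used on copper?",
]
templated = formula_questions(questions)
```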

Schema discovery required: before extracting FAQ text, sniff list(items[0].keys()). Material pages use title/content; other page types may use question/answer. Hardcoding either key silently returns empty strings on the wrong page type. See skills/YAML_UTILS.md for the canonical multi-key fallback (get_field()).
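
A sketch of a multi-key fallback for FAQ items, covering the title/content and question/answer pairs named above. `first_present_field`, `sniff_keys`, and `faq_pairs` are hypothetical names; this is not the canonical get_field() from skills/YAML_UTILS.md.

```python
QUESTION_KEYS = ("title", "question")
ANSWER_KEYS = ("content", "answer")

def sniff_keys(items: list) -> list[str]:
    """Report the keys of the first FAQ item so the schema is visible before extraction."""
    return list(items[0].keys()) if items else []

def first_present_field(item: dict, candidates: tuple) -> str:
    """Return the first non-empty string found under any candidate key."""
    for key in candidates:
        value = item.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return ""

def faq_pairs(items: list) -> list[tuple[str, str]]:
    """Extract (question, answer) pairs regardless of which key pair the page uses."""
    return [
        (first_present_field(item, QUESTION_KEYS), first_present_field(item, ANSWER_KEYS))
        for item in items or []
    ]

material_items = [{"title": "Q1?", "content": "A1."}]
other_items = [{"question": "Q2?", "answer": "A2."}]
pairs = faq_pairs(material_items) + faq_pairs(other_items)
```

An empty string in a returned pair signals a schema the candidate lists do not cover yet.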

pageDescription thinness (blocking quality flag): Read pageDescription (top-level). Flag it as critical if it is under 60 words or contains no quantitative data (no measurements, regulatory thresholds, or sourced findings). A thin pageDescription is a Dim 4 FAIL at validation. In the thinnessSummary, add a thinPageDescription count.

Also flag any page that still has a top-level aiSummary key — this is a legacy field that should be migrated to pageDescription (top-level) and removed.
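
A sketch of the two pageDescription checks above. Treating any digit as evidence of quantitative data is a crude assumption made for illustration; a real check would look for units, rule numbers, or citations.

```python
import re

def page_description_flags(page: dict, min_words: int = 60) -> dict:
    """Flag a thin pageDescription and a leftover legacy aiSummary key."""
    text = page.get("pageDescription") or ""
    word_count = len(text.split())
    has_numbers = bool(re.search(r"\d", text))  # crude proxy for quantitative data
    return {
        "wordCount": word_count,
        "thinPageDescription": word_count < min_words or not has_numbers,
        "legacyAiSummary": "aiSummary" in page,
    }

short_page = {"pageDescription": "Laser cleaning for aluminum.", "aiSummary": "old"}
long_vague_page = {"pageDescription": " ".join(["word"] * 80)}
long_specific_page = {"pageDescription": " ".join(["word"] * 60) + " 1064 nm"}

short_flags = page_description_flags(short_page)
vague_flags = page_description_flags(long_vague_page)
specific_flags = page_description_flags(long_specific_page)
```

Summing thinPageDescription across pages gives the count for the thinnessSummary.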

Section word-count uniformity (machine-authorship signal):

python3 -c "
import statistics, sys
import yaml

def collect(node, out):
    # Walk the parsed YAML and gather every description / sectionDescription string
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ('sectionDescription', 'description') and isinstance(value, str):
                out.append(value)
            else:
                collect(value, out)
    elif isinstance(node, list):
        for item in node:
            collect(item, out)
    return out

with open(sys.argv[1]) as fh:
    data = yaml.safe_load(fh)
# Ignore very short fields (5 words or fewer) so labels do not skew the spread
counts = [len(f.split()) for f in collect(data, []) if len(f.split()) > 5]
if len(counts) >= 4:
    sd = statistics.stdev(counts)
    print(f'Field std dev: {sd:.1f} words across {len(counts)} fields')
    if sd < 30:
        print('⚠️  UNIFORM: machine-authorship signal')
" [path]

Flag pages with std dev < 30 words across 4+ description/sectionDescription fields as uniformLengths: true in the remediation map entry.

For application pages — thinness score

Count sentences in all prose fields containing:

Specific measurements with units (J/cm², nm, ns, Ra, μg/m³)
Regulatory rule numbers (BAAQMD Rule X, Cal/OSHA Title 8 Section X)
Standard designations (SSPC-SP X, NACE No. X, AWS D1.X)

Flag sections with zero quantitative specifics as thin.
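
A sketch of the sentence count for application pages, built from the three kinds of specifics listed above. The regular expressions cover only the example units and designations named in the list and are assumptions about how those appear in prose. The sentence splitter breaks only before a capital letter so that "NACE No. 2" stays in one sentence.

```python
import re

SPECIFIC = re.compile(
    r"\d+(?:\.\d+)?\s*(?:J/cm[²2]|nm|ns|[μµ]g/m[³3])(?![A-Za-z])"  # measurements with units
    r"|\bRa\s*[<>≤≥=]?\s*\d"                                        # surface roughness values
    r"|\bBAAQMD\s+Rule\s+\d|\bCal/OSHA\s+Title\s+8\s+Section\s+\d"  # regulatory rule numbers
    r"|\bSSPC-SP\s*\d|\bNACE\s+No\.\s*\d|\bAWS\s+D1\.\d"            # standard designations
)

def flatten_text(node) -> str:
    """Join every string found in a section, whatever its nesting."""
    if isinstance(node, str):
        return node
    if isinstance(node, dict):
        return " ".join(flatten_text(value) for value in node.values())
    if isinstance(node, list):
        return " ".join(flatten_text(value) for value in node)
    return ""

def quantitative_sentences(text: str) -> int:
    """Count sentences containing at least one quantitative specific."""
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return sum(1 for sentence in sentences if SPECIFIC.search(sentence))

def thin_sections(content: dict) -> list[str]:
    """Return names of sections with zero quantitative sentences."""
    return [
        name for name, section in content.items()
        if quantitative_sentences(flatten_text(section)) == 0
    ]

content = {
    "introduction": {"sectionDescription": "Laser cleaning removes rust. It works well."},
    "processNotes": (
        "Fluence stays below 2.5 J/cm² at 1064 nm. "
        "Surfaces meet SSPC-SP 10 after one pass. "
        "Operators follow NACE No. 2 guidance."
    ),
}
thin = thin_sections(content)
```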

Similarity score — both page types

Run this analysis per category/subcategory group:

Lexical overlap: extract the first sentence of the primary description field (properties.laserMaterialInteraction.description for materials, content.introduction.sectionDescription for applications). Count how many pages in the same subcategory share the same opening 5-word pattern.
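
A sketch of the opening-pattern count for one subcategory. It assumes the primary description text has already been extracted per page and is passed in as a slug-to-text mapping.

```python
import re

def opening_pattern(description: str, n: int = 5) -> tuple:
    """First n lowercase words of the first sentence."""
    first_sentence = re.split(r"(?<=[.!?])\s+", description.strip(), maxsplit=1)[0]
    return tuple(re.findall(r"[a-z0-9']+", first_sentence.lower())[:n])

def shared_openings(descriptions: dict) -> dict:
    """Map each opening pattern used by more than one page to the slugs that share it."""
    groups = {}
    for slug, text in descriptions.items():
        groups.setdefault(opening_pattern(text), []).append(slug)
    return {
        pattern: slugs for pattern, slugs in groups.items()
        if pattern and len(slugs) > 1
    }

descriptions = {
    "a": "Laser cleaning of this metal requires care. More detail follows.",
    "b": "Laser cleaning of this metal is common.",
    "c": "Oxide layers absorb strongly at 1064 nm.",
}
shared = shared_openings(descriptions)
```

Exact matching misses openings that differ only by the material name, so a stricter audit might mask the material before comparing.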

Structural uniformity: check faq.items count — 142/153 material pages have exactly 4 items. Flag any subcategory where >80% of pages have identical FAQ counts.
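
A sketch of the FAQ-count uniformity check for a single subcategory, using the strict >80% threshold stated above. The input is an assumed slug-to-count mapping.

```python
from collections import Counter

def uniform_faq_counts(faq_counts: dict, threshold: float = 0.8):
    """Report the most common FAQ count and whether its share exceeds the threshold."""
    if not faq_counts:
        return None
    count, pages = Counter(faq_counts.values()).most_common(1)[0]
    share = pages / len(faq_counts)
    return {"faqCount": count, "share": share, "flag": share > threshold}

flagged_group = uniform_faq_counts({"a": 4, "b": 4, "c": 4, "d": 4, "e": 4, "f": 3})
borderline_group = uniform_faq_counts({"a": 4, "b": 4, "c": 4, "d": 4, "e": 3})
```

A group at exactly 80% is not flagged because the comparison is strictly greater than.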

Vocabulary fingerprint: extract the 10 most distinctive non-stopword terms from each page's description fields. Pages in the same subcategory sharing >5 of these 10 terms are flagged as lexically similar.
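
A sketch of one possible fingerprint. The source does not say how "distinctive" is measured, so this assumes a smoothed TF-IDF weight computed over the pages passed in, with a small hand-written stopword list. Both choices are assumptions. Sharing more than 5 terms is implemented as 6 or more.

```python
import math
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "with",
    "is", "are", "it", "this", "that", "by", "as", "at", "from",
}

def tokens(text: str) -> list[str]:
    return [t for t in re.findall(r"[a-z][a-z'-]+", text.lower()) if t not in STOPWORDS]

def fingerprints(pages: dict, top_n: int = 10) -> dict:
    """Top-n terms per page, ranked by term frequency times smoothed IDF."""
    tokenized = {slug: Counter(tokens(text)) for slug, text in pages.items()}
    doc_freq = Counter(term for counts in tokenized.values() for term in counts)
    n_docs = len(pages)
    result = {}
    for slug, counts in tokenized.items():
        score = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        ranked = sorted(score, key=lambda term: (-score[term], term))
        result[slug] = set(ranked[:top_n])
    return result

def similar_pairs(prints: dict, min_shared: int = 6) -> list[tuple[str, str]]:
    """Pairs of pages sharing at least min_shared fingerprint terms."""
    return [
        (a, b) for a, b in combinations(sorted(prints), 2)
        if len(prints[a] & prints[b]) >= min_shared
    ]

pages = {
    "a": "fluence ablation oxide reflectivity wavelength pulse threshold absorption substrate residue",
    "b": "fluence ablation oxide reflectivity wavelength pulse threshold absorption coating plasma",
    "c": "marble limestone granite sandstone mortar brick lichen soot graffiti patina",
}
pairs = similar_pairs(fingerprints(pages))
```

The smoothing keeps terms that appear on every page at a positive weight, so shared template vocabulary can still show up in a fingerprint.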

Step 3: Cross-reference with performance data

Load the most recent file matching data/search-console/weekly-summary-*.json (sort by date, take latest). For each page URL, look up impressions and clicks.

Load the most recent file matching data/ai-search/weekly-*.json (sort by date, take latest) if available. For each query, check whether the page URL appears in citation results.
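
A sketch of the "sort by date, take latest" lookup for both data sources. It assumes the date in each filename is written as YYYY-MM-DD, so that sorting by name is the same as sorting by date; `latest_file` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

def latest_file(directory: str, pattern: str) -> Optional[Path]:
    """Return the match with the lexically greatest name, or None if nothing matches."""
    matches = sorted(Path(directory).glob(pattern), key=lambda path: path.name)
    return matches[-1] if matches else None

# Usage with the paths named above (results depend on what is on disk)
gsc_file = latest_file("data/search-console", "weekly-summary-*.json")
ai_file = latest_file("data/ai-search", "weekly-*.json")
```

A None result for the AI search file is expected when that data is not available, so the citation-gap step can be skipped.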

This produces a priority score combining:

Thinness score (how much improvement is possible)
Performance potential (is anyone searching for this?)
Citation gap (is AI search ignoring this page?)

Priority tiers:

Critical: (High GSC impressions OR any AI tracker coverage) + thin/generic
High: Medium search potential + thin
Medium: Low search potential + thin (improve once higher-priority pages are done)
Monitor: Adequate content quality — watch for performance decline
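
A sketch of the tier assignment. The source gives no numeric cutoffs for high, medium, or low search potential, so the impression thresholds below are placeholders to be replaced with real values.

```python
def priority_tier(
    is_thin: bool,
    impressions: int,
    ai_cited: bool,
    high_impressions: int = 1000,   # hypothetical cutoff for "high"
    medium_impressions: int = 100,  # hypothetical cutoff for "medium"
) -> str:
    """Map thinness and performance signals to one of the four tiers."""
    if not is_thin:
        return "Monitor"
    if impressions >= high_impressions or ai_cited:
        return "Critical"
    if impressions >= medium_impressions:
        return "High"
    return "Medium"

tiers = [
    priority_tier(True, 5000, False),
    priority_tier(True, 10, True),
    priority_tier(True, 250, False),
    priority_tier(True, 10, False),
    priority_tier(False, 5000, True),
]
```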

Step 4: Category summary

Group pages by category and subcategory. For each group produce:

Average thinness score
Similarity index (0–1, where 1 = all pages identical)
Count of pages needing remediation
Recommended intervention type

This feeds directly into zbeam-category-enricher — categories with the most pages needing remediation and the highest search potential go first.
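
A sketch of the grouping step, covering the average thinness score and the remediation count. The per-page keys (`category`, `subcategory`, `thinness`, `needsRemediation`) are assumed names. The similarity index and intervention type are left out because the source does not define how they are computed.

```python
from collections import defaultdict
from statistics import mean

def summarize(pages: list) -> list:
    """Group pages by (category, subcategory) and order groups by remediation count."""
    groups = defaultdict(list)
    for page in pages:
        groups[(page["category"], page["subcategory"])].append(page)
    summary = []
    for (category, subcategory), members in groups.items():
        summary.append({
            "category": category,
            "subcategory": subcategory,
            "pageCount": len(members),
            "averageThinness": round(mean(p["thinness"] for p in members), 2),
            "needingRemediation": sum(1 for p in members if p["needsRemediation"]),
        })
    return sorted(summary, key=lambda group: -group["needingRemediation"])

pages = [
    {"slug": "aluminum", "category": "metal", "subcategory": "non-ferrous",
     "thinness": 4, "needsRemediation": True},
    {"slug": "copper", "category": "metal", "subcategory": "non-ferrous",
     "thinness": 2, "needsRemediation": True},
    {"slug": "granite", "category": "stone", "subcategory": "igneous",
     "thinness": 1, "needsRemediation": False},
]
summary = summarize(pages)
```

Ordering by remediation count is only half of the stated priority. Search potential from Step 3 would need to be folded in before handing the list to zbeam-category-enricher.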

Output format

Read references/output-schema.md for the full JSON and markdown output schemas before producing the final audit files.