name: zbeam-content-auditor
description: "Scans all frontmatter pages for thinness and similarity, producing a prioritized remediation map. Run before any batch improvement work."
Z-Beam Content Auditor
Manual-only skill — run before any batch improvement work, or when pipeline-auditor flags a category as needing remediation. Not automated. Trigger: "audit content quality", "scan for thin pages", "what needs improvement", or before running zbeam-category-enricher. Output feeds: zbeam-category-enricher → zbeam-page-updater.
You are scanning all existing frontmatter pages to understand the actual state of content quality — not theoretically, but from the files themselves. The findings feed the category enricher and pipeline auditor.
Output: data/audit/content-audit-[YYYY-MM-DD].json and
data/audit/content-audit-[YYYY-MM-DD].md
Important: two distinct page schemas
Material pages and application pages have fundamentally different structures. Score them separately — do not apply application-page standards to material pages.
Material page schema
Key fields: dataCard.machineSettings, dataCard.metrics, properties.materialCharacteristics,
properties.laserMaterialInteraction, pageDescription (top-level),
faq.items, relationships
Application page schema
Key fields: content.introduction, content.observations, content.context,
content.processNotes, content.limitations, pageDescription (top-level),
enhancements.summary, faq.items
Step 1: Collect all pages
Incremental mode — skip pages that haven't changed since the last audit:
# Find the most recent audit file (ISO dates in the filename sort chronologically)
LAST_AUDIT=$(ls data/audit/content-audit-*.json 2>/dev/null | sort | tail -1)
if [ -n "$LAST_AUDIT" ]; then
  echo "Last audit: $LAST_AUDIT (only re-scoring modified files)"
  # Files modified since the last audit file was written
  find frontmatter/materials frontmatter/applications frontmatter/locations \
    -name "*.yaml" -newer "$LAST_AUDIT" 2>/dev/null
else
  echo "No prior audit found, running full scan"
  ls frontmatter/materials/*.yaml frontmatter/applications/*.yaml frontmatter/locations/*.yaml 2>/dev/null
fi
For a full forced rescan (e.g., after schema changes): pass the --full flag or delete the last audit file.
Load the previous audit JSON to carry forward scores for unchanged pages — merge new scores with existing entries rather than rebuilding from scratch.
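The carry-forward merge can be sketched as follows. This is a minimal illustration that assumes both the previous audit and the fresh scores are mappings keyed by page slug; that layout and the name `merge_audit` are hypothetical, and the real schema lives in references/output-schema.md.

```python
def merge_audit(previous: dict, rescored: dict) -> dict:
    """Return previous entries overlaid with freshly re-scored ones.

    Both arguments are assumed (hypothetically) to map page slug -> score entry.
    """
    merged = dict(previous)   # carry forward every unchanged page
    merged.update(rescored)   # re-scored pages replace their old entries
    return merged
```

Pages deleted from disk since the last audit would survive this merge, so a real script would probably also drop slugs whose YAML file no longer exists.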
Token-efficient scoring: Use skills/shared/yaml-select.py to extract only the fields needed for scoring rather than loading each full YAML (400–600 lines). This cuts token cost ~80% when auditing all 153+ pages.
# Extract only the fields needed for thinness scoring — no full YAML load
python3 skills/shared/yaml-select.py frontmatter/materials/[slug].yaml \
pageDescription faq.items dataCard.metrics properties.laserMaterialInteraction.description \
--compact
Write a Python script to load the changed files and extract scoring data. Run it via bash.
Step 2: Score each page
For material pages — thinness score
dataCard attribution (most critical issue):
Check whether dataCard.metrics values have a source field. Current pages have
confidence: high but no source URL or citation. Each unattributed metric value
counts as a thinness flag.
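A sketch of the attribution check, assuming `dataCard.metrics` is a mapping of metric name to a dict that may carry a `source` key. The exact shape of each metric entry is an assumption here, as is the function name.

```python
def unattributed_metrics(page: dict) -> list[str]:
    """List metric keys whose entry has no non-empty `source` field."""
    metrics = (page.get("dataCard") or {}).get("metrics") or {}
    flagged = []
    for key, entry in metrics.items():
        # A bare scalar cannot carry a source, so it counts as unattributed too.
        if not isinstance(entry, dict) or not entry.get("source"):
            flagged.append(key)
    return flagged
```

Each returned key would count as one thinness flag for the page.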
dataCard vs properties conflict detection:
Check for the same metric key in BOTH dataCard.metrics AND
properties.laserMaterialInteraction. When both exist, compare values — mismatches
indicate one is unattributed generic fill. Flag as a conflict with both values listed.
Common conflict fields: ablationThreshold, laserDamageThreshold, thermalDiffusivity.
The properties.laserMaterialInteraction value is typically better sourced and should
be treated as authoritative when it has a citation.
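The conflict check might look like the sketch below. It assumes metric entries are either bare scalars or dicts with a `value` key, and that a citation is stored under a `source` key; both are assumptions about the schema, not confirmed field names.

```python
def _value(entry):
    """Return the value whether the entry is a dict or a bare scalar."""
    return entry.get("value") if isinstance(entry, dict) else entry

def find_conflicts(page: dict) -> list[dict]:
    """Compare keys present in both dataCard.metrics and laserMaterialInteraction."""
    metrics = (page.get("dataCard") or {}).get("metrics") or {}
    lmi = (page.get("properties") or {}).get("laserMaterialInteraction") or {}
    conflicts = []
    for key in sorted(set(metrics) & set(lmi)):
        card_val, prop_val = _value(metrics[key]), _value(lmi[key])
        if card_val is None or prop_val is None or card_val == prop_val:
            continue
        prop_entry = lmi[key]
        # Assumption: a non-empty `source` on the properties entry means it is cited.
        cited = isinstance(prop_entry, dict) and bool(prop_entry.get("source"))
        conflicts.append({
            "field": key,
            "dataCard": card_val,
            "properties": prop_val,
            "authoritative": "properties" if cited else None,
        })
    return conflicts
```

The comparison is exact, so values recorded in different units would be reported as conflicts and need a manual look.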
Parameter range plausibility:
Check dataCard.machineSettings for implausibly wide ranges that suggest generic
fill rather than material-specific research. A min: 0.1, max: 20 energy density
range applies to everything — it is not material-specific data.
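One way to approximate "implausibly wide" is a max/min ratio test. The sketch below assumes each machine setting is a dict with numeric `min` and `max` keys, and the default `max_ratio` of 50 is an arbitrary illustrative threshold, not a value taken from this skill.

```python
def wide_ranges(page: dict, max_ratio: float = 50.0) -> list[str]:
    """Return machineSettings keys whose max/min ratio exceeds max_ratio."""
    settings = (page.get("dataCard") or {}).get("machineSettings") or {}
    flagged = []
    for name, spec in settings.items():
        if not isinstance(spec, dict):
            continue
        lo, hi = spec.get("min"), spec.get("max")
        numeric = isinstance(lo, (int, float)) and isinstance(hi, (int, float))
        # Skip zero or negative minimums, where a ratio is meaningless.
        if numeric and lo > 0 and hi / lo > max_ratio:
            flagged.append(name)
    return flagged
```

With this threshold the 0.1 to 20 energy density example (ratio 200) is flagged, while a narrow, researched range passes.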
sectionDescription vs description duplication:
Check whether properties.laserMaterialInteraction.section.sectionDescription and
properties.laserMaterialInteraction.description say essentially the same thing.
The component renders BOTH — section.sectionDescription as section subtitle and
description as a prose paragraph above PropertyBars. If both cover the same topic,
users read the same content twice. Flag when first 8 words overlap significantly.
The fix: sectionDescription should be neutral framing; description carries the insight.
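The opening-words comparison can be sketched as a set overlap on the first eight tokens. The tokenization and the overlap measure are illustrative choices; the skill does not define "significantly", so any cutoff (for example 0.6) would be an assumption to tune.

```python
import re

def first_words(text: str, n: int = 8) -> list[str]:
    """Lowercased first n word tokens of the text."""
    return re.findall(r"[a-z0-9']+", (text or "").lower())[:n]

def opening_overlap(a: str, b: str, n: int = 8) -> float:
    """Share of opening words common to both texts, from 0.0 to 1.0."""
    wa, wb = set(first_words(a, n)), set(first_words(b, n))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))
```

A page would be flagged when `opening_overlap(sectionDescription, description)` is at or above the chosen cutoff.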
Missing structured properties with sibling available:
Check whether properties.materialCharacteristics and properties.laserMaterialInteraction
have ONLY section, label, description, percentage (no numeric fields with
value keys). If so, check whether a more specific sibling page exists
(e.g., stainless-steel-304 for general stainless-steel). If a richer sibling
exists, flag this as a structured-properties gap — the page-updater can port
the structured fields from the sibling.
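A rough sketch of the gap check follows. It assumes structured fields are dicts containing a `value` key, and it guesses that a more specific sibling shares the general page's slug as a prefix (as in stainless-steel and stainless-steel-304); the prefix rule and both function names are assumptions.

```python
STRUCTURAL_KEYS = {"section", "label", "description", "percentage"}

def lacks_structured_fields(block: dict) -> bool:
    """True when no entry outside the descriptive keys carries a `value`."""
    return not any(
        isinstance(v, dict) and "value" in v
        for k, v in (block or {}).items()
        if k not in STRUCTURAL_KEYS
    )

def candidate_siblings(slug: str, all_slugs: list[str]) -> list[str]:
    """Slugs that extend `slug` with a more specific suffix (naming assumption)."""
    return sorted(s for s in all_slugs if s.startswith(slug + "-"))
```

A page would be flagged only when both property blocks lack structured fields and at least one candidate sibling has them.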
Properties description specificity:
Check properties.laserMaterialInteraction.description for generic language.
Flag sentences containing: "bounces back", "challenging", "precise parameter control",
"effective", "cleans components", "without damaging", "preserves material integrity"
as template language likely shared across multiple pages.
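The phrase check is a simple substring scan per sentence. The sketch below uses the phrase list given above; the sentence splitter is a naive assumption and will mis-split on abbreviations.

```python
import re

GENERIC_PHRASES = [
    "bounces back", "challenging", "precise parameter control", "effective",
    "cleans components", "without damaging", "preserves material integrity",
]

def generic_sentences(text: str) -> list[str]:
    """Return sentences that contain any of the template phrases."""
    sentences = re.split(r"(?<=[.!?])\s+", (text or "").strip())
    return [
        s for s in sentences
        if any(phrase in s.lower() for phrase in GENERIC_PHRASES)
    ]
```

Because this is substring matching, "effective" also catches "effectively" and "ineffective", which may or may not be desired.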
FAQ formula detection:
Check faq.items for questions matching the pattern "How is [material] laser cleaner
used on [material]?" — this formula appears across 142 of 153 pages and indicates
template generation, not real user questions.
Schema discovery required: before extracting FAQ text, sniff list(items[0].keys()).
Material pages use title/content; other page types may use question/answer.
Hardcoding either key silently returns empty strings on the wrong page type.
See skills/YAML_UTILS.md for the canonical multi-key fallback (get_field()).
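The key sniffing and formula match can be sketched as below. This is a local stand-in, not the `get_field()` helper from skills/YAML_UTILS.md, whose signature is not shown here; the regex is one literal reading of the formula above.

```python
import re

FORMULA = re.compile(r"^how is .+ laser cleaner used on .+\?$", re.IGNORECASE)

def faq_questions(items: list[dict]) -> list[str]:
    """Extract question text, sniffing which key this page type uses."""
    if not items:
        return []
    keys = set(items[0].keys())
    for candidate in ("title", "question"):
        if candidate in keys:
            return [str(item.get(candidate, "")) for item in items]
    # Fail loudly rather than silently returning empty strings.
    raise KeyError(f"No known question key in FAQ item; found {sorted(keys)}")

def formulaic_faqs(items: list[dict]) -> list[str]:
    """Questions that match the template formula."""
    return [q for q in faq_questions(items) if FORMULA.match(q.strip())]
```

Raising on unknown keys is a deliberate choice here so that a schema change surfaces as an error instead of a page that looks clean.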
pageDescription thinness (blocking quality flag):
Read pageDescription (top-level). Flag as critical if it is under 60 words
or contains no quantitative data (no measurements, regulatory thresholds, or sourced findings).
A thin pageDescription is a Dim 4 FAIL at validation. In the thinnessSummary, add thinPageDescription count.
Also flag any page that still has a top-level aiSummary key — this is a legacy field that
should be migrated to pageDescription (top-level) and removed.
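A minimal sketch of these checks. Treating "contains a digit" as a proxy for quantitative data is a crude assumption: it will pass text that merely mentions a year and cannot tell whether a figure is sourced. The flag names are hypothetical.

```python
import re

def page_description_flags(page: dict, min_words: int = 60) -> list[str]:
    """Return quality flags for the top-level pageDescription."""
    text = page.get("pageDescription") or ""
    flags = []
    if len(text.split()) < min_words:
        flags.append("short")
    if not re.search(r"\d", text):  # crude proxy for measurements or thresholds
        flags.append("no-quantitative-data")
    if "aiSummary" in page:         # legacy field that should be migrated
        flags.append("legacy-aiSummary")
    return flags
```

Either of the first two flags would count toward `thinPageDescription` in the thinnessSummary.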
Section word-count uniformity (machine-authorship signal):
python3 -c "
import re, statistics, sys

# Read the raw YAML text; the regex below works on the file as written
raw = open(sys.argv[1], encoding='utf-8').read()
# Capture the indented body of each multi-line sectionDescription/description value
fields = re.findall(r'(?:sectionDescription|description):\s*[>|]?-?\s*\n((?:[ \t]+.+\n?)+)', raw)
# Ignore very short values (5 words or fewer)
counts = [len(f.split()) for f in fields if len(f.split()) > 5]
if len(counts) >= 4:
    sd = statistics.stdev(counts)
    print(f'Field std dev: {sd:.1f} words across {len(counts)} fields')
    if sd < 30:
        print('⚠️ UNIFORM: machine-authorship signal')
" [path]
Flag pages with std dev < 30 words across 4+ description/sectionDescription fields as
uniformLengths: true in the remediation map entry.
For application pages — thinness score
Count sentences in all prose fields containing:
- Specific measurements with units (J/cm², nm, ns, Ra, μg/m³)
- Regulatory rule numbers (BAAQMD Rule X, Cal/OSHA Title 8 Section X)
- Standard designations (SSPC-SP X, NACE No. X, AWS D1.X)
Flag sections with zero quantitative specifics as thin.
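The sentence count above could be approximated with a handful of regular expressions. The patterns below cover only the example units, rules, and standards listed; they are illustrative and far from exhaustive, and the sentence splitter is a simple heuristic that avoids breaking after "No." so that designations like "NACE No. 2" stay intact.

```python
import re

QUANT_PATTERNS = [
    # number followed by one of the example units (both micro glyphs accepted)
    re.compile(r"\d(?:[\d.,]*\d)?\s*(?:J/cm²|nm|ns|[μµ]g/m³)(?![A-Za-z])"),
    re.compile(r"\bRa\s*[<≤=]?\s*\d"),                                  # surface roughness
    re.compile(r"BAAQMD Rule\s+\d|Cal/OSHA Title 8 Section\s+\d"),      # regulatory rules
    re.compile(r"SSPC-SP\s*\d|NACE No\.\s*\d|AWS D1\.\d"),              # standards
]

def count_quantitative_sentences(text: str) -> int:
    """Count sentences containing at least one quantitative specific."""
    sentences = re.split(r"(?<!No\.)(?<=[.!?])\s+", (text or "").strip())
    return sum(1 for s in sentences if any(p.search(s) for p in QUANT_PATTERNS))
```

A section whose prose yields a count of zero would be flagged as thin.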
Similarity score — both page types
Run this analysis per category/subcategory group:
Lexical overlap: extract the first sentence of the primary description field
(properties.laserMaterialInteraction.description for materials,
content.introduction.sectionDescription for applications). Count how many pages
in the same subcategory share the same opening 5-word pattern.
Structural uniformity: check faq.items count — 142/153 material pages have
exactly 4 items. Flag any subcategory where >80% of pages have identical FAQ counts.
Vocabulary fingerprint: extract the 10 most distinctive non-stopword terms from each page's description fields. Pages in the same subcategory sharing >5 of these 10 terms are flagged as lexically similar.
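Two of these checks, the shared opening pattern and the FAQ-count uniformity, can be sketched as below for a single subcategory. The inputs are assumed to be simple mappings from page slug to description text and to FAQ item count; both shapes and the function names are hypothetical.

```python
from collections import Counter

def shared_openings(descriptions: dict[str, str], n: int = 5) -> dict[str, list[str]]:
    """Map each opening n-word pattern used by 2+ pages to the slugs using it."""
    groups: dict[str, list[str]] = {}
    for slug, text in descriptions.items():
        key = " ".join((text or "").lower().split()[:n])
        groups.setdefault(key, []).append(slug)
    return {k: v for k, v in groups.items() if len(v) > 1}

def uniform_faq_counts(faq_counts: dict[str, int], threshold: float = 0.8) -> bool:
    """True when more than `threshold` of pages share the same FAQ item count."""
    if not faq_counts:
        return False
    most_common = Counter(faq_counts.values()).most_common(1)[0][1]
    return most_common / len(faq_counts) > threshold
```

Both functions expect the caller to pass pages from one subcategory at a time, since the comparison is only meaningful within a group.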
Step 3: Cross-reference with performance data
Load the most recent file matching data/search-console/weekly-summary-*.json (sort by date, take latest).
For each page URL, look up impressions and clicks.
Load the most recent file matching data/ai-search/weekly-*.json (sort by date, take latest) if available.
For each query, check whether the page URL appears in citation results.
This produces a priority score combining:
- Thinness score (how much improvement is possible)
- Performance potential (is anyone searching for this?)
- Citation gap (is AI search ignoring this page?)
Priority tiers:
- Critical: (High GSC impressions OR any AI tracker coverage) AND thin/generic content
- High: Medium search potential + thin
- Medium: Low search potential + thin (improve once higher-priority pages are done)
- Monitor: Adequate content quality — watch for performance decline
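The tiers above could be assigned with a small function like this one. The impression cutoffs (`high=1000`, `medium=100`) are placeholder assumptions, since the skill does not define what counts as high or medium search potential, and the Critical tier is read here as "(high impressions or AI coverage) and thin".

```python
def priority_tier(is_thin: bool, impressions: int, ai_tracked: bool,
                  high: int = 1000, medium: int = 100) -> str:
    """Assign a priority tier; the thresholds are illustrative assumptions."""
    if not is_thin:
        return "Monitor"
    if impressions >= high or ai_tracked:
        return "Critical"
    if impressions >= medium:
        return "High"
    return "Medium"
```

In practice the cutoffs would probably be derived from the distribution of impressions in the latest weekly summary rather than fixed.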
Step 4: Category summary
Group pages by category and subcategory. For each group produce:
- Average thinness score
- Similarity index (0–1, where 1 = all pages identical)
- Count of pages needing remediation
- Recommended intervention type
This feeds directly into zbeam-category-enricher — categories with the most
pages needing remediation and the highest search potential go first.
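The grouping and ordering might be sketched as follows. Each entry is assumed to carry `category`, `subcategory`, `thinness`, `tier`, and optionally `impressions`; these field names are hypothetical, and the similarity index and intervention type are left out because their formulas are not defined here.

```python
from collections import defaultdict
from statistics import mean

def summarize_groups(entries: list[dict]) -> list[dict]:
    """Aggregate per (category, subcategory), most urgent groups first."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry["category"], entry["subcategory"])].append(entry)
    rows = []
    for (category, subcategory), members in groups.items():
        rows.append({
            "category": category,
            "subcategory": subcategory,
            "avgThinness": mean(m["thinness"] for m in members),
            "needsRemediation": sum(1 for m in members if m["tier"] != "Monitor"),
            "impressions": sum(m.get("impressions", 0) for m in members),
        })
    # Most pages needing work first, then highest search potential.
    rows.sort(key=lambda r: (r["needsRemediation"], r["impressions"]), reverse=True)
    return rows
```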
Output format
Read references/output-schema.md for the full JSON and markdown output schemas before producing the final audit files.