zbeam-citation-builder - SKILL.md Agent Skill

name: zbeam-citation-builder
description: "Verifies source URLs, extracts verbatim quotes, flags unverifiable citations. Run after page-updater or content-writer, before validator."

Z-Beam Citation Builder

You verify and harden every citation in a frontmatter YAML file. No placeholder data, no paraphrased quotes, no dead URLs. Every source must be fetched and confirmed before the page can go to the validator.

Input required: path to the YAML file
Output: updated YAML file + verification report at data/audit/citations-[slug]-[YYYY-MM-DD].md

⛔ Hard rules — no exceptions

No fake or placeholder URLs. Every url field must be a real, resolvable https:// URL pointing to the actual source. URLs containing placeholder patterns (example.com, [doi], /abs/pii/ fragments that are clearly constructed, your-url-here, etc.) are forbidden — treat them as FAIL immediately without fetching.
Never fabricate or paraphrase a quote. The claim field must contain verbatim text pulled from the source page, in quotation marks, ≤30 words.
Never publish a citation with unverified: true. Flag it and stop — the page cannot go to the validator until all citations are resolved.
Never use a URL that returns anything other than HTTP 200. Find the DOI.org equivalent or a stable alternative.
If a citation cannot be grounded in a real, fetchable URL: remove the claim from the page rather than leave a fake URL in place. A page with fewer citations is better than a page with fabricated sources.
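
The shape of a claim (double-quoted, 30 words or fewer) can be checked mechanically before any fetch. The sketch below is one possible pre-check; `claim_format_ok` is a hypothetical helper name, not part of this skill, and it tests format only. Whether the text is actually verbatim still has to be confirmed against the fetched source.

```python
def claim_format_ok(claim: str, max_words: int = 30) -> bool:
    """Return True if claim is wrapped in double quotes and has 1..max_words words.

    Hypothetical helper: checks format only, not that the text is verbatim.
    """
    text = claim.strip()
    if len(text) < 3 or not (text.startswith('"') and text.endswith('"')):
        return False
    inner = text[1:-1].strip()
    return 0 < len(inner.split()) <= max_words
```

A claim that fails this check can be flagged without spending a fetch on it.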

Step 0: Load blocked-sources registry (adaptive — skip known-dead URLs)

import json, os

blocked_path = 'data/audit/blocked-sources.json'
blocked_urls = set()
if os.path.exists(blocked_path):
    # Use a context manager so the file handle is closed promptly
    with open(blocked_path) as f:
        blocked = json.load(f)
    blocked_urls = set(blocked.get('urls', []))
    print(f"Blocked-source registry: {len(blocked_urls)} known-dead URLs — will skip without fetching")
else:
    print("No blocked-source registry yet — all URLs will be fetched")

After the full verification pass (Steps 2–3), append any newly failed URLs to the registry:

# Run after Step 3, before generating the report
import json, os
from datetime import date  # required for date.today() below

if newly_failed_urls:
    # Load the existing registry, or start an empty one
    if os.path.exists(blocked_path):
        with open(blocked_path) as f:
            blocked = json.load(f)
    else:
        blocked = {'urls': [], 'lastUpdated': ''}
    added = 0
    for url in newly_failed_urls:
        if url not in blocked['urls']:
            blocked['urls'].append(url)
            added += 1
    blocked['lastUpdated'] = date.today().isoformat()
    with open(blocked_path, 'w') as f:
        json.dump(blocked, f, indent=2)
    print(f"Added {added} newly-failed URLs to blocked-source registry")
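
Step 0 loads the registry but does not show the skip itself. A minimal sketch of that filter follows, assuming the URL list is a plain list of strings; `split_blocked` is a hypothetical name introduced here for illustration.

```python
def split_blocked(urls, blocked_urls):
    """Split urls into (to_fetch, skipped), preserving input order.

    Hypothetical helper: `blocked_urls` is the set loaded in Step 0.
    """
    to_fetch, skipped = [], []
    for url in urls:
        (skipped if url in blocked_urls else to_fetch).append(url)
    return to_fetch, skipped
```

Skipped URLs would presumably still be reported as failures, since a blocked URL is a known-dead source rather than a verified one.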

Step 1: Collect all citations in the file

Load the YAML. Collect every sources array entry from:

properties.materialCharacteristics.sources
properties.laserMaterialInteraction.sources
faq.items[*].sources
Any other field containing a sources array

Also collect dataCard.metrics[*] entries that have a source field — these follow the same verification process.
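
The collection rules above can be sketched as a recursive walk over the loaded YAML. This is an illustration under assumptions, not the skill's own implementation: `collect_sources` and `collect_metric_sources` are hypothetical names, and the shape of a metric's `source` value is not specified here, so it is returned as-is.

```python
def collect_sources(node, path=''):
    """Recursively yield (dotted_path, entry) for every item in any 'sources' list."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f'{path}.{key}' if path else key
            if key == 'sources' and isinstance(value, list):
                for i, entry in enumerate(value):
                    yield f'{child}[{i}]', entry
            else:
                yield from collect_sources(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from collect_sources(item, f'{path}[{i}]')


def collect_metric_sources(data):
    """Return (path, source) for each dataCard.metrics entry that has a 'source' field."""
    metrics = (data.get('dataCard') or {}).get('metrics') or []
    return [(f'dataCard.metrics[{i}]', m['source'])
            for i, m in enumerate(metrics)
            if isinstance(m, dict) and m.get('source')]
```

Walking the whole tree covers the "any other field containing a sources array" case without listing each location by hand.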

Application pages — root-level citations[] block:

If the YAML is under frontmatter/applications/, also collect every entry in the top-level citations: array:

import yaml

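# `path` is the input YAML path passed to this skill
with open(path) as f:
    data = yaml.safe_load(f)
app_citations = data.get('citations') or []  # tolerate a missing or null block
urls_to_check = []  # citations queued for the Steps 2–3 fetch pass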

if app_citations:
    print(f'Application citations[]: {len(app_citations)} entries')
    for i, c in enumerate(app_citations):
        url = c.get('url', '')
        short = c.get('shortName', f'[{i}]')
        claim = c.get('claim', '')
        _placeholder_patterns = ['example.com', '[doi]', 'your-url', 'placeholder', 'insert-url']
        _is_placeholder = not url.startswith('https://') or any(p in url.lower() for p in _placeholder_patterns)
        if _is_placeholder:
            print(f'FAIL: citations[{i}] ({short}): fake or placeholder URL — {url[:60]} — remove citation or find real source')
        else:
            urls_to_check.append({'path': f'citations[{i}]', 'shortName': short, 'url': url, 'claim': claim})
elif path.startswith('frontmatter/applications/'):
    print('FAIL: application page has no citations[] block — Dim 3 will fail')
    print('Action: run zbeam-page-updater Step 2h or add citations manually from diff brief')

Application citations[] entries follow the same URL-fetch verification as material sources entries (Steps 2–3). claim must match verbatim text found on the source page, per the hard rules above.

Inline citation audit — pageDescription (top-level):

pageDescription carries inline (Author, Year) references but has no sources block of its own. After collecting the structured sources above, scan the pageDescription text for inline citation patterns:

import re, yaml

data = yaml.safe_load(open(path))
page_desc = data.get('pageDescription', '')

# Match parenthetical citations: (Author et al., 2023) or (JMRT, 2026) etc.
inline_refs = re.findall(r'\(([^)]{3,80}?(?:19|20)\d{2}[^)]*)\)', page_desc)

# Build index of known shortNames and (author, year) pairs from structured sources
props = data.get('properties', {})
known = set()         # lowercase shortNames
author_years = set()  # (lowercase author, year) pairs for fuzzy matching
for src_block in [props.get('materialCharacteristics', {}).get('sources', []),
                  props.get('laserMaterialInteraction', {}).get('sources', [])]:
    for s in src_block:
        sn = s.get('shortName', '')
        cit = s.get('citation', '')
        if sn: known.add(sn.lower())
        # Extract author+year fragment for fuzzy match
        m = re.search(r'(\w+)[,\s]+(\d{4})', cit)
        if m: author_years.add((m.group(1).lower(), m.group(2)))

for ref in inline_refs:
    low = ref.lower()
    # "author 2023" is never a contiguous substring of "author et al., 2023",
    # so test the author and the year separately
    matched = (any(k in low for k in known)
               or any(a in low and y in low for a, y in author_years))
    if not matched:
        print(f'UNANCHORED in pageDescription: ({ref}) — no matching sources entry found')

For each unanchored inline reference in pageDescription, flag it in the verification report under "pageDescription citation gaps". These must either be resolved by adding a matching entry to the nearest relevant sources block, or the inline reference must be removed.

Build a deduplicated list of unique URLs to check.
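
One way to build that list is to group citation entries by URL, so each URL is fetched once and the result is applied to every entry that shares it. The sketch assumes entries shaped like the `urls_to_check` dicts above (`path`, `shortName`, `url`, `claim`); `unique_urls` is a hypothetical name.

```python
def unique_urls(citations):
    """Group citation entries by URL, preserving first-seen order.

    Returns {url: [entry, ...]}. Entries with an empty or missing URL are
    left out; they fail the placeholder check before this point.
    """
    by_url = {}
    for c in citations:
        url = (c.get('url') or '').strip()
        if url:
            by_url.setdefault(url, []).append(c)
    return by_url
```

Iterating over the keys of the returned dict gives the deduplicated fetch order.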

Steps 2–3: Verify and extract in a single fetch pass

Do not fetch each URL twice. Combine status check and quote extraction into one loop — fetch once, do both operations on the returned content.

import re

def verify_and_extract(url, claim):
    """Fetch URL once. Return (status, verbatim_quote_or_None, working_url)."""
    try:
        # Use mcp__workspace__web_fetch — returns content or error
        result = fetch(url)  # web_fetch call
        if not result or 'error' in result.lower()[:50]:
            return 'ERROR', None, url
        
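        # Status is 200 if content returned — extract quote in same pass.
        # Key terms must contain a digit, so a stray "." in the claim is not treated as a term.
        key_terms = re.findall(r'\d+(?:\.\d+)?(?:\s*[–-]\s*\d+(?:\.\d+)?)?', claim)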
        sentences = re.split(r'(?<=[.!?])\s+', result)
        for term in key_terms:
            matching = [s for s in sentences if term.replace('–','-') in s.replace('–','-')]
            if matching:
                quote = ' '.join(matching[0].split()[:30])
                return '200', quote, url
        
        # Content fetched but quote not found (JS-rendered or claim mismatch)
        return '200-no-quote', None, url
    except Exception:
        return 'ERROR', None, url

For each unique URL in the citations list:

Call verify_and_extract(url, claim)
If status ERROR → try DOI redirect or PubMed mirror, then call again with working URL
If status 200 with quote → set verified: true, update claim to verbatim quote
If status 200-no-quote → mark unverified: true, note "content fetched but claim text not locatable — manual verification required"
If no working URL found → mark unverified: true
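
The status handling above can be sketched as a small pure function that maps a `verify_and_extract` result onto citation fields. This is an illustration only: `apply_result` is a hypothetical name, and the `note` key used to carry the manual-verification message is an assumption, since the field that holds that note is not named here.

```python
def apply_result(citation, status, quote, working_url):
    """Return a copy of citation updated from a (status, quote, url) result."""
    updated = dict(citation)
    updated['url'] = working_url
    if status == '200' and quote:
        # URL resolved and a verbatim sentence was located
        updated['verified'] = True
        updated.pop('unverified', None)
        updated['claim'] = f'"{quote}"'
    else:
        # Any other outcome blocks the page until resolved
        updated['unverified'] = True
        updated.pop('verified', None)
        if status == '200-no-quote':
            # 'note' is a hypothetical field name
            updated['note'] = ('content fetched but claim text not locatable; '
                               'manual verification required')
    return updated
```

Returning a copy keeps the original entry available for the report's "URL corrections made" section.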

If URL fails or returns non-200:

Try DOI: if URL contains a DOI pattern, fetch https://doi.org/[doi]
Try PubMed/ResearchGate/MDPI mirror search
If working URL found: update url field in YAML and retry
If not: mark unverified: true
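
The DOI fallback can be sketched as follows. The regular expression is a common heuristic for modern DOIs, not a complete grammar, and `doi_fallback_url` is a hypothetical helper name. A publisher URL that appends extra path segments after the DOI would yield an over-long match, so the result still has to be fetched and confirmed like any other URL.

```python
import re
from urllib.parse import unquote

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix up to
# whitespace, "?" or "#". An assumption for illustration, not a full DOI grammar.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s?#]+')


def doi_fallback_url(url):
    """Return https://doi.org/<doi> if url appears to embed a DOI, else None."""
    m = DOI_RE.search(unquote(url))  # unquote handles %2F-encoded slashes
    if not m:
        return None
    return 'https://doi.org/' + m.group(0).rstrip('/.')
```

If the function returns None, the mirror search is the remaining option before marking the citation unverified: true.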

Quote format (YAML-safe):

claim: '"complete removal threshold range was determined to be 3.34–3.76 J/cm²"'

Outer single quotes wrap inner double-quoted verbatim text.
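
A minimal sketch of producing that line programmatically, assuming the quote has already been extracted; `yaml_claim_line` is a hypothetical name. In YAML's single-quoted style the only character that needs escaping is the single quote itself, which is written doubled.

```python
def yaml_claim_line(quote, indent=0):
    """Format a verbatim quote as: claim: '"..."' (YAML single-quoted scalar).

    Any surrounding double quotes on the input are stripped before re-wrapping,
    so a source sentence that itself begins or ends with a double quote would
    need separate handling.
    """
    inner = '"' + quote.strip().strip('"') + '"'
    escaped = inner.replace("'", "''")  # YAML escapes ' inside '...' by doubling
    return ' ' * indent + "claim: '" + escaped + "'"
```

Doubling matters for quotes containing apostrophes or possessives, which would otherwise terminate the scalar early and break the file.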

Step 4: Update the YAML

For each citation, write back:

url: verified working URL (updated if original was broken)
claim: verbatim quoted text from source, ≤30 words, in double quotes
verified: true if URL returned 200 and quote was extracted
unverified: true if either check failed

Use surgical string replacement — never yaml.dump().
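
Re-serializing with yaml.dump() discards comments and can change key order and quoting, which is presumably why it is ruled out. A minimal sketch of the string-level alternative for a URL correction follows; `replace_url_in_text` is a hypothetical name, and it assumes the old URL is not a prefix of some other URL in the same file.

```python
def replace_url_in_text(yaml_text, old_url, new_url):
    """Swap old_url for new_url in raw YAML text without re-serializing.

    Returns (new_text, occurrences_replaced). Raises ValueError if old_url
    is absent, so a silent no-op cannot pass as a successful write.
    """
    count = yaml_text.count(old_url)
    if count == 0:
        raise ValueError(f'URL not found in file: {old_url}')
    return yaml_text.replace(old_url, new_url), count
```

Everything outside the matched URL, including comments and blank lines, is left byte-for-byte unchanged.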

Validate after every write:

python3 -c "import yaml; yaml.safe_load(open('[path]'))" && echo "valid"

Step 5: Generate verification report

Save to: data/audit/citations-[slug]-[YYYY-MM-DD].md

# Citation Verification Report: [slug]
**Date**: [date]
**Total citations checked**: [n]
**Verified**: [n]
**Unverified**: [n] — page cannot go to validator until resolved

## Verified citations

| Field | URL status | Quote extracted |
|---|---|---|
| properties.laserMaterialInteraction.sources[0] | 200 ✅ | "...verbatim quote..." |

## Unverified citations — BLOCKING

| Field | URL status | Issue | Resolution required |
|---|---|---|---|
| faq.items[2].sources[0] | 404 ❌ | DOI redirect failed | Find working mirror or remove citation |

## URL corrections made
[List any URLs that were updated to DOI equivalents or mirrors]
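
The save path given at the start of Step 5 can be derived with a few lines. The sketch assumes the slug is the YAML filename without its extension, which is not stated here; `report_path` is a hypothetical name.

```python
from datetime import date
from pathlib import Path


def report_path(yaml_path, day=None):
    """Build data/audit/citations-[slug]-[YYYY-MM-DD].md for a YAML file.

    Assumption: slug is the YAML file's stem (filename without extension).
    """
    slug = Path(yaml_path).stem
    day = day or date.today()
    return f'data/audit/citations-{slug}-{day.isoformat()}.md'
```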

What this skill does not do

Does not write new citations — only verifies and hardens existing ones
Does not remove citations that fail — marks them unverified: true for human resolution
Does not pass a page to the validator if any citation is unverified
Does not accept paraphrased or summarized quotes — verbatim only

Step 5b: Write researcher fallback request

Read references/citation-protocol.md. Run this step when any citations remain unverified: true after Step 4 — it writes a targeted re-sourcing request for the content-researcher and includes the researcher pickup check code.