name: zbeam-citation-builder
description: "Verifies source URLs, extracts verbatim quotes, flags unverifiable citations. Run after page-updater or content-writer, before validator."
Z-Beam Citation Builder
You verify and harden every citation in a frontmatter YAML file. No placeholder data, no paraphrased quotes, no dead URLs. Every source must be fetched and confirmed before the page can go to the validator.
Input required: path to the YAML file
Output: updated YAML file + verification report at data/audit/citations-[slug]-[YYYY-MM-DD].md
⛔ Hard rules — no exceptions
- No fake or placeholder URLs. Every url field must be a real, resolvable https:// URL pointing to the actual source. URLs containing placeholder patterns (example.com, [doi], /abs/pii/ fragments that are clearly constructed, your-url-here, etc.) are forbidden; treat them as FAIL immediately without fetching.
- Never fabricate or paraphrase a quote. The claim field must contain verbatim text pulled from the source page, in quotation marks, ≤30 words.
- Never publish a citation with unverified: true. Flag it and stop; the page cannot go to the validator until all citations are resolved.
- Never use a URL that returns anything other than HTTP 200. Find the DOI.org equivalent or a stable alternative.
- If a citation cannot be grounded in a real, fetchable URL: remove the claim from the page rather than leave a fake URL in place. A page with fewer citations is better than a page with fabricated sources.
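The format half of the quote rule (quotation marks, ≤30 words) can be checked mechanically before any fetch. The following is a minimal sketch under that assumption; the helper name `claim_format_problems` is hypothetical and not part of the skill, and it checks format only, not whether the text is actually verbatim.

```python
MAX_WORDS = 30  # word limit stated in the hard rules

def claim_format_problems(claim):
    """Return a list of format problems; an empty list means the claim passes."""
    problems = []
    text = claim.strip()
    # The verbatim text must sit inside double quotation marks.
    if not (len(text) >= 2 and text.startswith('"') and text.endswith('"')):
        problems.append('claim is not wrapped in double quotation marks')
    inner = text.strip('"').strip()
    if not inner:
        problems.append('claim is empty')
    # Count whitespace-separated words inside the quotes.
    if len(inner.split()) > MAX_WORDS:
        problems.append(f'claim exceeds {MAX_WORDS} words')
    return problems
```

A claim that fails this check still needs the full fetch-and-compare pass; passing it says nothing about whether the source contains the text.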
Step 0: Load blocked-sources registry (adaptive — skip known-dead URLs)
import json, os
blocked_path = 'data/audit/blocked-sources.json'
blocked_urls = set()
if os.path.exists(blocked_path):
    # Context manager closes the registry file after loading
    with open(blocked_path) as f:
        blocked = json.load(f)
    blocked_urls = set(blocked.get('urls', []))
    print(f"Blocked-source registry: {len(blocked_urls)} known-dead URLs; will skip without fetching")
else:
    print("No blocked-source registry yet; all URLs will be fetched")
After the full verification pass (Step 2–3), append any newly-failed URLs to the registry:
# Run after Step 3, before generating the report
import json, os
from datetime import date
if newly_failed_urls:
    if os.path.exists(blocked_path):
        with open(blocked_path) as f:
            blocked = json.load(f)
    else:
        blocked = {'urls': [], 'lastUpdated': ''}
    added = 0
    for url in newly_failed_urls:
        if url not in blocked['urls']:
            blocked['urls'].append(url)
            added += 1
    blocked['lastUpdated'] = date.today().isoformat()
    with open(blocked_path, 'w') as f:
        json.dump(blocked, f, indent=2)
    print(f"Added {added} newly-failed URLs to blocked-source registry")
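One way the loaded registry might be applied is to partition the URL list before the fetch pass. This is a sketch only: `split_by_registry` is a hypothetical helper, and the source does not spell out how skipped URLs are reported, so treating them as failed without fetching is an assumption.

```python
def split_by_registry(urls, blocked_urls):
    """Partition URLs into (to_fetch, skipped), preserving input order."""
    to_fetch, skipped = [], []
    for url in urls:
        # Known-dead URLs are set aside instead of being fetched again.
        (skipped if url in blocked_urls else to_fetch).append(url)
    return to_fetch, skipped
```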
Step 1: Collect all citations in the file
Load the YAML. Collect every sources array entry from:
- properties.materialCharacteristics.sources
- properties.laserMaterialInteraction.sources
- faq.items[*].sources
- Any other field containing a sources array
Also collect dataCard.metrics[*] entries that have a source field — these follow the same verification process.
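Because sources arrays can appear at arbitrary depth, a recursive walk is one way to find them all. The sketch below assumes the parsed YAML is made of plain dicts and lists; `collect_sources` and its `prefix` parameter are hypothetical names. It does not cover dataCard.metrics[*].source, since the shape of that field is not given here.

```python
def collect_sources(node, prefix=''):
    """Yield (field_path, entry) for each dict inside any list stored under a 'sources' key."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f'{prefix}.{key}' if prefix else key
            if key == 'sources' and isinstance(value, list):
                for i, entry in enumerate(value):
                    if isinstance(entry, dict):
                        yield f'{child}[{i}]', entry
            else:
                # Keep descending through nested mappings and lists.
                yield from collect_sources(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from collect_sources(item, f'{prefix}[{i}]')
```

The yielded path strings follow the same notation used in the verification report tables, so they can be reused there directly.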
Application pages — root-level citations[] block:
If the YAML is under frontmatter/applications/, also collect every entry in the top-level citations: array:
import yaml
with open(path) as f:
    data = yaml.safe_load(f)
urls_to_check = []  # queue for Steps 2-3; initialize here only if earlier collection has not already created it
placeholder_patterns = ['example.com', '[doi]', 'your-url', 'placeholder', 'insert-url']
app_citations = data.get('citations') or []
if app_citations:
    print(f'Application citations[]: {len(app_citations)} entries')
    for i, c in enumerate(app_citations):
        url = c.get('url') or ''
        short = c.get('shortName', f'[{i}]')
        claim = c.get('claim', '')
        # Reject non-https and placeholder URLs without fetching
        is_placeholder = not url.startswith('https://') or any(p in url.lower() for p in placeholder_patterns)
        if is_placeholder:
            print(f'FAIL: citations[{i}] ({short}): fake or placeholder URL: {url[:60]}; remove citation or find real source')
        else:
            urls_to_check.append({'path': f'citations[{i}]', 'shortName': short, 'url': url, 'claim': claim})
elif path.startswith('frontmatter/applications/'):
    print('FAIL: application page has no citations[] block; Dim 3 will fail')
    print('Action: run zbeam-page-updater Step 2h or add citations manually from diff brief')
Application citations[] entries follow the same URL-fetch verification as material sources entries (Steps 2–3). claim must match verbatim text found on the source page, as the hard rules require.
Inline citation audit — pageDescription (top-level):
pageDescription carries inline (Author, Year) references but has no sources block of its own. After collecting the structured sources above, scan the pageDescription text for inline citation patterns:
import re, yaml
with open(path) as f:
    data = yaml.safe_load(f)
page_desc = data.get('pageDescription') or ''
# Match parenthetical citations: (Author et al., 2023) or (JMRT, 2026) etc.
inline_refs = re.findall(r'\(([^)]{3,80}?(?:19|20)\d{2}[^)]*)\)', page_desc)
# Build index of known shortNames from structured sources
known = {}
for src_block in [data.get('properties', {}).get('materialCharacteristics', {}).get('sources', []),
                  data.get('properties', {}).get('laserMaterialInteraction', {}).get('sources', [])]:
    for s in src_block or []:
        sn = s.get('shortName', '')
        cit = s.get('citation', '')
        if sn.strip(): known[sn.strip().lower()] = s
        # Extract year+author fragment for fuzzy match
        m = re.search(r'(\w+)[,\s]+(\d{4})', cit)
        if m: known[f"{m.group(1).lower()} {m.group(2)}"] = s
for ref in inline_refs:
    # Token-wise match, so the key "smith 2023" also anchors "(Smith et al., 2023)"
    matched = any(all(tok in ref.lower() for tok in k.split()) for k in known)
    if not matched:
        print(f'UNANCHORED in pageDescription: ({ref}): no matching sources entry found')
For each unanchored inline reference in pageDescription, flag it in the verification report under "pageDescription citation gaps". These must either be resolved by adding a matching entry to the nearest relevant sources block, or the inline reference must be removed.
Build a deduplicated list of unique URLs to check.
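The deduplication can be sketched as a grouping by URL, so each URL is fetched once while every citing entry is still updated afterwards. This assumes entries shaped like the urls_to_check dicts above; `dedupe_by_url` is a hypothetical name, and URLs are compared as exact strings with no normalization.

```python
def dedupe_by_url(entries):
    """Group citation entries by URL.

    Returns a dict mapping url -> list of entries citing it, in first-seen order.
    """
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry['url'], []).append(entry)
    return grouped
```

Iterating over the keys of the result gives the unique fetch list; the values say which fields need writing back for each URL.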
Steps 2–3: Verify and extract in a single fetch pass
Do not fetch each URL twice. Combine status check and quote extraction into one loop — fetch once, do both operations on the returned content.
import re
def verify_and_extract(url, claim):
    """Fetch URL once. Return (status, verbatim_quote_or_None, working_url)."""
    try:
        # Use mcp__workspace__web_fetch; it returns content or an error
        result = fetch(url)  # web_fetch call
        if not result or 'error' in result.lower()[:50]:
            return 'ERROR', None, url
        # Status is 200 if content returned; extract quote in same pass
        # Numeric key terms only (e.g. "3.34" or "3.34–3.76"), so a bare "." never matches
        key_terms = re.findall(r'\d+(?:\.\d+)?(?:\s*[–-]\s*\d+(?:\.\d+)?)?', claim)
        sentences = re.split(r'(?<=[.!?])\s+', result)
        for term in key_terms:
            matching = [s for s in sentences if term.replace('–','-') in s.replace('–','-')]
            if matching:
                # Keep at most the first 30 words of the matching sentence
                quote = ' '.join(matching[0].split()[:30])
                return '200', quote, url
        # Content fetched but quote not found (JS-rendered or claim mismatch)
        return '200-no-quote', None, url
    except Exception:
        return 'ERROR', None, url
For each unique URL in the citations list:
- Call verify_and_extract(url, claim)
- If status ERROR → try DOI redirect or PubMed mirror, then call again with working URL
- If status 200 with quote → set verified: true, update claim to verbatim quote
- If status 200-no-quote → mark unverified: true, note "content fetched but claim text not locatable; manual verification required"
- If no working URL found → mark unverified: true
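The status handling in the list above could be expressed as one small mapping function. This is a sketch, not the skill's actual code: `apply_result` is a hypothetical helper, and storing the manual-verification message under a `note` key is an assumption, since the source does not name the field.

```python
def apply_result(entry, status, quote, working_url):
    """Return a copy of a citation entry updated according to the fetch status."""
    updated = dict(entry)
    updated['url'] = working_url
    if status == '200' and quote:
        # Quote found: store it verbatim inside double quotes and mark verified.
        updated['claim'] = f'"{quote}"'
        updated['verified'] = True
        updated.pop('unverified', None)
    else:
        updated['unverified'] = True
        updated.pop('verified', None)
        if status == '200-no-quote':
            # 'note' is a hypothetical field name for the manual-check message.
            updated['note'] = 'content fetched but claim text not locatable; manual verification required'
    return updated
```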
If URL fails or returns non-200:
- Try DOI: if URL contains a DOI pattern, fetch https://doi.org/[doi]
- Try PubMed/ResearchGate/MDPI mirror search
- If working URL found: update url field in YAML and retry
- If not: mark unverified: true
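The DOI fallback can be sketched with a regular expression for the common DOI shape. The pattern and the `doi_fallback_url` helper are assumptions for illustration. Publisher URLs that append path segments after the DOI (for example a trailing /full or /pdf) would be captured as part of the suffix, so the result still needs to be fetched and confirmed like any other URL.

```python
import re

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s?#"<>]+')

def doi_fallback_url(url):
    """Return the https://doi.org/ equivalent if the URL embeds a DOI, else None."""
    match = DOI_RE.search(url)
    if not match:
        return None
    # Drop trailing punctuation that is unlikely to belong to the DOI itself.
    doi = match.group(0).rstrip('.,;)')
    return f'https://doi.org/{doi}'
```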
Quote format (YAML-safe):
claim: '"complete removal threshold range was determined to be 3.34–3.76 J/cm²"'
Outer single quotes wrap inner double-quoted verbatim text.
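A small formatter can produce this scalar and handle apostrophes, which YAML's single-quoted style escapes by doubling. This is a sketch assuming single-line quotes; `yaml_claim_scalar` is a hypothetical helper.

```python
def yaml_claim_scalar(quote):
    """Wrap a verbatim quote as a single-quoted YAML scalar holding a double-quoted string."""
    inner = f'"{quote}"'
    # In YAML single-quoted style, the only escape is a doubled single quote.
    return "'" + inner.replace("'", "''") + "'"
```

For a quote such as `the sample's surface`, the apostrophe is doubled in the file and reads back as a single apostrophe when the YAML is parsed.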
Step 4: Update the YAML
For each citation, write back:
- url: verified working URL (updated if original was broken)
- claim: verbatim quoted text from source, ≤30 words, in double quotes
- verified: true if URL returned 200 and quote was extracted
- unverified: true if either check failed
Use surgical string replacement — never yaml.dump().
Validate after every write:
python3 -c "import yaml; yaml.safe_load(open('[path]'))" && echo "valid"
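Surgical replacement matters because re-serializing the whole document with yaml.dump() drops comments and can change quoting and key order. One cautious way to do the replacement is to refuse any edit that is not unambiguous. The sketch below assumes the file is handled as a plain string; `replace_once` is a hypothetical helper.

```python
def replace_once(text, old, new):
    """Replace exactly one occurrence of `old`; refuse if it is missing or ambiguous."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f'expected exactly 1 match for {old!r}, found {count}')
    return text.replace(old, new, 1)
```

A URL cited in several places will trip the ambiguity check. In that case the caller would need to include more surrounding context in `old`, such as the preceding line, so that each occurrence is targeted separately.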
Step 5: Generate verification report
Save to: data/audit/citations-[slug]-[YYYY-MM-DD].md
# Citation Verification Report: [slug]
**Date**: [date]
**Total citations checked**: [n]
**Verified**: [n]
**Unverified**: [n] — page cannot go to validator until resolved
## Verified citations
| Field | URL status | Quote extracted |
|---|---|---|
| properties.laserMaterialInteraction.sources[0] | 200 ✅ | "...verbatim quote..." |
## Unverified citations — BLOCKING
| Field | URL | Issue | Resolution required |
|---|---|---|---|
| faq.items[2].sources[0] | 404 ❌ | DOI redirect failed | Find working mirror or remove citation |
## URL corrections made
[List any URLs that were updated to DOI equivalents or mirrors]
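The report path and table rows from Step 5 can be built as follows. This is a sketch with hypothetical helper names (`report_path`, `md_row`). Escaping pipe characters is an assumption about what the Markdown renderer needs, since an extracted quote containing `|` would otherwise split a table cell.

```python
from datetime import date

def report_path(slug, day=None):
    """Build data/audit/citations-[slug]-[YYYY-MM-DD].md for the given day (default: today)."""
    day = day or date.today()
    return f'data/audit/citations-{slug}-{day.isoformat()}.md'

def md_row(*cells):
    """Format one Markdown table row, escaping pipes and flattening newlines."""
    cleaned = [str(c).replace('|', '\\|').replace('\n', ' ') for c in cells]
    return '| ' + ' | '.join(cleaned) + ' |'
```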
What this skill does not do
- Does not write new citations — only verifies and hardens existing ones
- Does not remove citations that fail; marks them unverified: true for human resolution
- Does not pass a page to the validator if any citation is unverified
- Does not accept paraphrased or summarized quotes — verbatim only
Step 5b: Write researcher fallback request
Read references/citation-protocol.md. Run this step when any citations remain unverified: true after Step 4 — it writes a targeted re-sourcing request for the content-researcher and includes the researcher pickup check code.