name: wp-static-clone
description: >
  Clones a live WordPress (or other CMS-driven) site into a static HTML site deployable on any static host (Cloudflare Pages, Netlify, Vercel, S3+CloudFront, plain Apache/nginx). Use when the user wants to "scrape", "freeze", "archive", "static-ify", or "move to [host]" a WordPress site, or asks to turn a sitemap into deployable static HTML. Pulls every URL from sitemap_index.xml, fetches all assets, rewrites paths to be root-relative, strips WP runtime markup, and outputs a flat directory ready to deploy with no build command.
wp-static-clone
Turn a live WordPress site into a static HTML clone deployable on any static host. Driven by the site's XML sitemap. Handles the WordPress-specific gotchas — Cloudflare bot protection, mid-scrape link rewriting, proxied analytics, R2-offloaded uploads, comment-form runtime, Yoast attribution, Gravatar privacy — that a naïve wget run misses.
Recipes live in AGENTS.md; reusable scripts live in scripts/. This file covers the workflow, the gotchas, and the output structure.
When to use
Trigger on requests like:
- "Scrape this WordPress site for [host]"
- "Freeze [domain] as static HTML"
- "Pull all the pages from this sitemap and turn them into static files"
- "Move this WP site to [host] with no build step"
The broad shape (sitemap → wget → root-relative paths → static host) generalises to any CMS that emits a standard XML sitemap. The runtime cleanup (comment forms, Plausible proxy, Gravatar, Yoast) is WordPress-specific.
Workflow
Phase 0 — Confirm intent
Before scraping, confirm:
- Source URL (the live site).
- Target host — Cloudflare Pages, Netlify, Vercel, generic static. Drives Phase 9.
- Same or different domain at the destination. Drives whether og:url, <link rel="canonical">, and JSON-LD @id stay absolute (same domain, which is the correct SEO behaviour) or get rewritten (different domain).
- What to do with analytics and forms. WP plugins for both can't run statically. Plausible gets replaced with the standard tracker (recipe in AGENTS.md); contact/search forms either get removed or wired through Pages Functions / Formspree / Netlify Forms, depending on the host.
Phase 1 — Discover URLs and pull XML sitemaps
Fetch the sitemap index. Try <root>/sitemap_index.xml (Yoast convention) first, then <root>/sitemap.xml. If the index references sub-sitemaps (page-sitemap.xml, post-sitemap.xml, …), fetch each and concatenate <loc> values into urls.txt. Skip image-sitemap entries.
Also fetch the XML sitemaps themselves and the Yoast XSL stylesheet now (recipe in AGENTS.md) — they aren't linked from HTML, so wget -p won't find them later.
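The discovery step can be sketched in Python as below. This is an illustration, not the recipe from AGENTS.md: `parse_sitemap` is a hypothetical helper that takes sitemap XML already fetched as text and returns whether it is an index or a URL set, plus its `<loc>` values. It reads only `<loc>` elements in the standard sitemap namespace, so `<image:loc>` entries are skipped.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap documents (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index" | "urlset", [loc, ...]) for one sitemap document."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == f"{SITEMAP_NS}sitemapindex" else "urlset"
    child = "sitemap" if kind == "index" else "url"
    locs = []
    for entry in root.findall(f"{SITEMAP_NS}{child}"):
        # Only the entry's own <loc>; nested <image:loc> lives in another namespace.
        loc = entry.find(f"{SITEMAP_NS}loc")
        if loc is not None and loc.text:
            locs.append(loc.text.strip())
    return kind, locs
```

For an index, fetch each returned URL and call `parse_sitemap` again; the concatenated "urlset" results become urls.txt.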
Phase 2 — Scrape in one shot
Critical: scrape every URL in a single wget invocation so its --convert-links pass sees all downloaded files and rewrites cross-page links correctly. Scraping URLs in separate runs leaves residual absolute links on whichever page was scraped first/last. Recipe in AGENTS.md.
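As a rough illustration of "one invocation", the sketch below builds a single wget command line that reads every URL from urls.txt. The flag selection is an assumption for illustration (the authoritative recipe is in AGENTS.md), and `build_wget_command` is a hypothetical helper name.

```python
def build_wget_command(url_file: str, out_dir: str, user_agent: str) -> list[str]:
    """Build one wget argv that covers every URL listed in url_file."""
    return [
        "wget",
        f"--input-file={url_file}",        # all URLs in a single run
        "--page-requisites",               # -p: CSS, JS, images referenced by pages
        "--convert-links",                 # -k: rewrite links to local copies
        "--adjust-extension",              # -E: save HTML with an .html suffix
        "--no-host-directories",           # -nH: no example.com/ top-level folder
        f"--directory-prefix={out_dir}",   # -P: where the files land
        f"--user-agent={user_agent}",      # browser UA instead of Wget/1.x
        "--header=Accept-Language: en-GB,en;q=0.9",
        "--wait=1",                        # be gentle with the origin
    ]
```

Pass the result to `subprocess.run(...)`. Because every page is downloaded in the same run, the `--convert-links` pass at the end sees all of them.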
Phase 3 — Pull assets the page-requisites pass missed
wget -p doesn't follow some assets because they appear only in og:image, apple-touch-icon, JSON-LD image/logo, msapplication-TileImage, or <link rel="modulepreload">. Audit and fetch the long tail. Recipe in AGENTS.md covers all three asset roots (uploads, themes, plugins).
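One way to audit that long tail is to parse each page's head and collect the URLs from those tags. The sketch below is an assumption about how such an audit could look, not the AGENTS.md recipe; `HeadAssetParser` and `head_asset_urls` are hypothetical names, and JSON-LD image/logo values are left out to keep it short.

```python
from html.parser import HTMLParser

# Tags whose URLs wget's page-requisites pass does not fetch.
META_KEYS = {"og:image", "msapplication-tileimage"}
LINK_RELS = {"apple-touch-icon", "modulepreload"}


class HeadAssetParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta":
            # og:image uses property=, msapplication-TileImage uses name=.
            key = (a.get("property") or a.get("name") or "").lower()
            if key in META_KEYS and a.get("content"):
                self.urls.append(a["content"])
        elif tag == "link":
            rels = set(a.get("rel", "").lower().split())
            if rels & LINK_RELS and a.get("href"):
                self.urls.append(a["href"])


def head_asset_urls(html: str) -> list[str]:
    """Return asset URLs found in og:image, tile-image, touch-icon and modulepreload tags."""
    parser = HeadAssetParser()
    parser.feed(html)
    return parser.urls
```

Diff the collected URLs against the files already on disk and fetch whatever is missing.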
Phase 4 — Convert paths to root-relative
wget -k produces a mix of ../wp-content/... (depth-relative) and bare wp-content/... (homepage). Both work locally but break the moment a page moves. Convert to root-relative /wp-content/... everywhere:
python3 scripts/rewrite-paths.py output/ urls.txt --source-domain example.com
The script derives the page-slug list from urls.txt, not from a directory walk — otherwise wget-grabbed archive directories like category/, feed/, author/, wp-json/ get wrongly classified as pages and their inter-page links get mis-rewritten.
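Deriving slugs from the URL list rather than the directory tree might look like the sketch below. It is not the code in scripts/rewrite-paths.py; `page_slugs` is a hypothetical helper, and skipping the homepage (empty path) is an assumption.

```python
from urllib.parse import urlparse


def page_slugs(urls: list[str]) -> set[str]:
    """Return the set of page paths (without leading/trailing slashes) named in urls."""
    slugs = set()
    for url in urls:
        path = urlparse(url.strip()).path.strip("/")
        if path:  # the homepage has an empty path and is not a slug
            slugs.add(path)
    return slugs
```

A directory such as category/ that wget downloaded but that never appears in urls.txt is then absent from the set, so it is not treated as a page.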
The script defaults to WordPress asset roots (wp-content, wp-includes). For non-WP sources, override with --asset-roots: e.g. --asset-roots sites/default/files,sites/default/themes for Drupal, --asset-roots content/images for Ghost. The rest of the rewriter is CMS-agnostic.
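The core of the asset rewrite can be illustrated with a single regular expression. This is a simplified sketch under stated assumptions, not the implementation of scripts/rewrite-paths.py: `make_root_relative` is a hypothetical function, it only touches src and href attributes (srcset and inline CSS are ignored), and it leaves inter-page links alone.

```python
import re


def make_root_relative(html: str, asset_roots=("wp-content", "wp-includes")) -> str:
    """Rewrite ../wp-content/... and bare wp-content/... to /wp-content/... in src/href."""
    roots = "|".join(re.escape(r.strip("/")) for r in asset_roots)
    pattern = re.compile(
        # attribute opener, any number of ../ segments, then one of the asset roots
        r"""(?P<attr>\b(?:src|href)=["'])(?:\.\./)*(?P<root>""" + roots + r")/"
    )
    return pattern.sub(r"\g<attr>/\g<root>/", html)
```

Already root-relative and absolute URLs do not match the pattern, so running it twice gives the same output as running it once.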
Phase 5 — Brand the static output
So future-you (or anyone reading view-source) can tell at a glance that this is the static clone, not the live WP install:
python3 scripts/insert-banner.py output/
Inserts an HTML comment after <!DOCTYPE html> on every page. Idempotent. Then replace the "Generated by Yoast SEO" attribution in wp-content/plugins/wordpress-seo/css/main-sitemap.xsl — recipe in AGENTS.md.
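Idempotent insertion boils down to "skip if the banner is already there". The sketch below shows the idea; it is not the code in scripts/insert-banner.py, and both the banner wording and the `insert_banner` name are placeholders.

```python
import re

# Placeholder banner text; the real script's wording may differ.
BANNER = "<!-- Static clone of a WordPress site; not the live install. -->"
DOCTYPE_RE = re.compile(r"<!DOCTYPE html[^>]*>", re.IGNORECASE)


def insert_banner(html: str, banner: str = BANNER) -> str:
    """Insert banner on the line after the doctype; do nothing if already present."""
    if banner in html:
        return html
    match = DOCTYPE_RE.search(html)
    if match is None:
        # No doctype found: fall back to putting the banner first.
        return banner + "\n" + html
    return html[: match.end()] + "\n" + banner + html[match.end():]
```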
Phase 6 — Replace WP runtime hooks
Three categories of WP-only markup that breaks once the backend is gone:
1. Comment forms, reply links, and dead head tags. One pass:
python3 scripts/strip-wp-runtime.py output/
Removes <div id="respond"> blocks (the comment form), comment-reply-link anchors in both block-theme and classic-theme variants, the comment-reply-js script tag and its underlying file, and dead <head> tags (REST API discovery, RSD, oEmbed alternates, RSS alternates, archive next links). Match-by-class throughout — no language assumptions about link text.
2. Plausible analytics proxy. The WP plugin proxies the script through /wp-content/uploads/<hash>/pa-XXX.js and posts events back to /wp-json/.... Both endpoints disappear. Replace the two-script block with the standard tracker — recipe in AGENTS.md.
3. Gravatar avatars. Self-host every distinct (hash, size) pair, drop the ?s=N&d=mm requests to a third party:
python3 scripts/selfhost-gravatars.py output/
Saves under avatars/ and rewrites every reference. Detects extension from response bytes (PNG fallback vs JPEG real avatar), keeps size variants separate (?s=40 and ?s=80 are different files).
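The two Gravatar details above (distinct (hash, size) pairs, extension from response bytes) can be sketched as follows. This is an illustration rather than the code in scripts/selfhost-gravatars.py; `gravatar_pairs` and `sniff_extension` are hypothetical helpers, and the "default" size label and "bin" fallback are assumptions.

```python
import html as htmllib
import re
from urllib.parse import parse_qs

# Matches secure.gravatar.com, 0.gravatar.com, www.gravatar.com, and so on.
GRAVATAR_RE = re.compile(
    r"""https?://(?:[a-z0-9-]+\.)?gravatar\.com/avatar/([0-9a-f]+)\?([^"'\s>]*)"""
)


def gravatar_pairs(page: str) -> set[tuple[str, str]]:
    """Return every distinct (hash, size) pair referenced in the page."""
    pairs = set()
    for digest, query in GRAVATAR_RE.findall(page):
        # WordPress emits &#038; or &amp; between parameters; decode first.
        params = parse_qs(htmllib.unescape(query))
        size = params.get("s", ["default"])[0]
        pairs.add((digest, size))
    return pairs


def sniff_extension(data: bytes) -> str:
    """Pick a file extension from the leading magic bytes of the response body."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    return "bin"
```

Each pair maps to one local file, so a name like avatars/<hash>-<size>.<ext> keeps ?s=40 and ?s=80 apart.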
After the scripts, audit remaining absolute source-domain URLs (recipe in AGENTS.md) and triage by case: author archives → strip the <a> wrapper, server-rendered iframes → drop the wrapping <p>, Gravity Forms script blocks → strip on gform-mention, etc.
Phase 7 — Copy robots.txt
Not linked from HTML; fetch it explicitly. Adjust the Sitemap: reference if the deployed sitemap path differs from the source.
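Adjusting the Sitemap: reference is a one-line substitution. A minimal sketch, assuming robots.txt has already been fetched as text and carries at most one Sitemap: line; `set_sitemap_url` is a hypothetical helper.

```python
import re

SITEMAP_LINE_RE = re.compile(r"^sitemap:.*$", re.IGNORECASE | re.MULTILINE)


def set_sitemap_url(robots_txt: str, sitemap_url: str) -> str:
    """Point the Sitemap: line at sitemap_url, appending one if none exists."""
    line = f"Sitemap: {sitemap_url}"
    if SITEMAP_LINE_RE.search(robots_txt):
        # A function replacement avoids backslash interpretation in the URL.
        return SITEMAP_LINE_RE.sub(lambda _: line, robots_txt)
    return robots_txt.rstrip("\n") + "\n" + line + "\n"
```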
Phase 8 — Verify locally
Serve from output/ with python3 -m http.server, then run the verify checklist in AGENTS.md:
- Every URL in urls.txt resolves to a file (no missed pages).
- No remaining https://<source-domain>/ outside the canonical / og:url / JSON-LD allow-list.
- No broken internal links from wget --spider.
- Spot-check the homepage and a deep page in a browser. Watch srcset images, sidebar widgets, and the header banner: those break silently if missed.
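The allow-list check for leftover source-domain URLs could be sketched as below. It is an assumption about how to implement the check, not the AGENTS.md checklist: `residual_source_urls` is a hypothetical helper that drops the canonical link, the og:url meta tag, and JSON-LD script blocks, then reports whatever absolute URLs remain.

```python
import re

# Tags where an absolute source-domain URL is expected and correct.
ALLOWED_BLOCKS = [
    re.compile(r"""<link\b[^>]*\brel=["']canonical["'][^>]*>""", re.I),
    re.compile(r"""<meta\b[^>]*\bproperty=["']og:url["'][^>]*>""", re.I),
    re.compile(r"<script\b[^>]*application/ld\+json[^>]*>.*?</script>", re.I | re.S),
]


def residual_source_urls(html: str, source_domain: str) -> list[str]:
    """Return absolute source-domain URLs that sit outside the allow-listed tags."""
    for block in ALLOWED_BLOCKS:
        html = block.sub("", html)
    url_re = re.compile(
        r"https?://" + re.escape(source_domain) + r"""/[^\s"'<>)]*"""
    )
    return url_re.findall(html)
```

An empty list for every page in output/ means the second check passes.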
Phase 9 — Deploy
Host-specific recipes in AGENTS.md:
- Cloudflare Pages: _redirects, _headers, "no build command, no output directory" defaults.
- Netlify: same _redirects / _headers syntax, plus netlify.toml.
- Vercel: vercel.json with redirects / headers.
- Generic: nginx try_files, Apache Options +MultiViews.
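For the hosts that read a _headers file, the format is a path pattern on its own line followed by indented header lines. The sketch below renders that format from a dict; `render_headers_file` is a hypothetical helper and the cache policy in the usage note is an illustrative assumption, not a recommendation from AGENTS.md.

```python
def render_headers_file(rules: dict[str, dict[str, str]]) -> str:
    """Render {path pattern: {header name: value}} in _headers file syntax."""
    lines = []
    for path, headers in rules.items():
        lines.append(path)  # path pattern, unindented
        for name, value in headers.items():
            lines.append(f"  {name}: {value}")  # headers indented under the path
    return "\n".join(lines) + "\n"
```

For example, `render_headers_file({"/wp-content/*": {"Cache-Control": "public, max-age=31536000, immutable"}})` produces a two-line rule that can be written to _headers at the publish root.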
Gotchas
These are the things that bit us. Don't repeat them.
- Cloudflare bot protection 403s the default Wget/1.x UA. Always set a real browser UA + Accept / Accept-Language headers (recipe). If you see 403 Forbidden after a burst of requests, that's it: back off, switch UA, retry.
- Cross-page link rewriting only works in a single wget invocation. wget's -k only rewrites to local paths it sees in the current run. If a page was downloaded in a separate invocation (e.g. to recover from a 403 on one URL), its links to the rest stay absolute. Solution: redo the full scrape once you have the right UA. Don't piecemeal it. If you're scraping at scale (10K+ URLs) and can't fit in one run, scrape in batches and re-run scripts/rewrite-paths.py afterwards as the canonical pass; -k's output is then redundant.
- Default publish directory by host. Cloudflare Pages serves the repo root when no build command is configured. Netlify and Vercel also default to root. If you scraped into output/, either move files to the repo root (git mv output/* .) or configure the host to publish from output/. Symptom of the wrong setup on Pages: every URL 404s with R2-style headers (access-control-allow-origin: *, cache-control: no-store) instead of a Pages-branded 404.
- WordPress Offload Media plugins route /wp-content/uploads/ to R2 / S3 buckets. wget may successfully fetch an image even when later direct access 404s (intermittent or partial bucket sync). Trust your local copy; that's why we scrape and self-host.
- Sitemaps and the Yoast XSL aren't linked from HTML. wget -p won't find them. Fetch explicitly in Phase 1.
- Filenames with ?ver=... query strings. wget keeps these as literal filenames; HTML uses %3F encoding. Standard servers (Pages, Netlify, Vercel, python -m http.server) URL-decode and serve correctly. Don't try to "clean these up" unless something actually breaks.
- og:url, canonical, JSON-LD stay absolute. They identify the canonical resource and are correct as-is when redeploying to the same domain. Only rewrite if changing domains.
- sed -i '' is macOS / BSD only. GNU sed needs sed -i (no empty-string argument). Recipes in AGENTS.md flag the macOS-isms; default to the Python scripts where there's a choice, since they're portable.
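The ?ver= filename gotcha is easier to trust once the round trip is visible. A small sketch using the standard library; `href_for_file` and `file_for_href` are hypothetical helpers, and the example path is made up.

```python
from urllib.parse import quote, unquote


def href_for_file(filename: str) -> str:
    """Encode a literal on-disk name so '?' is not parsed as a query separator."""
    # Keep '/', '=' and '&' readable; '?' becomes %3F.
    return quote(filename, safe="/=&")


def file_for_href(href: str) -> str:
    """What a URL-decoding static server looks up on disk for this href."""
    return unquote(href)
```

`href_for_file("wp-includes/css/style.min.css?ver=6.4")` gives `wp-includes/css/style.min.css%3Fver=6.4`, and `file_for_href` maps it back to the literal filename, which is why the files can be left as they are.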
Output structure
<repo-root>/
  index.html                  ← homepage
  <slug>/index.html           ← one per URL from sitemap
  wp-content/                 ← assets (themes, uploads, plugins)
  wp-includes/                ← block library CSS, et al.
  avatars/                    ← self-hosted Gravatars (Phase 6)
  sitemap_index.xml
  page-sitemap.xml            ← + any other child sitemaps
  wp-content/plugins/wordpress-seo/css/main-sitemap.xsl
  robots.txt
  _redirects                  ← optional, host-specific
  _headers                    ← optional, host-specific
Push to a git host and connect to the static host with no build command and no build output directory — defaults work.