wp-static-clone

name: wp-static-clone
description: >
  Clones a live WordPress (or other CMS-driven) site into a static HTML site deployable on any static host (Cloudflare Pages, Netlify, Vercel, S3+CloudFront, plain Apache/nginx). Use when the user wants to "scrape", "freeze", "archive", "static-ify", or "move to [host]" a WordPress site, or asks to turn a sitemap into deployable static HTML. Pulls every URL from sitemap_index.xml, fetches all assets, rewrites paths to be root-relative, strips WP runtime markup, and outputs a flat directory ready to deploy with no build command.

Turn a live WordPress site into a static HTML clone deployable on any static host. Driven by the site's XML sitemap. Handles the WordPress-specific gotchas — Cloudflare bot protection, mid-scrape link rewriting, proxied analytics, R2-offloaded uploads, comment-form runtime, Yoast attribution, Gravatar privacy — that a naïve wget run misses.

Recipes live in AGENTS.md; reusable scripts in scripts/. This file is the workflow, gotchas, and output structure.

When to use

Trigger on requests like:

"Scrape this WordPress site for [host]"
"Freeze [domain] as static HTML"
"Pull all the pages from this sitemap and turn them into static files"
"Move this WP site to [host] with no build step"

The broad shape (sitemap → wget → root-relative paths → static host) generalises to any CMS that emits a standard XML sitemap. The runtime cleanup (comment forms, Plausible proxy, Gravatar, Yoast) is WordPress-specific.

Workflow

Phase 0 — Confirm intent

Before scraping, confirm:

Source URL (the live site).
Target host — Cloudflare Pages, Netlify, Vercel, generic static. Drives Phase 9.
Same or different domain at the destination. Drives whether og:url, <link rel="canonical">, and JSON-LD @id stay absolute (same domain — correct SEO behaviour) or get rewritten (different domain).
What to do with analytics and forms. WP plugins for both can't run statically. Plausible gets replaced with the standard tracker (recipe in AGENTS.md); contact/search forms either get removed or wired through Pages Functions / Formspree / Netlify Forms — host-specific.

Phase 1 — Discover URLs and pull XML sitemaps

Fetch the sitemap index. Try <root>/sitemap_index.xml (Yoast convention) first, then <root>/sitemap.xml. If the index references sub-sitemaps (page-sitemap.xml, post-sitemap.xml, …), fetch each and concatenate <loc> values into urls.txt. Skip image-sitemap entries.

Also fetch the XML sitemaps themselves and the Yoast XSL stylesheet now (recipe in AGENTS.md) — they aren't linked from HTML, so wget -p won't find them later.
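The discovery step can be sketched in Python as follows. This is a minimal illustration, not the recipe from AGENTS.md: the function names extract_locs and collect_urls are hypothetical, and fetch is assumed to be any callable that returns the XML text for a URL. Matching only <loc> elements in the sitemaps.org namespace that sit directly under <sitemap> or <url> leaves out nested image:loc entries.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_text: str) -> tuple[str, list[str]]:
    """Return (kind, locs), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == f"{SITEMAP_NS}sitemapindex" else "urlset"
    child = "sitemap" if kind == "index" else "url"
    locs = [
        loc.text.strip()
        for entry in root.findall(f"{SITEMAP_NS}{child}")
        # Direct <loc> children only; image:loc lives in another namespace.
        for loc in entry.findall(f"{SITEMAP_NS}loc")
        if loc.text
    ]
    return kind, locs


def collect_urls(index_url: str, fetch) -> list[str]:
    """Follow one level of sub-sitemaps and return every page URL."""
    kind, locs = extract_locs(fetch(index_url))
    if kind == "urlset":
        return locs  # the "index" was already a plain sitemap
    urls: list[str] = []
    for sub_sitemap in locs:
        urls.extend(extract_locs(fetch(sub_sitemap))[1])
    return urls
```

Writing the result to urls.txt is then a one-liner such as `Path("urls.txt").write_text("\n".join(urls) + "\n")`.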

Phase 2 — Scrape in one shot

Critical: scrape every URL in a single wget invocation so its --convert-links pass sees all downloaded files and rewrites cross-page links correctly. Scraping URLs in separate runs leaves residual absolute links on whichever page was scraped first/last. Recipe in AGENTS.md.
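To make the "one invocation" rule concrete, here is a hedged sketch that assembles a single wget command line from urls.txt. The function name build_wget_command and the specific flag selection are assumptions for illustration; the authoritative recipe is the one in AGENTS.md. All flags shown are standard GNU wget options.

```python
def build_wget_command(url_file: str, out_dir: str, user_agent: str) -> list[str]:
    """Build ONE wget invocation covering every URL in url_file."""
    return [
        "wget",
        f"--input-file={url_file}",      # all URLs in the same run
        "--page-requisites",             # -p: CSS, JS, images per page
        "--convert-links",               # -k: runs once, after all downloads
        "--force-directories",           # keep the URL path as directories
        "--no-host-directories",         # no example.com/ top-level folder
        f"--directory-prefix={out_dir}",
        f"--user-agent={user_agent}",    # a browser UA, not Wget/1.x
        "--header=Accept: text/html,application/xhtml+xml",
        "--header=Accept-Language: en-US,en;q=0.9",
        "--wait=1",                      # assumption: gentle pacing
    ]
```

The list can be passed to `subprocess.run(cmd, check=False)`; wget exits non-zero when any single URL fails, so inspecting its log is usually more informative than the exit code alone.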

Phase 3 — Pull assets the page-requisites pass missed

Some assets aren't -p-followed because they appear only in og:image, apple-touch-icon, JSON-LD image/logo, msapplication-TileImage, or <link rel="modulepreload">. Audit and fetch the long tail. Recipe in AGENTS.md covers all three asset roots (uploads, themes, plugins).
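One way to audit the long tail is to parse each page's <head> and list the URLs that sit in tags wget does not follow. The sketch below is an assumption-laden illustration (the class HeadAssetFinder and helper find_head_assets are hypothetical names) and covers the meta and link cases only; JSON-LD image/logo values would need a separate json.loads pass.

```python
from html.parser import HTMLParser

# Tag/attribute combinations named above that the page-requisites pass skips.
LINK_RELS = {"apple-touch-icon", "modulepreload"}
META_KEYS = {"og:image", "msapplication-TileImage"}


class HeadAssetFinder(HTMLParser):
    """Collect asset URLs referenced only from meta/link tags."""

    def __init__(self) -> None:
        super().__init__()
        self.assets: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = attr.get("property") or attr.get("name")
            if key in META_KEYS and attr.get("content"):
                self.assets.append(attr["content"])
        elif tag == "link":
            rels = set((attr.get("rel") or "").lower().split())
            if rels & LINK_RELS and attr.get("href"):
                self.assets.append(attr["href"])


def find_head_assets(html: str) -> list[str]:
    finder = HeadAssetFinder()
    finder.feed(html)
    return finder.assets
```

Any returned URL that has no matching file under the output directory is a candidate for an explicit fetch.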

Phase 4 — Convert paths to root-relative

wget -k produces a mix of ../wp-content/... (depth-relative) and bare wp-content/... (homepage). Both work locally but break the moment a page moves. Convert to root-relative /wp-content/... everywhere:

python3 scripts/rewrite-paths.py output/ urls.txt --source-domain example.com

The script derives the page-slug list from urls.txt, not from a directory walk — otherwise wget-grabbed archive directories like category/, feed/, author/, wp-json/ get wrongly classified as pages and their inter-page links get mis-rewritten.

The script defaults to WordPress asset roots (wp-content, wp-includes). For non-WP sources, override with --asset-roots: e.g. --asset-roots sites/default/files,sites/default/themes for Drupal, --asset-roots content/images for Ghost. The rest of the rewriter is CMS-agnostic.
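The asset-path half of this conversion can be illustrated with a small regex. This is a simplified sketch, not the contents of scripts/rewrite-paths.py: the function name to_root_relative is hypothetical, it handles asset roots only (not inter-page links, which the real script derives from urls.txt), and it assumes paths appear right after a quote, an opening parenthesis, or a comma (as in srcset).

```python
import re


def to_root_relative(html: str, asset_roots=("wp-content", "wp-includes")) -> str:
    """Rewrite ../wp-content/... and bare wp-content/... to /wp-content/..."""
    roots = "|".join(re.escape(root) for root in asset_roots)
    pattern = re.compile(
        r"""(?P<pre>["'(]|,\s*)"""   # attribute quote, url( or srcset comma
        r"""(?:\.\./)*"""            # any number of ../ segments
        r"""(?P<root>%s)/""" % roots
    )
    return pattern.sub(lambda m: f"{m.group('pre')}/{m.group('root')}/", html)
```

Already root-relative paths and absolute URLs are left alone because the asset root must follow the delimiter directly (apart from ../ segments).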

Phase 5 — Brand the static output

So future-you (or anyone reading view-source) can tell at a glance that this is the static clone, not the live WP install:

python3 scripts/insert-banner.py output/

Inserts an HTML comment after <!DOCTYPE html> on every page. Idempotent. Then replace the "Generated by Yoast SEO" attribution in wp-content/plugins/wordpress-seo/css/main-sitemap.xsl — recipe in AGENTS.md.
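The idempotent insert described above might look like the following. It is a sketch under stated assumptions: the banner text and the name insert_banner are made up for illustration and are not taken from scripts/insert-banner.py.

```python
import re

# Hypothetical banner text; the real script may use different wording.
BANNER = "<!-- static clone of the live WordPress site -->"
DOCTYPE_RE = re.compile(r"<!DOCTYPE html[^>]*>", re.IGNORECASE)


def insert_banner(html: str, banner: str = BANNER) -> str:
    """Insert banner right after the doctype; do nothing if already present."""
    if banner in html:
        return html  # idempotent: a second run changes nothing
    match = DOCTYPE_RE.search(html)
    if match is None:
        return html  # not a full HTML page (fragment, feed, ...)
    return html[: match.end()] + "\n" + banner + html[match.end():]
```

Checking for the exact banner string before inserting is what makes repeated runs safe.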

Phase 6 — Replace WP runtime hooks

Three categories of WP-only markup that breaks once the backend is gone:

1. Comment forms, reply links, and dead head tags. One pass:

python3 scripts/strip-wp-runtime.py output/

Removes <div id="respond"> blocks (the comment form), comment-reply-link anchors in both block-theme and classic-theme variants, the comment-reply-js script tag and its underlying file, and dead <head> tags (REST API discovery, RSD, oEmbed alternates, RSS alternates, archive next links). Match-by-class throughout — no language assumptions about link text.

2. Plausible analytics proxy. The WP plugin proxies the script through /wp-content/uploads/<hash>/pa-XXX.js and posts events back to /wp-json/.... Both endpoints disappear. Replace the two-script block with the standard tracker — recipe in AGENTS.md.

3. Gravatar avatars. Self-host every distinct (hash, size) pair so that pages stop sending ?s=N&d=mm requests to a third party:

python3 scripts/selfhost-gravatars.py output/

Saves under avatars/ and rewrites every reference. Detects the file extension from the response bytes (PNG for the fallback image, JPEG for a real avatar) and keeps size variants separate (?s=40 and ?s=80 are different files).
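Detecting the extension from response bytes comes down to checking the file signature. The sketch below assumes only PNG and JPEG responses and uses a hypothetical avatars/<hash>-<size>.<ext> naming scheme; the real scripts/selfhost-gravatars.py may name files differently.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # 8-byte PNG signature
JPEG_MAGIC = b"\xff\xd8\xff"       # JPEG start-of-image marker


def detect_extension(data: bytes) -> str:
    """Pick the extension from the bytes, not from the URL or headers."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpg"
    raise ValueError("unrecognised image signature")


def avatar_path(email_hash: str, size: int, data: bytes) -> str:
    """One file per (hash, size) pair so ?s=40 and ?s=80 never collide."""
    return f"avatars/{email_hash}-{size}.{detect_extension(data)}"
```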

After the scripts, audit remaining absolute source-domain URLs (recipe in AGENTS.md) and triage by case: author archives → strip the <a> wrapper; server-rendered iframes → drop the wrapping <p>; Gravity Forms script blocks → strip any block that mentions gform; and so on.
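A rough version of that audit, for illustration only: remove the tags on the allow-list (canonical, og:url, JSON-LD), then report whatever absolute source-domain URLs are left. The function name residual_absolute_urls is hypothetical, and regex-based tag matching is an assumption that works on typical WordPress output but is not a full HTML parser.

```python
import re

# Tags whose absolute URLs are expected to stay (same-domain redeploy).
ALLOWED_TAGS = [
    r'<link\b[^>]*\brel=["\']canonical["\'][^>]*>',
    r'<meta\b[^>]*\bproperty=["\']og:url["\'][^>]*>',
    r'<script\b[^>]*\btype=["\']application/ld\+json["\'][^>]*>.*?</script>',
]


def residual_absolute_urls(html: str, source_domain: str) -> list[str]:
    """Return absolute source-domain URLs outside the allow-listed tags."""
    for pattern in ALLOWED_TAGS:
        html = re.sub(pattern, "", html, flags=re.IGNORECASE | re.DOTALL)
    url_re = re.compile(
        r"https?://" + re.escape(source_domain) + r"(?![\w.-])[^\s\"'<>)]*",
        re.IGNORECASE,
    )
    return url_re.findall(html)
```

Each hit can then be triaged by hand into one of the cases listed above.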

Phase 7 — Copy `robots.txt`

Not linked from HTML; fetch it explicitly. Adjust the Sitemap: reference if the deployed sitemap path differs from the source.
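Adjusting the Sitemap: reference can be done with a few lines of Python. This sketch assumes robots.txt carries a single Sitemap: line (every such line is replaced with the same value) and uses the hypothetical name rewrite_sitemap_lines.

```python
def rewrite_sitemap_lines(robots_txt: str, new_sitemap_url: str) -> str:
    """Point every 'Sitemap:' line at new_sitemap_url; keep other lines."""
    out = []
    for line in robots_txt.splitlines():
        if line.strip().lower().startswith("sitemap:"):
            out.append(f"Sitemap: {new_sitemap_url}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```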

Phase 8 — Verify locally

Serve from output/ with python3 -m http.server, then run the verify checklist in AGENTS.md:

Every URL in urls.txt resolves to a file (no missed pages).
No remaining https://<source-domain>/ outside the canonical / og:url / JSON-LD allow-list.
No broken internal links from wget --spider.
Spot-check the homepage and a deep page in a browser. Watch srcset images, sidebar widgets, and the header banner — those break silently if missed.
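The first check in the list (every URL resolves to a file) can be sketched as below. It is an illustration rather than the AGENTS.md checklist itself: expected_file and missing_pages are hypothetical names, and the mapping assumes the layout shown under "Output structure", where a URL ending in / is stored as <slug>/index.html.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def expected_file(url: str, root: str) -> Path:
    """Map a sitemap URL to the file a static host would serve for it."""
    raw_path = urlsplit(url).path
    target = Path(root) / unquote(raw_path).strip("/")
    # Directory-style URLs (trailing slash, or no file extension) are
    # served from index.html inside the matching directory.
    if raw_path in ("", "/") or raw_path.endswith("/") or not target.suffix:
        return target / "index.html"
    return target


def missing_pages(urls: list[str], root: str) -> list[str]:
    """Return the URLs from urls.txt that have no file under root."""
    return [url for url in urls if not expected_file(url, root).is_file()]
```

An empty list from missing_pages means no page from urls.txt was missed during the scrape.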

Phase 9 — Deploy

Host-specific recipes in AGENTS.md:

Cloudflare Pages — _redirects, _headers, "no build command, no output directory" defaults.
Netlify — same _redirects / _headers syntax, plus netlify.toml.
Vercel — vercel.json with redirects / headers.
Generic — nginx try_files, Apache Options +MultiViews.

Gotchas

These are the things that bit us. Don't repeat them.

Cloudflare bot protection 403s the default Wget/1.x UA. Always set a real browser UA + Accept / Accept-Language headers (recipe). If you see 403 Forbidden after a burst of requests, that's it — back off, switch UA, retry.
Cross-page link rewriting only works in a single wget invocation. wget's -k only rewrites to local paths it sees in the current run. If a page was downloaded in a separate invocation (e.g. to recover from a 403 on one URL), its links to the rest stay absolute. Solution: redo the full scrape once you have the right UA. Don't piecemeal it. If you're scraping at scale (10K+ URLs) and can't fit in one run, scrape in batches and re-run scripts/rewrite-paths.py afterwards as the canonical pass — -k's output is then redundant.
Default publish directory by host. Cloudflare Pages serves the repo root when no build command is configured. Netlify and Vercel also default to root. If you scraped into output/, either move files to the repo root (git mv output/* .) or configure the host to publish from output/. Symptom of the wrong setup on Pages: every URL 404s with R2-style headers (access-control-allow-origin: *, cache-control: no-store) instead of a Pages-branded 404.
WordPress Offload Media plugins route /wp-content/uploads/ to R2 / S3 buckets. wget may successfully fetch an image even when later direct access 404s (intermittent or partial bucket sync). Trust your local copy — that's why we scrape and self-host.
Sitemaps and the Yoast XSL aren't linked from HTML. wget -p won't find them. Fetch explicitly in Phase 1.
Filenames with ?ver=... query strings. wget keeps these as literal filenames; HTML uses %3F encoding. Standard servers (Pages, Netlify, Vercel, python -m http.server) URL-decode and serve correctly. Don't try to "clean these up" unless something actually breaks.
og:url, canonical, JSON-LD stay absolute. They identify the canonical resource and are correct as-is when redeploying to the same domain. Only rewrite if changing domains.
sed -i '' is macOS / BSD only. GNU sed needs sed -i (no empty-string argument). Recipes in AGENTS.md flag the macOS-isms; default to the Python scripts where there's a choice — they're portable.
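For the ?ver=... gotcha above, the following sketch shows why the %3F form in HTML still reaches the literal filename on disk: percent-decoding the request path, as a standard server does, yields the name wget saved. The helper name local_filename_for is hypothetical.

```python
from urllib.parse import unquote


def local_filename_for(href: str) -> str:
    """Percent-decode an href path to the literal filename wget saved."""
    return unquote(href).lstrip("/")
```

Because the encoded ? is part of the path rather than a query-string delimiter, no renaming is needed for these files to be served.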

Output structure

<repo-root>/
  index.html                 ← homepage
  <slug>/index.html          ← one per URL from sitemap
  wp-content/                ← assets (themes, uploads, plugins)
  wp-includes/               ← block library CSS, et al.
  avatars/                   ← self-hosted Gravatars (Phase 6)
  sitemap_index.xml
  page-sitemap.xml           ← + any other child sitemaps
  wp-content/plugins/wordpress-seo/css/main-sitemap.xsl
  robots.txt
  _redirects                 ← optional, host-specific
  _headers                   ← optional, host-specific

Push to a git host and connect to the static host with no build command and no build output directory — defaults work.
