web-scraping


Web scraping guide for sub-agents. Covers Firecrawl CLI fallback scraping when WebFetch fails (JS-heavy sites, anti-bot walls, 403 errors, empty content) and advanced capabilities like structured data extraction with Zod schemas, multi-page crawls, and search-plus-scrape. Use when WebFetch returns garbage or empty pages, when you need typed data from a page (prices, features, specs), or when you need to ingest multiple pages from a site.

By nathanvale. Updated 2/17/2026

name: web-scraping
description: >
  Web scraping guide for sub-agents. Covers Firecrawl CLI fallback scraping when WebFetch fails (JS-heavy sites, anti-bot walls, 403 errors, empty content) and advanced capabilities like structured data extraction with Zod schemas, multi-page crawls, and search-plus-scrape. Use when WebFetch returns garbage or empty pages, when you need typed data from a page (prices, features, specs), or when you need to ingest multiple pages from a site.
user-invocable: false

Web Scraping Field Card

Required tools for consuming agents: WebFetch, Bash(bunx firecrawl-cli *), Read

Integration: Any newsroom sub-agent should consult this skill when WebFetch fails or when structured/multi-page scraping is needed.

What Do You Need?

| Need | Tool | Details |
|---|---|---|
| Page content as markdown | WebFetch first, then Firecrawl CLI | See "Getting Page Content" below |
| Structured data from a page (prices, features, specs) | Firecrawl extract | Read references/structured-extraction.md |
| Multiple pages from one site | Firecrawl crawl | Read references/crawling.md |
| Search the web + scrape results | Firecrawl search | Read references/crawling.md |

Getting Page Content

Step 1: Try WebFetch First

WebFetch is free, fast, and already available. Use it by default.

Works for: blogs, news articles, documentation, static pages, most forum threads.

Step 2: Recognize Failure

Switch to Firecrawl CLI when WebFetch returns:

  • Empty or near-empty content (page requires JavaScript rendering)
  • 403/429 errors (anti-bot protection)
  • Mangled HTML with no useful text (client-side rendered SPA)
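  • Login walls or cookie consent overlays blocking content (note: Firecrawl does not handle login-gated content either, so a true login wall usually ends at Step 4)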

Do NOT retry WebFetch on the same URL -- it will fail again.
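
The failure signals above can be sketched as a small shell check for a harness or wrapper script that has broader shell access than the sub-agent itself. This is only an illustration: needs_firecrawl, MIN_BYTES, and the 200-byte cutoff for "near-empty" are assumptions, not part of WebFetch or Firecrawl, and the check cannot detect mangled SPA markup or overlays.

```shell
# Hypothetical helper: decide whether a WebFetch result warrants the Firecrawl fallback.
# $1 = HTTP status code (may be empty if unknown)
# $2 = path to a file holding the text WebFetch returned
# MIN_BYTES is an assumed cutoff for "near-empty"; tune it for your sources.
MIN_BYTES="${MIN_BYTES:-200}"

needs_firecrawl() {
  nf_status="$1"
  nf_file="$2"

  # 403/429 usually indicate anti-bot protection or rate limiting.
  case "$nf_status" in
    403|429) return 0 ;;
  esac

  # A missing or empty file suggests the page needs JavaScript rendering.
  if [ ! -s "$nf_file" ]; then
    return 0
  fi

  # Near-empty content is treated the same way.
  nf_bytes=$(wc -c < "$nf_file" | tr -d ' ')
  if [ "$nf_bytes" -lt "$MIN_BYTES" ]; then
    return 0
  fi

  # Otherwise the WebFetch result looks usable.
  return 1
}
```

The function returns success (0) when the fallback looks warranted, so it can be used directly in an if statement before moving to Step 3.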

Step 3: Firecrawl CLI Scrape

Requires: firecrawl-cli (install: npm install -g firecrawl-cli or use via bunx firecrawl-cli). Authenticates via FIRECRAWL_API_KEY env var or firecrawl auth --api-key <key>.

If firecrawl-cli is not installed or FIRECRAWL_API_KEY is unset, skip to Step 4 (Report Gaps). Do not retry or attempt workarounds.
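
A minimal sketch of that precondition check, assuming the bunx route used throughout this guide. The function name firecrawl_ready is hypothetical, and the check only covers the FIRECRAWL_API_KEY env var; credentials stored through firecrawl auth would not be detected by it.

```shell
# Hypothetical pre-flight check before attempting a Firecrawl scrape.
# Only the env-var authentication route is checked here.
firecrawl_ready() {
  # The guide invokes the CLI through bunx, so bunx must be on PATH.
  if ! command -v bunx >/dev/null 2>&1; then
    echo "bunx not found on PATH; cannot run firecrawl-cli" >&2
    return 1
  fi

  # An unset or empty key means the scrape cannot authenticate.
  if [ -z "${FIRECRAWL_API_KEY:-}" ]; then
    echo "FIRECRAWL_API_KEY is unset" >&2
    return 1
  fi

  return 0
}
```

When firecrawl_ready fails, the message on stderr can be reused as the reason in the Step 4 gap report.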

Output to stdout (default -- pipe or capture as needed):

bunx firecrawl-cli scrape "<url>"

Output to file (more token-efficient -- read from disk instead of context):

bunx firecrawl-cli scrape "<url>" -o /tmp/scrape-output.md

Then use the Read tool on /tmp/scrape-output.md to pull only what you need into context.

Handles: JS rendering, dynamic content, basic anti-bot bypass, clean Markdown output (strips nav, headers, footers with --only-main-content).

Does NOT handle: login-gated content, CAPTCHAs, form filling, aggressive Cloudflare Turnstile.

For multiple URLs, scrape each separately to different files:

bunx firecrawl-cli scrape "<url1>" -o /tmp/scrape-1.md
bunx firecrawl-cli scrape "<url2>" -o /tmp/scrape-2.md
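
For longer URL lists, the same pattern can be wrapped in a loop. The sketch below assumes only the scrape "<url>" -o <file> form shown above; scrape_all, SCRAPE_CMD, and OUT_DIR are illustrative names, and SCRAPE_CMD exists purely so the loop can be dry-run with a stub instead of the real CLI.

```shell
# Sketch: scrape each URL argument to its own numbered file.
# SCRAPE_CMD and OUT_DIR are illustrative knobs, not firecrawl-cli features.
SCRAPE_CMD="${SCRAPE_CMD:-bunx firecrawl-cli scrape}"
OUT_DIR="${OUT_DIR:-/tmp}"

scrape_all() {
  sa_i=1
  sa_rc=0
  for sa_url in "$@"; do
    sa_out="$OUT_DIR/scrape-$sa_i.md"
    # $SCRAPE_CMD is left unquoted on purpose: it holds a command plus its arguments.
    if $SCRAPE_CMD "$sa_url" -o "$sa_out"; then
      echo "ok $sa_url $sa_out"
    else
      # Keep going, but remember that at least one source failed.
      echo "failed $sa_url" >&2
      sa_rc=1
    fi
    sa_i=$((sa_i + 1))
  done
  return "$sa_rc"
}
```

Each success line names the output file, so the Read tool can be pointed at only the files that were actually written; a non-zero return means at least one URL belongs in the Step 4 report.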

The CLI is beta (released Jan 2026) -- expect quirks and flag changes. Run bunx firecrawl-cli scrape --help for current options.

Step 4: Report Gaps Honestly

If both WebFetch and Firecrawl fail:

  • Note which URL was inaccessible and why
  • Do not fabricate content or silently skip the source
  • Move on to other sources
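
One way to keep gap reporting consistent is to append a line per inaccessible source to a small log. This is a sketch under assumptions: report_gap, GAPS_FILE, and the tab-separated layout are illustrative choices, not something this guide or Firecrawl prescribes.

```shell
# Sketch: record each inaccessible source so it is reported rather than silently dropped.
# GAPS_FILE and the column layout (url, last tool tried, reason) are illustrative.
GAPS_FILE="${GAPS_FILE:-/tmp/scrape-gaps.tsv}"

report_gap() {
  # $1 = URL, $2 = last tool tried, $3 = short reason
  printf '%s\t%s\t%s\n' "$1" "$2" "$3" >> "$GAPS_FILE"
}
```

A call such as report_gap "<url>" "firecrawl" "403 after WebFetch and Firecrawl" leaves one row per failed source, which can be quoted in the final write-up.
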
Install via CLI
npx skills add https://github.com/nathanvale/side-quest-plugins --skill web-scraping