name: web-scraping description: > Web scraping guide for sub-agents. Covers Firecrawl CLI fallback scraping when WebFetch fails (JS-heavy sites, anti-bot walls, 403 errors, empty content) and advanced capabilities like structured data extraction with Zod schemas, multi-page crawls, and search-plus-scrape. Use when WebFetch returns garbage or empty pages, when you need typed data from a page (prices, features, specs), or when you need to ingest multiple pages from a site. user-invocable: false
Web Scraping Field Card
Required tools for consuming agents: WebFetch, Bash(bunx firecrawl-cli *), Read
Integration: Any newsroom sub-agent should consult this skill when WebFetch fails or when structured/multi-page scraping is needed.
What Do You Need?
| Need | Tool | Details |
|---|---|---|
| Page content as markdown | WebFetch first, then Firecrawl CLI | See below |
| Structured data from a page (prices, features, specs) | Firecrawl extract | Read references/structured-extraction.md |
| Multiple pages from one site | Firecrawl crawl | Read references/crawling.md |
| Search the web + scrape results | Firecrawl search | Read references/crawling.md |
Getting Page Content
Step 1: Try WebFetch First
WebFetch is free, fast, and already available. Use it by default.
Works for: blogs, news articles, documentation, static pages, most forum threads.
Step 2: Recognize Failure
Switch to Firecrawl CLI when WebFetch returns:
- Empty or near-empty content (page requires JavaScript rendering)
- 403/429 errors (anti-bot protection)
- Mangled HTML with no useful text (client-side rendered SPA)
- Login walls or cookie consent overlays blocking content
Do NOT retry WebFetch on the same URL -- it will fail again.
Step 3: Firecrawl CLI Scrape
Requires: firecrawl-cli (install: npm install -g firecrawl-cli or use via bunx firecrawl-cli). Authenticates via FIRECRAWL_API_KEY env var or firecrawl auth --api-key <key>.
If firecrawl-cli is not installed or FIRECRAWL_API_KEY is unset, skip to Step 4 (Report Gaps). Do not retry or attempt workarounds.
Output to stdout (default -- pipe or capture as needed):
bunx firecrawl-cli scrape "<url>"
Output to file (more token-efficient -- read from disk instead of context):
bunx firecrawl-cli scrape "<url>" -o /tmp/scrape-output.md
Then use the Read tool on /tmp/scrape-output.md to pull only what you need into context.
Handles: JS rendering, dynamic content, basic anti-bot bypass, clean Markdown output (strips nav, headers, footers with --only-main-content).
Does NOT handle: login-gated content, CAPTCHAs, form filling, aggressive Cloudflare Turnstile.
For multiple URLs, scrape each separately to different files:
bunx firecrawl-cli scrape "<url1>" -o /tmp/scrape-1.md
bunx firecrawl-cli scrape "<url2>" -o /tmp/scrape-2.md
The CLI is beta (released Jan 2026) -- expect quirks and flag changes. Run bunx firecrawl-cli scrape --help for current options.
Step 4: Report Gaps Honestly
If both WebFetch and Firecrawl fail:
- Note which URL was inaccessible and why
- Do not fabricate content or silently skip the source
- Move on to other sources