web-scraper - SKILL.md Agent Skill

name: web-scraper description: Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages. compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs duckduckgo-search.

Web Scraper

Fetch, search, and extract content from websites.

When to use this skill

User asks to fetch or read a webpage / URL
User wants to search the internet for information
User needs to extract links, tables, or structured data from a website
User asks to crawl a JavaScript-rendered (dynamic) page
User wants web content converted to clean Markdown for analysis

Scripts overview

Script	Purpose	Dependencies
`fetch_page.py`	Fetch a URL and extract readable content as Markdown	`requests`, `beautifulsoup4`, `readability-lxml`, `html2text`
`search_web.py`	Search the web via DuckDuckGo	`ddgs`
`crawl_dynamic.py`	Crawl JS-rendered pages with a headless browser	`crawl4ai`
`extract_links.py`	Extract and categorize all links from a page	`requests`, `beautifulsoup4`

Steps

1. Install dependencies (first time only)

For lightweight scraping (static pages, search, link extraction):

pip install requests beautifulsoup4 readability-lxml html2text ddgs

For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):

pip install crawl4ai
crawl4ai-setup

Note: crawl4ai-setup downloads a Chromium browser (~150 MB). Only install if you actually need dynamic page support.

CRITICAL — Dependency Error Recovery: If ANY script below fails with an ImportError or "module not found" error, install the missing dependencies using the command above, then re-run the EXACT SAME script command that failed. Do NOT write inline Python code (python -c "...") or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.

2. Fetch a web page (static — recommended first choice)

Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.

python scripts/fetch_page.py "URL"

Options:

--raw — Output full page Markdown instead of extracted article content
--selector "CSS_SELECTOR" — Extract only elements matching the CSS selector (e.g. ".article-body", "table", "#content")
--save OUTPUT_PATH — Also save output to a file
--max-length N — Truncate output to N characters (default: no limit)

Examples:

# Fetch an article
python fetch_page.py "https://example.com/article"

# Extract only tables
python fetch_page.py "https://example.com/data" --selector "table"

# Fetch raw full-page markdown, limit to 5000 chars
python fetch_page.py "https://example.com" --raw --max-length 5000

3. Search the web

Search using DuckDuckGo (no API key required).

python scripts/search_web.py "search query"

Options:

--max-results N — Number of results to return (default: 10)
--region REGION — Region code, e.g. cn-zh, us-en, jp-jp (default: wt-wt for worldwide)
--news — Search news instead of general web

Examples:

# General search
python search_web.py "Python web scraping best practices 2025"

# News search, Chinese region, 5 results
python search_web.py "AI 最新进展" --news --region cn-zh --max-results 5

4. Crawl a dynamic / JavaScript-rendered page

Use this only when fetch_page.py returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).

python scripts/crawl_dynamic.py "URL"

Options:

--wait N — Wait N seconds after page load for JS to finish (default: 3)
--selector "CSS_SELECTOR" — Wait for a specific element to appear before extracting
--scroll — Scroll to bottom of page to trigger lazy loading
--save OUTPUT_PATH — Also save output to a file
--max-length N — Truncate output to N characters

5. Extract links from a page

Extract all links with their text labels, categorized by type (internal, external, resource).

python scripts/extract_links.py "URL"

Options:

--filter PATTERN — Only show links matching a regex pattern (applied to URL)
--external-only — Only show external links
--json — Output as JSON instead of Markdown

Decision guide: which script to use

Start with fetch_page.py — handles 90% of websites (articles, docs, blogs, wikis).
If fetch_page.py returns empty/garbled content → try crawl_dynamic.py (the page likely needs JavaScript).
Need to find URLs first? → Use search_web.py to discover relevant pages.
Need to navigate a site structure? → Use extract_links.py to map out links, then fetch individual pages.

Common workflows

Research a topic

search_web.py "topic" → get relevant URLs
fetch_page.py "best_url" → read the content
Repeat for multiple sources, then synthesize

Scrape structured data from a page

fetch_page.py "url" --selector "table" → extract tables
Or fetch_page.py "url" --selector ".product-card" → extract specific elements

Crawl a modern web app (SPA)

crawl_dynamic.py "url" --wait 5 --scroll → full JS-rendered content

Edge cases

Paywalled sites: May return partial content or login pages. Inform the user.
Rate limiting / CAPTCHAs: If requests fail with 403/429, wait and retry or inform the user.
Very large pages: Use --max-length to truncate output and avoid overwhelming the context window.
Encoding issues: Scripts handle UTF-8 by default. Exotic encodings may need manual adjustment.
Robots.txt: These scripts do not check robots.txt. Use responsibly and respect website terms of service.

Scripts

fetch_page.py — Fetch and extract readable content as Markdown
search_web.py — Search the web via DuckDuckGo
crawl_dynamic.py — Crawl JavaScript-rendered pages
extract_links.py — Extract and categorize page links