web-scraper

star 484

Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.

AIDotNet By AIDotNet schedule Updated 3/20/2026

name: web-scraper description: Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages. compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs duckduckgo-search.

Web Scraper

Fetch, search, and extract content from websites.

When to use this skill

  • User asks to fetch or read a webpage / URL
  • User wants to search the internet for information
  • User needs to extract links, tables, or structured data from a website
  • User asks to crawl a JavaScript-rendered (dynamic) page
  • User wants web content converted to clean Markdown for analysis

Scripts overview

Script Purpose Dependencies
fetch_page.py Fetch a URL and extract readable content as Markdown requests, beautifulsoup4, readability-lxml, html2text
search_web.py Search the web via DuckDuckGo ddgs
crawl_dynamic.py Crawl JS-rendered pages with a headless browser crawl4ai
extract_links.py Extract and categorize all links from a page requests, beautifulsoup4

Steps

1. Install dependencies (first time only)

For lightweight scraping (static pages, search, link extraction):

pip install requests beautifulsoup4 readability-lxml html2text ddgs

For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):

pip install crawl4ai
crawl4ai-setup

Note: crawl4ai-setup downloads a Chromium browser (~150 MB). Only install if you actually need dynamic page support.

CRITICAL — Dependency Error Recovery: If ANY script below fails with an ImportError or "module not found" error, install the missing dependencies using the command above, then re-run the EXACT SAME script command that failed. Do NOT write inline Python code (python -c "...") or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.

2. Fetch a web page (static — recommended first choice)

Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.

python scripts/fetch_page.py "URL"

Options:

  • --raw — Output full page Markdown instead of extracted article content
  • --selector "CSS_SELECTOR" — Extract only elements matching the CSS selector (e.g. ".article-body", "table", "#content")
  • --save OUTPUT_PATH — Also save output to a file
  • --max-length N — Truncate output to N characters (default: no limit)

Examples:

# Fetch an article
python fetch_page.py "https://example.com/article"

# Extract only tables
python fetch_page.py "https://example.com/data" --selector "table"

# Fetch raw full-page markdown, limit to 5000 chars
python fetch_page.py "https://example.com" --raw --max-length 5000

3. Search the web

Search using DuckDuckGo (no API key required).

python scripts/search_web.py "search query"

Options:

  • --max-results N — Number of results to return (default: 10)
  • --region REGION — Region code, e.g. cn-zh, us-en, jp-jp (default: wt-wt for worldwide)
  • --news — Search news instead of general web

Examples:

# General search
python search_web.py "Python web scraping best practices 2025"

# News search, Chinese region, 5 results
python search_web.py "AI 最新进展" --news --region cn-zh --max-results 5

4. Crawl a dynamic / JavaScript-rendered page

Use this only when fetch_page.py returns empty or incomplete content (SPA, React/Vue apps, pages that load content via JS).

python scripts/crawl_dynamic.py "URL"

Options:

  • --wait N — Wait N seconds after page load for JS to finish (default: 3)
  • --selector "CSS_SELECTOR" — Wait for a specific element to appear before extracting
  • --scroll — Scroll to bottom of page to trigger lazy loading
  • --save OUTPUT_PATH — Also save output to a file
  • --max-length N — Truncate output to N characters

5. Extract links from a page

Extract all links with their text labels, categorized by type (internal, external, resource).

python scripts/extract_links.py "URL"

Options:

  • --filter PATTERN — Only show links matching a regex pattern (applied to URL)
  • --external-only — Only show external links
  • --json — Output as JSON instead of Markdown

Decision guide: which script to use

  1. Start with fetch_page.py — handles 90% of websites (articles, docs, blogs, wikis).
  2. If fetch_page.py returns empty/garbled content → try crawl_dynamic.py (the page likely needs JavaScript).
  3. Need to find URLs first? → Use search_web.py to discover relevant pages.
  4. Need to navigate a site structure? → Use extract_links.py to map out links, then fetch individual pages.

Common workflows

Research a topic

  1. search_web.py "topic" → get relevant URLs
  2. fetch_page.py "best_url" → read the content
  3. Repeat for multiple sources, then synthesize

Scrape structured data from a page

  1. fetch_page.py "url" --selector "table" → extract tables
  2. Or fetch_page.py "url" --selector ".product-card" → extract specific elements

Crawl a modern web app (SPA)

  1. crawl_dynamic.py "url" --wait 5 --scroll → full JS-rendered content

Edge cases

  • Paywalled sites: May return partial content or login pages. Inform the user.
  • Rate limiting / CAPTCHAs: If requests fail with 403/429, wait and retry or inform the user.
  • Very large pages: Use --max-length to truncate output and avoid overwhelming the context window.
  • Encoding issues: Scripts handle UTF-8 by default. Exotic encodings may need manual adjustment.
  • Robots.txt: These scripts do not check robots.txt. Use responsibly and respect website terms of service.

Scripts

Install via CLI
npx skills add https://github.com/AIDotNet/OpenCowork --skill web-scraper
Repository Details
star Stars 484
call_split Forks 98
navigation Branch main
article Path SKILL.md
More from Creator