web-browsing-routing-and-sites - SKILL.md Agent Skill

name: web-browsing-routing-and-sites description: > Nested web-browsing reference for auto-tier decisions, per-site tier recommendations, known limitations/gotchas, and real-time data endpoints. version: 1.0.0

Web Browsing Routing and Site Reference

Nested web-browsing reference. Open this when the auto-tier extractor misroutes a site, when you know the site class, or when you need real-time data endpoints.

Auto-Tier Decision Tree

The bundled extract_page.py script auto-selects the cheapest viable tier.完整逻辑见 scripts/extract_page.py 的 auto_tier() 函数。

分层规则概要：

URL 特征	分配 Tier
PDF (.pdf 后缀或 /pdf/ 路径)	Tier 0
学术 API (arXiv, CrossRef, DOI, PubMed, etc.)	Tier 1
静态内容站 (Wikipedia, BBC, Stack Overflow, etc.)	Tier 1.5
需结构化提取 (GitHub, Reddit, Nature)	Tier 2
需 JS 渲染/反爬 (Scholar, Springer, Medium, Reuters)	Tier 3
其他 → 默认 Tier 1.5 (trafilatura)	Tier 1.5

Per-Site Tier Recommendations

Site	Recommended tier	Success rate	Notes
arXiv abstract	Tier 1	high	API call via `requests`
arXiv PDF	Tier 0	high	curl -L + fitz
OpenAlex / CrossRef	Tier 1	high	Fully free, most reliable
Unpaywall	Tier 1	high	Finds OA PDF for any DOI
DBLP	Tier 1	high	CS papers, conference proceedings
CORE	Tier 1	high	OA full text (30M+ papers)
Europe PMC	Tier 1	high	Biomedical + PMC full text
Papers With Code	Tier 1	high	ML papers with code
Google Scholar list	Tier 2	medium-high	curl + BS, needs clean IP
Nature.com	Tier 2/3	medium	og meta cheap; full body needs JS
Springer paywalled	Tier 3	low	Needs cookies/session
Medium / Substack	Tier 1.5	high	trafilatura extracts clean text
Reddit	Tier 2	high	`.json` API or old.reddit.com + BS
GitHub	Tier 2	high	API or BS
Wikipedia	Tier 1	high	REST API `/page/summary/{title}`
Hacker News	Tier 1	high	Firebase API, fully free
Google News	Tier 2	high	RSS feed, free, no key
Twitter/X	Tier 3	low	Aggressive bot detection
LinkedIn	Tier 3	low	Requires login + stealth
Any generic article	Tier 1.5	high	trafilatura — your default

Known Limitations & Gotchas

Read this before fighting a site — most of these are paired with the per-site table above.

Major publishers (Wiley / Science / PNAS / Elsevier): almost always return 403; APIs are the only practical route. Use Unpaywall to find OA versions.
Nature.com: do NOT use networkidle with Playwright — it will time out. Use domcontentloaded.
Google Scholar: rapid requests get IP-blocked; pace with time.sleep(2). Better: use SerpAPI/Serper.
Semantic Scholar API: needs free API key for usable rate limits (otherwise 100 req/5min).
PDF links on arXiv: the abstract page does NOT contain a direct PDF link. Derive: /pdf/{ID}.pdf.
Jina Reader: 20 req/min free tier. For heavy use, get an API key.
Reddit: must include a descriptive User-Agent header. Rate limit: ~60 req/min.
Medium paywall: trafilatura often extracts full text even from paywalled articles. If not, try Jina Reader.
DuckDuckGo search: no API key needed but rate-limited. Use responsibly.
CORE API: requires free API key from https://core.ac.uk/services/api for reasonable limits.

Real-Time Data Quick Reference

Source	Method	Free?	Endpoint / Pattern
Google News	RSS	✅	`https://news.google.com/rss/search?q={query}`
Reddit	JSON API	✅	Append `.json` to any URL + User-Agent header
Hacker News	Firebase API	✅	`https://hacker-news.firebaseio.com/v0/topstories.json`
GitHub	REST API	✅*	`https://api.github.com/search/repositories?q={q}&sort=stars`
Stack Exchange	API	✅	`https://api.stackexchange.com/2.3/search?intitle={q}&site=stackoverflow`
Wikipedia	REST API	✅	`https://en.wikipedia.org/api/rest_v1/page/summary/{title}`
Wayback Machine	API	✅	`https://archive.org/wayback/available?url={url}`
Stock data	yfinance	✅	`yf.Ticker("AAPL").history(period="1mo")`
Weather	Open-Meteo	✅	`https://api.open-meteo.com/v1/forecast?...`

→ Deep-dive: reference/realtime-data.md → News/RSS guide: reference/news-and-rss.md → Social media: reference/social-media.md