name: active-crawl-wiki description: Scheduled autonomous wiki crawling to discover and document trending AI/ML developments version: 1.2.0 author: Hermes Agent license: MIT metadata: hermes: tags: [wiki, active-crawl, research, trending, ai-ml] category: research related_skills: [llm-wiki, web-extract, blogwatcher]
Active Crawl for AI/LLM Knowledge Wiki
Systematic approach for scheduled autonomous wiki crawling to discover and document trending AI/ML developments.
Trigger Conditions
- Scheduled cron job with no user-provided sources
- Wiki needs proactive gap discovery
- Emerging trends need documentation before they're widely covered
- Recurring "trending topics" report generation (e.g., daily or weekly)
Process
1. Trend Discovery (Hybrid Scanning)
Two complementary approaches — use BOTH for richest coverage:
A) Blogwatcher DB Scan
import sqlite3
# Discover articles in last 2-3 days
rows = conn.execute("""
SELECT b.name, a.title, a.url, a.published_date
FROM articles a JOIN blogs b ON a.blog_id = b.id
WHERE DATE(a.discovered_date) >= date('now', '-2 days')
ORDER BY a.discovered_date DESC
""").fetchall()
- Group by blog to spot clusters (same blog = same thought leader's hot topic)
- Check log.md to see if blog-ingest triage already processed today's batch — avoid re-analyzing
- If blogwatcher DB returns 0 results for today, the scan may not have run yet — use web_search as the primary source instead
- Reference:
references/blogwatcher-db-queries.mdfor schema, verified column names, and environment notes
B) Web Search (Broad Context)
- Run 3-5 targeted web_search queries: model releases, AI agent news, platform launches, research breakthroughs, Big Tech AI moves
- Use queries that triangulate from different angles:
"AI ML trending topics <date>"— broad discovery"latest AI agent developments news"— agent ecosystem"LLM model release announcement <month>"— model tracking
- Dedup results by topic (multiple queries may surface the same event)
- Supplement truncated web_search results with follow-up queries for specific topics
⚠️ Fallback: web_search unavailable — When web_search fails (Parallel SDK permission error, provider outage, rate limits), use a browser-based multi-source fallback instead. This avoids producing "[SILENT]" reports from missing data:
- Hacker News front page —
browser_navigate("https://news.ycombinator.com/")+ scan the top 30 stories for AI/ML keywords (model, agent, LLM, AI, coding, inference, NVIDIA, OpenAI, Anthropic, etc.) - HN Algolia (filtered AI search, last 24h) —
browser_navigate("https://hn.algolia.com/?query=ai&sort=byDate&dateRange=last24h&prefix=false")— returns 100-200 results; filter for high-point stories (9+ points is a strong relevance signal for a 24h window) - Blogwatcher SQLite direct — Avoid
blogwatcher-cli scan(which may fail on the wrong DB path); query SQLite directly at the canonical DB path (/opt/data/.blogwatcher/blogwatcher.db):import sqlite3 conn = sqlite3.connect('/opt/data/.blogwatcher/blogwatcher.db') rows = conn.execute(''' SELECT a.title, a.url, a.published_date, b.name FROM articles a JOIN blogs b ON a.blog_id = b.id WHERE a.is_read = 0 AND a.published_date >= datetime('now', '-1 day') ORDER BY a.published_date DESC LIMIT 30 ''').fetchall() - X/Twitter via xurl —
xurl search "AI|agent|LLM|model" -n 15to sample the trending conversation. Filter out spam/low-quality results (many results are promotional or non-English). High bookmark/impression counts signal substantive threads. - Google News (browser, limited) —
browser_navigate("https://news.google.com/search?q=AI+artificial+intelligence+news&hl=en-US"). Note: Google may trigger CAPTCHA from headless browsers — treat as a best-effort source, not guaranteed. If CAPTCHA blocks, skip. - Wiki log.md scanning — Read the last 50-80 lines of
log.mdfor pipeline activity that auto-ingested articles today. The blog-ingest, newsletter-wiki-ingest, and active-crawl pipelines all write structured entries with source URLs and topics.
Fallback scoring: Each source gets equal weight. A topic appearing in 3+ sources is a confirmed trend. A topic in 1-2 sources is a candidate requiring verification.
C) Cross-Reference Against Wiki (Coverage Gap Analysis)
import os, glob
wiki = os.path.expanduser("~/wiki")
for slug, desc in topics.items():
found = glob.glob(os.path.join(wiki, "concepts", f"*{slug}*"))
+ glob.glob(os.path.join(wiki, "entities", f"*{slug}*"))
# Also check log.md for recent mentions
- Score each candidate topic: FULL gap (no page, no log mention) → highest priority; PARTIAL gap (page exists but missing recent info) → enrich; COVERED (fully documented) → skip
- Use
os.walk()orglobnotsearch_files— the tool can return false negatives on files that exist
1B. Trending Topics Report Workstream (Optional Variant)
When the goal is to produce a summary report of what's trending (not to create wiki pages immediately), use this dual-output variant instead of the standard create-pages flow:
Priority Scoring
Score each candidate topic 1-5 stars:
- ★★★★★ = New concept/page needed, major development, not in wiki at all
- ★★★★☆ = Existing page needs significant update, important new information
- ★★★☆☆ = Minor update worth noting, covered by entity page but new data exists
- ★★☆☆☆ = Mention only, low wiki value
- ★☆☆☆☆ = Skip entirely
Topic Tiers for Report Organization
Organize findings into visual priority tiers in the Japanese report:
- 🥇 Tier 1 (2-3 topics): ★★★★★ gaps — newest, most impactful, not in wiki
- 🥈 Tier 2 (2-3 topics): ★★★★☆ — important but partially covered or incremental
- 🥉 Tier 3 (2-3 topics): ★★★☆☆ — interesting but lower urgency
Dual Output
Output A — Human-readable Japanese report (auto-delivered via cron):
- Title:
# 🚀 トレンドトピックレポート — YYYY-MM-DD - Each topic: emoji tier marker + bold title — 1-2 paragraph summary in Japanese
- Include concrete details: model names, paper URLs, company names, $ amounts
- End with:
📋 ウィクション推奨アクションサマリーtable (priority, action type, target) - Include scan stats: RSS articles found today, how many already processed, newsletters pending
Output B — Structured checkpoint JSON (for downstream wiki-ingest pipelines):
{
"checkpoint_run_id": "YYYYMMDDTHHMMSSZ",
"type": "trending-topics-daily",
"summary_ja": "2-3 sentence Japanese summary of all topics",
"topics": [
{
"priority": 1,
"title": "Topic Title",
"url": "https://...",
"wiki_gap": "full|partial|covered",
"recommended_action": "create|enrich|skip + path suggestion",
"reason_ja": "★N 日本語理由"
}
]
}
Save to: ${HERMES_HOME}/cron/data/trending/topics_latest.json
1C. Save Raw Article
Before moving to page creation, save a comprehensive research note:
~/wiki/raw/articles/YYYY-MM-DD_trending-topics-research.md
- Include full YAML frontmatter with all source URLs
- One section per topic with key facts, source links, and preliminary wiki gap assessment
- This serves as the source material for any subsequent wiki page creation
2. Source Extraction
- Use
web_extracton official sources only - Skip aggregators and second-hand summaries
- Handle paywalled content: cross-reference with alternative sources, mark uncertain claims with qualifiers
3. Save Raw Articles First
# Naming convention: YYYY-MM-DD_source-topic.md
~/wiki/raw/articles/2026-05-06_cloudflare-llm-infrastructure.md
4. Create Wiki Pages
Each page must include:
- YAML frontmatter (title, created, updated, type, tags, sources)
- 4+ cross-references via [[wikilinks]]
- Tags from SCHEMA.md taxonomy
- Clear, scannable structure (headings, bullet points)
5. Update Navigation
- Large index.md (>100KB): Use programmatic Python approach
- Small index.md (<100KB): Use
patchwith exact anchor lines - Update header counts (Total pages: N) — verify against filesystem
- Append to log.md with batch entry
6. Quality Verification
- Verify all new files exist
- Verify cross-references point to real pages
- Verify tag taxonomy compliance
7. Git Commit
cd ~/ai-topics && git add wiki/ && git commit -m "wiki: active crawl — <count> new pages" && git push
Pitfalls
- Never modify raw/ files — sources are immutable
- Don't create pages for passing mentions — follow Page Thresholds
- Large files need programmatic approach — read_file/patch fail silently on 200KB+ files
- Always verify after operations — check files exist before claiming success
- Paywalled content needs qualifiers — mark uncertain claims appropriately
- Subagent iteration budgets are real — max_iterations=20 allows ~5-7 page creations
- Blogwatcher DB may have 0 results for today if the scan hasn't run yet — fall back to web_search as primary source; don't report "[SILENT]" based on empty DB alone
- Web_search queries may return overlapping results — always dedup by topic across multiple queries; a single event (e.g., "OpenAI Daybreak") can appear in 3+ searches
- Checkpoint JSON path must be consistent — save to
${HERMES_HOME}/cron/data/trending/topics_latest.jsonso downstream wiki-ingest pipelines can find it - Cross-referencing against wiki needs glob not search_files —
search_files(target=files)returns false negatives for files that exist on disk; useos.walk()orglob.glob()instead - Raw article for trending topics is a research note, not a source scrape — it aggregates findings across multiple URLs with analysis, not a single extracted page. Use YAML frontmatter with a
sources:list of all URLs referenced - execute_code is blocked in cron mode — when running as a cron job,
execute_codeis denied for arbitrary Python (subprocess risk). Instead, write Python scripts to/tmp/withwrite_file, then run them viaterminal python3 /tmp/script.py. This avoids the arbitrary-code-execution restriction while enabling multi-step data processing. - skill_view may fail with duplicate skills — when skills exist in both
~/.hermes/skills/and~/ai-topics/config/hermes/skills/,skill_viewrefuses with "Ambiguous skill name". Fall back toread_fileon the absolute path of the SKILL.md file. - sqlite3 CLI may not be available — the
sqlite3command is not universally installed (returnscommand not found). Usepython3 -c "import sqlite3"instead, as Python's built-in sqlite3 module is always available. - Blogwatcher URLs may be inaccurate — the
urlfield in the blogwatcher DB can contain incorrect URL slugs (e.g., "with-open-source-models" vs actual "without-regrets"). If a URL from blogwatcher returns 404, check the blog's RSS feed (/blog/rss.xmlor/rss.xml) to find the correct URL. The RSS<link>is authoritative. - log.md patch operations can create section duplication — when patching
log.mdnear section boundaries, theold_stringmay match in unexpected places and produce duplicate section headers. Always verify withgrep -con the section header after patching. If duplicates appear, re-patch to fix. - Subagents may update index.md and log.md inconsistently — when using
delegate_taskfor multiple page creations, some subagents update index.md/log.md while others don't. After all subagents complete, verify every new page appears in both index.md and log.md usinggrep -c. Fill in any gaps manually. - Google News CAPTCHA from headless browsers — Google search pages frequently trigger "Why did this happen?" CAPTCHA checks from headless Chromium/browser-based sessions. This is environment-specific (IP reputation, stealth level) and not reliably fixable mid-session. When Google blocks, fall back to HN Algolia and blogwatcher SQLite — these are more reliable for AI/tech trending topics anyway.
xurl searchresult quality varies by query — Broad queries like "AI agent coding llm" return many low-quality/spam results (promotional tweets, non-English content). Prefer targeted queries: format-specific ("agent sandboxing","model release"), handle-specific (from:simonw,from:kareem_carr), or use the raw v2/2/tweets/search/recentendpoint with better filters. Filter results by engagement metrics (bookmark_count > 50 is a strong signal for substantive threads).- Blogwatcher CLI may silently use wrong DB path —
blogwatcher-cli scanwithout explicit DB path may report "No blogs tracked yet" even when the database exists at a non-default location. The canonical DB is at/opt/data/.blogwatcher/blogwatcher.db. Always verify with a direct SQLite query first. SetBLOGWATCHER_DB=/opt/data/.blogwatcher/blogwatcher.dbor use direct SQLite as shown in the Fallback workflow above.