name: daily-rss-triage
description: Daily RSS scan triage workflow (scan blogs, triage articles, ingest into wiki, commit changes)
category: research
Daily RSS Triage Workflow
End-to-end pipeline for processing daily RSS scans, triaging articles, and ingesting wiki-worthy content.
Prerequisites
- `blogwatcher-db` skill loaded (database queries)
- `semantic-article-grouping` skill loaded (triage criteria)
- Pre-run script has already executed the blogwatcher scan, queried the DB, read the newsletter, and listed existing topics
Workflow
Phase 1: Report Generation
- Parse script JSON output for scan results
- Generate Japanese summary report (scan stats, failed blogs, articles list, Reddit highlights, newsletter info)
- Save to `~/ai-topics/inbox/rss-scans/daily-scan-YYYY-MM-DD.md`
- If `article_total > 0` but ≤3 are AI-relevant, supplement with `web_search` (see Low-Article Day Fallback below)
- If `article_total == 0` AND no newsletter → `[SILENT]`
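The Phase 1 branching can be sketched as a small decision function. This is a minimal sketch, not the pre-run script's actual logic: the function names, the `ai_relevant` count, and the returned labels are hypothetical. Only `article_total`, the ≤3 threshold, `[SILENT]`, and the report path pattern come from the workflow above.

```python
from datetime import date
from pathlib import Path

# Hypothetical labels for the three outcomes described in Phase 1.
SILENT = "[SILENT]"
FALLBACK = "report+web_search"
NORMAL = "report"


def choose_report_mode(article_total: int, ai_relevant: int, newsletter_exists: bool) -> str:
    """Return which Phase 1 branch applies for today's scan."""
    if article_total == 0 and not newsletter_exists:
        return SILENT  # nothing to report; do not write an empty file
    if article_total > 0 and ai_relevant <= 3:
        return FALLBACK  # low-article day: supplement with web_search
    return NORMAL


def report_path(day: date, root: Path = Path.home() / "ai-topics") -> Path:
    """Build the daily report path: inbox/rss-scans/daily-scan-YYYY-MM-DD.md."""
    return root / "inbox" / "rss-scans" / f"daily-scan-{day.isoformat()}.md"
```

Keeping the decision in a pure function makes the silent and fallback cases easy to check without running a scan.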
Phase 1b: Low-Article Day Fallback (web_search supplementation)
When RSS scan yields few AI-relevant articles, the blog scan alone is insufficient. Supplement proactively:
- Web search for recent AI headlines: `AI ML trending news May 2026` or specific domains (model releases, security incidents, robotics, geopolitical AI chip news)
- Cross-reference newsletter triage, which may have processed substantive content that RSS missed
- Score discovered topics with Newsjacking filter — top 5-7 become the report's lead stories
- Save raw articles from official sources (NIST reports, Reuters exclusives, company blogs)
- Create/update wiki pages prioritizing highest wiki-actionability topics
Effective search queries (discovered in production):
- `"CAISI evaluation" DeepSeek V4`: NIST government evaluations
- `US officials weigh cutting deadlines fix digital flaws AI-powered hacking`: CISA/Reuters exclusives
- `exodus Boston Dynamics executives humanoid delivery`: Semafor/Business Insider scoops
- `Richard Dawkins Claude consciousness delusion`: AI culture war debates
- All blog articles list
- Reddit highlights (top 5 per subreddit)
- Newsletter info (if exists)
- Save to `~/ai-topics/inbox/rss-scans/daily-scan-YYYY-MM-DD.md`
- If `article_total == 0` AND `newsletter.exists == false` → respond `[SILENT]` and stop
Phase 1.5: Newsjacking Triage Filter (READER perspective)
Before detailed triage, apply the Newsjacking lens (from Elvis Sun's framework) to identify high-signal articles:
- Trend Surfing: Does the article ride an existing wave? (e.g., Claude Code launch, new model release, viral AI tool)
- Polarizing Promise: Does it make a bold, debatable claim that creates curiosity? ("X is dead", "Everyone is wrong about Y")
- Contrarian Insight: Does it challenge conventional wisdom with data-backed arguments?
- Pattern Interrupt: Is it structurally or topically unusual for its source? (e.g., Karpathy writing about biology, Simon Willison on non-web topics)
- In-Group Signal: Does it use specialized knowledge that creates an "insider" resonance for the target audience (r/LocalLLaMA, AI agent developers)?
Scoring: Assign each article a newsjacking_score (0-5) based on how many criteria it meets.
- Score ≥ 3: Priority triage — flag for immediate wiki ingestion
- Score 1-2: Standard triage — normal evaluation
- Score 0: Low priority — only ingest if highly relevant to core interests
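The scoring rule above (one point per criterion met, then three tiers) can be sketched as follows. The class, field, and tier names are illustrative assumptions. Only the five criteria and the score thresholds come from this section.

```python
from dataclasses import dataclass, fields


@dataclass
class NewsjackingSignals:
    """One boolean per Phase 1.5 criterion (field names are illustrative)."""
    trend_surfing: bool = False
    polarizing_promise: bool = False
    contrarian_insight: bool = False
    pattern_interrupt: bool = False
    in_group_signal: bool = False


def newsjacking_score(signals: NewsjackingSignals) -> int:
    """Count how many of the five criteria are met (0-5)."""
    return sum(bool(getattr(signals, f.name)) for f in fields(signals))


def triage_tier(score: int) -> str:
    """Map a score to the tier described above (tier labels are hypothetical)."""
    if score >= 3:
        return "priority"   # flag for immediate wiki ingestion
    if score >= 1:
        return "standard"   # normal evaluation
    return "low"            # ingest only if highly relevant to core interests
```

For example, an article that rides a current trend and carries an in-group signal scores 2 and lands in the standard tier.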
Phase 2: Triage
For each article, evaluate:
- Already covered? Check the `existing_wiki_topics` list
- Substantive? Not a link dump, not Reddit noise
- Relevant? LLMs, AI agents, coding agents, developer tooling, inference/training, prompt engineering, AI safety, open-source AI
- Newsjacking score? (from Phase 1.5) — higher scores get priority placement
Output triage table:
| ソース | タイトル | NJスコア | アクション | 対象 |
|--------|----------|----------|------------|------|
| simonwillison.net | タイトル | 4/5 | wikiエントリ作成 | entities/simon-willison.md |
| blog.example.com | タイトル | 1/5 | スキップ(既存) | — |
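One way to produce the table with higher newsjacking scores placed first is sketched below. The `TriageRow` type and `render_triage_table` function are hypothetical helpers, not part of any loaded skill. The column headers are copied from the table above.

```python
from dataclasses import dataclass


@dataclass
class TriageRow:
    source: str
    title: str
    nj_score: int  # 0-5, from Phase 1.5
    action: str
    target: str


def render_triage_table(rows: list[TriageRow]) -> str:
    """Render rows as a Markdown table, highest newsjacking score first."""
    lines = [
        "| ソース | タイトル | NJスコア | アクション | 対象 |",
        "|--------|----------|----------|------------|------|",
    ]
    for row in sorted(rows, key=lambda r: r.nj_score, reverse=True):
        lines.append(
            f"| {row.source} | {row.title} | {row.nj_score}/5 | {row.action} | {row.target} |"
        )
    return "\n".join(lines)
```

This sketch does not escape `|` characters inside titles, so a real implementation would need to handle them.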
Phase 3: Wiki Ingestion
For each "wikiエントリ作成" article:
CRITICAL: Check existing entity pages FIRST

    # Check if file exists with ANY name variation
    search_files(pattern="entity-name", path="~/ai-topics/wiki/entities", target="files")

- The `wiki/index.md` may reference entities with different filenames than expected (e.g., `[[entities/gpjt]]` for "Giles Thomas")
- Always verify file existence before creating new pages
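The existence check can also be done with plain Python, which sidesteps filename mismatches by looking at both file names and page contents. This is a minimal sketch under stated assumptions: `find_entity_candidates` is a hypothetical helper, and the caller supplies the name variants to try (short handle, full name, domain-style slug).

```python
from pathlib import Path


def find_entity_candidates(entities_dir: Path, name_variants: list[str]) -> list[Path]:
    """Return entity pages whose filename or body mentions any name variant."""
    needles = [v.lower() for v in name_variants if v.strip()]
    hits = []
    for page in sorted(entities_dir.rglob("*.md")):
        stem = page.stem.lower()
        text = page.read_text(encoding="utf-8", errors="ignore").lower()
        # Match on either the filename or the page content (case-insensitive).
        if any(n in stem or n in text for n in needles):
            hits.append(page)
    return hits
```

A call such as `find_entity_candidates(Path.home() / "ai-topics/wiki/entities", ["Giles Thomas", "gpjt"])` returns the candidate pages to review before deciding between update and create.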
Scrape content:

    web_extract([article_url])

Determine category:
- `entities/`: people, companies, blogs, tools
- `concepts/`: techniques, patterns, ideas
- `comparisons/`: head-to-head analyses
- `queries/`: research questions
Create or update page:
- If updating: read the existing file, append new content under the appropriate section, and update the `updated:` frontmatter field
- If creating: follow the existing entity page format (frontmatter + overview + core ideas + related + sources)
Update index and log:
- `wiki/index.md`: add/update the entity reference (match the filename convention used in the index)
- `wiki/log.md`: add a dated entry with a summary of changes
Commit and push:

    cd ~/ai-topics && git add wiki/ inbox/rss-scans/ && git status
    # CRITICAL: Check for pre-staged files from previous runs
    git diff --staged --stat
    git commit -m "wiki: daily scan YYYY-MM-DD — [summary]" && git push
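The pre-staged file check can be automated before committing. The sketch below is an assumption about how one might do it, not part of the existing workflow: `staged_paths` and `unexpected_staged` are hypothetical helpers, and the allowed prefixes mirror the two directories added in the commit command above.

```python
import subprocess

# Directories this workflow is expected to touch (taken from the git add command).
ALLOWED_PREFIXES = ("wiki/", "inbox/rss-scans/")


def staged_paths(repo: str) -> list[str]:
    """List paths currently staged in the repository at `repo`."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--staged", "--name-only"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def unexpected_staged(paths: list[str], allowed: tuple[str, ...] = ALLOWED_PREFIXES) -> list[str]:
    """Return staged paths outside the directories this workflow should modify."""
    return [p for p in paths if not p.startswith(allowed)]
```

If `unexpected_staged(staged_paths(repo))` is non-empty, something was staged by an earlier run or session and deserves a look before the commit goes out.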
Key Pitfalls
- Index stub detection: `search_files("firstname.*lastname")` on the index may miss existing stubs if the index entry uses a different format (e.g., `**Role** | Professor Emeritus` instead of the person's name). Always also check `search_files(target="content")` for the person's name in the entities directory before creating a new entity page. Stubs created by `build_x_wiki.py` or `build_blog_wiki.py` may exist even when `search_files` returns nothing.
- Index filename mismatches: `wiki/index.md` may use short handles (`gpjt`) while you'd expect full names (`gilesthomas-com`). Always check the index first.
- Pre-staged files: Previous cron runs or sessions may have already staged files. Use `git diff --staged` before committing to understand what's changed.
- Duplicate entity creation: Always search for existing entity files before creating new ones. The same person/company may already have a page under a different name.
- No content to report: If `article_total == 0` AND no newsletter exists, respond `[SILENT]`; don't generate empty reports.
- Category field is JSON: In the blogwatcher DB, `categories` is a JSON array. Use `LIKE '%"tag"%'` for SQL filtering or `json.loads()` in Python.
- Published vs discovered dates: Use `discovered_date` for "when blogwatcher found it", `published_date` for "when article was published" (can be NULL).
- `search_files` is unreliable for wiki directory discovery: It returns `{"total_count": 0}` for `~/ai-topics/wiki/**/*.md` patterns. Use `execute_code` with Python `os.walk()` or `pathlib` for directory traversal and file existence checks instead.
- RSS 429 rate limits: `r/LocalLLM` and `r/LocalLLaMA` frequently hit HTTP 429. Log failures but do NOT retry immediately; wait for the next scan cycle to avoid exacerbating rate limits.
- Substack redirect URLs: Newsletter articles use tracking-heavy Substack redirect URLs (e.g., `substack.com/redirect/UUID`). `web_extract` handles these natively; pass the full redirect URL and do not strip tracking parameters.
- Batch file creation before git commit: When creating multiple wiki pages (6+), create all files first using `execute_code` with Python `open()`/`write()`, then do a single `git add wiki/ && git commit && git push`. Multiple small commits are fine for updates to existing files, but batch new file creation.
- Context window management: When running as a cron job with many articles (90+), tool outputs may fill the context window. Use the `[Old tool output cleared to save context space]` pattern mid-run and reconstruct state via targeted `read_file` with `offset` + `execute_code` directory checks.
- Entity update vs create decision: For established entity files (e.g., `antirez-com.md`, `pluralistic-net.md`), append new sections under existing headers rather than rewriting. Preserve historical continuity and frontmatter integrity. For new entities, follow the existing frontmatter format with `title`, `created`, `updated`, `tags`, `related`.
- Reddit URL extraction failures: `web_extract` consistently fails on Reddit URLs with "Content was inaccessible or not found". Reddit uses Cloudflare protection and dynamic content loading that defeats simple HTTP extraction. For Reddit articles, skip scraping and only record the URL/title in triage. If content is needed, use `browser_navigate` + `browser_snapshot` as a fallback (higher resource cost).
- Git rebase in headless cron environment: When running as a cron job, `wiki/log.md` can be modified concurrently (e.g., by another scheduled run or external process), causing git push rejections that require `git pull --rebase`. In headless environments with no `EDITOR` set, `git rebase --continue` hangs. Always use `GIT_EDITOR=true git rebase --continue` to bypass interactive editor prompts. If conflicts occur, note that the sides are swapped during a rebase: `git checkout --ours <file>` accepts the remote (upstream) version, while `git checkout --theirs <file>` keeps the local commit being replayed. Stage the resolved file with `git add <file>`, then continue.
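The two ways of filtering on the JSON `categories` field can be compared in a small in-memory example. This is a sketch under stated assumptions: the `articles` table and its `title` column are hypothetical stand-ins for the blogwatcher schema, and the rows are made up for illustration. Only the `categories` column name and the two filtering techniques come from the pitfall above.

```python
import json
import sqlite3

# Hypothetical schema and sample rows; the real blogwatcher tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, categories TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("A", json.dumps(["ai", "agents"])),
        ("B", json.dumps(["security"])),
        ("C", json.dumps(["ai-safety"])),
    ],
)


def titles_with_tag_sql(conn: sqlite3.Connection, tag: str) -> list[str]:
    """Filter in SQL: match the tag including its surrounding JSON quotes."""
    pattern = f'%"{tag}"%'
    rows = conn.execute(
        "SELECT title FROM articles WHERE categories LIKE ? ORDER BY title",
        (pattern,),
    )
    return [r[0] for r in rows]


def titles_with_tag_python(conn: sqlite3.Connection, tag: str) -> list[str]:
    """Filter in Python: decode the JSON array and test exact membership."""
    rows = conn.execute("SELECT title, categories FROM articles ORDER BY title")
    return [title for title, cats in rows if tag in json.loads(cats or "[]")]
```

Including the quotes in the `LIKE` pattern keeps `ai` from matching `ai-safety`. Tags containing `%` or `_` would still act as wildcards in the SQL variant, so the `json.loads()` route is the safer choice for arbitrary tags.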
Output Language
All reports, triage tables, and wiki content should be in Japanese unless the source material is explicitly English-only and the user has not requested Japanese output.