name: blogwatcher description: Monitor blogs and RSS/Atom feeds for updates using the blogwatcher-cli tool. Add blogs, scan for new articles, track read status, and filter by category. version: 2.2.0 author: JulienTant (fork of Hyaxia/blogwatcher) license: MIT metadata: hermes: tags: [RSS, Blogs, Feed-Reader, Monitoring] homepage: https://github.com/JulienTant/blogwatcher-cli prerequisites:
commands: [blogwatcher-cli]
Blogwatcher
Track blog and RSS/Atom feed updates with the blogwatcher-cli tool. Supports automatic feed discovery, HTML scraping fallback, OPML import, and read/unread article management.
Installation
Pick one method:
- Go:
go install github.com/JulienTant/blogwatcher-cli/cmd/blogwatcher-cli@latest - Docker:
docker run --rm -v blogwatcher-cli:/data ghcr.io/julientant/blogwatcher-cli - Binary (Linux amd64):
curl -sL https://github.com/JulienTant/blogwatcher-cli/releases/latest/download/blogwatcher-cli_linux_amd64.tar.gz | tar xz -C ~/bin blogwatcher-cli(use~/bin/if/usr/local/bin/has permission issues) - Binary (Linux arm64):
curl -sL https://github.com/JulienTant/blogwatcher-cli/releases/latest/download/blogwatcher-cli_linux_arm64.tar.gz | tar xz -C /usr/local/bin blogwatcher-cli - Binary (macOS Apple Silicon):
curl -sL https://github.com/JulienTant/blogwatcher-cli/releases/latest/download/blogwatcher-cli_darwin_arm64.tar.gz | tar xz -C /usr/local/bin blogwatcher-cli - Binary (macOS Intel):
curl -sL https://github.com/JulienTant/blogwatcher-cli/releases/latest/download/blogwatcher-cli_darwin_amd64.tar.gz | tar xz -C /usr/local/bin blogwatcher-cli
All releases: https://github.com/JulienTant/blogwatcher-cli/releases
Docker with persistent storage
By default the database lives at ~/.blogwatcher-cli/blogwatcher-cli.db. In Docker this is lost on container restart. Use BLOGWATCHER_DB or a volume mount to persist it:
# Named volume (simplest)
docker run --rm -v blogwatcher-cli:/data -e BLOGWATCHER_DB=/data/blogwatcher-cli.db ghcr.io/julientant/blogwatcher-cli scan
# Host bind mount
docker run --rm -v /path/on/host:/data -e BLOGWATCHER_DB=/data/blogwatcher-cli.db ghcr.io/julientant/blogwatcher-cli scan
Migrating from the original blogwatcher
If upgrading from Hyaxia/blogwatcher, move your database:
mv ~/.blogwatcher/blogwatcher.db ~/.blogwatcher-cli/blogwatcher-cli.db
The binary name changed from blogwatcher to blogwatcher-cli.
Common Commands
Managing blogs
- Add a blog:
blogwatcher-cli add "My Blog" https://example.com - Add with explicit feed:
blogwatcher-cli add "My Blog" https://example.com --feed-url https://example.com/feed.xml - Add with HTML scraping:
blogwatcher-cli add "My Blog" https://example.com --scrape-selector "article h2 a" - List tracked blogs:
blogwatcher-cli blogs - Remove a blog:
blogwatcher-cli remove "My Blog" --yes - Import from OPML:
blogwatcher-cli import subscriptions.opml
Scanning and reading
- Scan all blogs:
blogwatcher-cli scan - Scan one blog:
blogwatcher-cli scan "My Blog" - List unread articles:
blogwatcher-cli articles - List all articles:
blogwatcher-cli articles --all - Filter by blog:
blogwatcher-cli articles --blog "My Blog" - Filter by category:
blogwatcher-cli articles --category "Engineering" - Mark article read:
blogwatcher-cli read 1 - Mark article unread:
blogwatcher-cli unread 1 - Mark all read:
blogwatcher-cli read-all - Mark all read for a blog:
blogwatcher-cli read-all --blog "My Blog" --yes
Environment Variables
All flags can be set via environment variables with the BLOGWATCHER_ prefix:
| Variable | Description |
|---|---|
BLOGWATCHER_DB |
Path to SQLite database file |
BLOGWATCHER_WORKERS |
Number of concurrent scan workers (default: 8) |
BLOGWATCHER_SILENT |
Only output "scan done" when scanning |
BLOGWATCHER_YES |
Skip confirmation prompts |
BLOGWATCHER_CATEGORY |
Default filter for articles by category |
Example Output
$ blogwatcher-cli blogs
Tracked blogs (1):
xkcd
URL: https://xkcd.com
Feed: https://xkcd.com/atom.xml
Last scanned: 2026-04-03 10:30
$ blogwatcher-cli scan
Scanning 1 blog(s)...
xkcd
Source: RSS | Found: 4 | New: 4
Found 4 new article(s) total!
$ blogwatcher-cli articles
Unread articles (2):
[1] [new] Barrel - Part 13
Blog: xkcd
URL: https://xkcd.com/3095/
Published: 2026-04-02
Categories: Comics, Science
[2] [new] Volcano Fact
Blog: xkcd
URL: https://xkcd.com/3094/
Published: 2026-04-01
Categories: Comics
Reddit RSS Feeds
Reddit subreddits expose RSS feeds at: https://www.reddit.com/r/SUBREDDIT/.rss
These work like any other blog feed — add with blogwatcher-cli add "r/Name" "https://www.reddit.com/r/Name/.rss"
YouTube Channel Monitoring
YouTube channels expose standard Atom/RSS feeds at https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID. These work with blogwatcher-cli like any other feed — each new video becomes a discoverable article with its description as content. The daily blog-ingest cron job picks up new videos automatically.
For deep transcript ingestion of individual videos, use the youtube-content skill workflow (yt-dlp → transcript → structured raw article → entity/concept enrichment).
Adding a YouTube Channel to Monitoring
When the user asks to add a YouTube channel to monitoring:
Get the channel ID via yt-dlp:
/opt/data/bin/yt-dlp --print channel_id --playlist-end 1 "https://www.youtube.com/@HANDLE"Test the RSS feed:
curl -s "https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID" | head -20Add to OPML in the "YouTube Channels" section (create the section if it doesn't exist):
<outline text="YouTube Channels" title="YouTube Channels"> <outline type="rss" text="Channel Name" title="Channel Name" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID" htmlUrl="https://www.youtube.com/@HANDLE" /> </outline>Insert before
</body>inconfig/feeds/blogs.opml. Validate:python3 -c "import xml.etree.ElementTree as ET; ET.parse('config/feeds/blogs.opml'); print('OK')"Add to blogwatcher DB with explicit
--feed-url(auto-discovery won't work on YouTube's JS-heavy pages):blogwatcher-cli add "Channel Name" "https://www.youtube.com/@HANDLE" --feed-url "https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID"Verify scan works:
blogwatcher-cli scan "Channel Name" # Expected: Source: RSS | Found: N | New: NMark initial backlog as read (prevents re-processing hundreds of historical videos):
blogwatcher-cli read-all --blog "Channel Name" --yesCreate wiki entity page for the channel under
wiki/entities/<channel-name>.md:- Frontmatter:
type: entity,tags: [youtube, …] - Include: subscriber count, video count, content themes, notable talks table, RSS feed URL
- Link to related entities (speakers) and concepts
- Add to
wiki/index.md(alphabetical order) andwiki/log.md
- Frontmatter:
Commit config + wiki:
cd ~/ai-topics && git add config/feeds/blogs.opml wiki/entities/ wiki/index.md wiki/log.md git commit -m "config: add Channel Name YouTube channel to RSS monitoring + entity page" git push
Pitfalls
- YouTube RSS returns only the 15 most recent videos. Historical content beyond 15 videos is not captured on initial scan — this is acceptable since the goal is ongoing monitoring.
- Channel ID is NOT the @handle. Always use
yt-dlp --print channel_idto resolve. Example:@aidotengineer→UCLKPca3kwwd-B59HNr-_lvA. - RSS entries are video descriptions only — not full transcripts. For transcript-level analysis, use the
youtube-contentskill on individual videos. - High-volume channels (3+ videos/day) will produce many raw articles in the daily blog-ingest pipeline. The
blog-triagestep filters by importance before wiki ingestion. - OPML section naming: Use
<outline text="YouTube Channels">as a top-level group, not mixed into "Individual Blogs" or "Company Tech Blogs". This keeps the OPML organized by source type.
Environment-Specific Notes (exe.dev VM)
- Binary installed at:
/opt/data/bin/blogwatcher-cli(Hyaxia/blogwatcher v0.0.2, NOT JulienTant fork) - OPML source:
~/ai-topics/config/feeds/blogs.opml(~87 blogs as of 2026-05-08)- OPML is structured in three groups:
<outline text="Individual Blogs">(84 personal blogs),<outline text="Company Tech Blogs">(RSS + sitemap-based), and<outline text="YouTube Channels">(YouTube RSS feeds) - Corporate blogs use
type="rss"for RSS-based ortype="sitemap"for sitemap-based (e.g., Anthropic Engineering)
- OPML is structured in three groups:
- Database:
/opt/data/.blogwatcher/blogwatcher.db(NOT~/.blogwatcher-cli/— uses the original Hyaxia/blogwatcher path) - ⚠️ Cron HOME mismatch: In cron,
HOME=/opt/data/.hermes/homebut DB is at/opt/data/.blogwatcher/. Thedaily_inbox_collect.pyscript must usePROFILE_ROOTinstead ofPath.home()for DB path resolution, andrun_blogwatcher_scan()must setHOME=str(PROFILE_ROOT)in the subprocess env. Seeblogwatcher-dbskill'sreferences/cron-home-fix.mdfor the exact patch. - To query the DB directly:
python3 -c "import sqlite3; conn = sqlite3.connect('/opt/data/.blogwatcher/blogwatcher.db'); ..." - Cron jobs:
blog-ingest(Job ID:1bf4c6492c1e) runs at0 7 * * *UTC = JST 16:00 — RSS blog scan via blogwatchersitemap-monitor(Job ID:c7890b6e0e19) runs at0 6 * * *UTC = JST 15:00 — sitemap-based monitoring for corporate blogs without RSS
- First scan detected ~2,045 articles across 87 feeds
YouTube Channel Monitoring
YouTube channels expose RSS feeds at a predictable URL pattern and can be monitored via blogwatcher just like blogs. Use this workflow when the user asks to add a YouTube channel to monitoring.
YouTube RSS URL Pattern
https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID
To find the channel ID:
/opt/data/bin/yt-dlp --print channel_id --playlist-end 1 "https://www.youtube.com/@channelhandle" 2>/dev/null
Add to OPML
Add a new "YouTube Channels" section to config/feeds/blogs.opml before </body>:
<outline text="YouTube Channels" title="YouTube Channels">
<outline type="rss" text="Channel Name" title="Channel Name" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID" htmlUrl="https://www.youtube.com/@channelhandle" />
</outline>
Add to blogwatcher DB
blogwatcher-cli add "Channel Name" "https://www.youtube.com/@channelhandle" \
--feed-url "https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID"
Verify and mark backlog
blogwatcher-cli scan "Channel Name" # Verify scan works
blogwatcher-cli read-all --blog "Channel Name" --yes # Mark initial backlog as read
Create wiki entity page
Create an entity page under wiki/entities/ with channel overview, key facts, content themes, notable talks, and monitoring info. The RSS feed URL and blogwatcher pipeline details should be documented in the Monitoring section.
Pitfalls
- YouTube RSS returns video descriptions, NOT transcripts. The blog-ingest pipeline will capture video titles and descriptions for daily awareness. For full transcript ingestion of high-signal talks, use the [[youtube-content]] workflow on individual videos.
- High-volume channels: AI Engineer posts 3-5 videos/day during conference seasons. The initial scan will find 15+ entries — always mark as read to avoid re-processing.
- Channel ID is stable: Unlike channel handles which can change, channel IDs are permanent. Prefer the ID-based RSS URL over handle-based alternatives.
Check-and-Add Single Feed (One-Off Request)
When the user asks "Xのブログが監視対象に含まれているか確認し、なければ追加して" (check if X blog is monitored, add if not):
Workflow
Check OPML:
grep -i "keyword" /opt/data/ai-topics/config/feeds/blogs.opml- Case-insensitive search. Try both blog name and domain.
- If found → report to user: "すでに含まれています ✅" and done.
Discover RSS feed: If not in OPML, find the feed URL.
- Priority order: check HTML
<link>tags → try root-level paths → try blog-subdirectory paths - Root-level paths FIRST (corporate blogs often put RSS at domain root):
curl -sI "https://example.com/rss.xml" | head -1 # ← TRY FIRST curl -sI "https://example.com/feed.xml" | head -1 curl -sI "https://example.com/atom.xml" | head -1 curl -sI "https://example.com/index.xml" | head -1 - Then blog subdirectory:
curl -sI "https://example.com/blog/rss.xml" | head -1 curl -sI "https://example.com/blog/feed.xml" | head -1 curl -sI "https://example.com/feed" | head -1 - Fall back to
references/feed-discovery-patterns.mdfor platform-specific patterns, orreferences/corporate-blog-tracking.mdfor corporate blog-specific approaches (sitemap, scrape-selector). - Check sitemap as an alternative discovery method for blogs without RSS:
curl -s "https://example.com/sitemap.xml" | grep -i "blog\|engineering"
- Priority order: check HTML
Verify feed content:
curl -s "FEED_URL" | head -30- Check for valid RSS/Atom XML (
<rss,<feed,<channel>,<item>elements) - If 404 or non-XML response → try different URL, or fall back to
--scrape-selector
- Check for valid RSS/Atom XML (
Insert into OPML: Add an
<outline>entry in the correct group.- Individual blog →
<outline text="Individual Blogs">group (lines ~7-109) - Company/org blog →
<outline text="Company Tech Blogs">group (lines ~110-135) - For company blogs with RSS: add at end of group with
type="rss" - For company blogs without RSS (sitemap-based): add with
type="sitemap"andxmlUrlpointing to the sitemap URL - Format RSS:
<outline type="rss" text="Blog Name" title="Blog Name" xmlUrl="FEED_URL" htmlUrl="SITE_URL" /> - Format sitemap:
<outline type="sitemap" text="Blog Name" title="Blog Name" xmlUrl="SITEMAP_URL" htmlUrl="SITE_URL" /> - Verify with:
grep -n "Blog Name" blogs.opml
- Individual blog →
Add to blogwatcher DB AND scan + cleanup:
# Add (RSS): blogwatcher-cli add "Blog Name" "https://example.com/blog" --feed-url "FEED_URL" # OR Add (scrape): blogwatcher-cli add "Blog Name" "https://example.com/blog" --scrape-selector "article a" # Verify scan works: blogwatcher-cli scan "Blog Name" # Mark initial backlog as read (FOUND: N / NEW: N → all historical): blogwatcher-cli read-all --blog "Blog Name" --yesCommit config change:
cd ~/ai-topics && git add config/feeds/blogs.opml && git commit -m "config: add Blog Name RSS feed" && git push(If author has existing wiki entity page) — Enrich the entity page:
- Check:
search_files path=~/wiki/entities pattern="author-name"— if found, enrich the existing page - Blog URL in infobox: Update stale blog URLs (e.g., deleted WordPress → active Medium)
- Add Monitoring section at an appropriate point in the page structure:
## Monitoring | Source | Method | Status | |--------|--------|--------| | RSS Feed | [feed-url](FEED_URL) | ✅ Active (added YYYY-MM-DD; RSS → blogwatcher → daily `blog-ingest` pipeline) | | X/Twitter | [@handle](https://x.com/handle) | Tracked in `x-accounts.yaml` (if applicable) | - Recent notable work: Scan the RSS feed output (
curl -s "FEED_URL" | head -30) for significant recent posts. If something substantial is found (major technical post, research result, framework release), add a relevant section to the entity page with key details. Don't add trivial posts. - Update frontmatter: Bump
updated:date, add new sources, add any new tags - Update Related/Sources sections: Add cross-links to new concepts, append new source URLs
- New tags → SCHEMA.md: If a genuinely new tag is added to frontmatter, add it to SCHEMA.md taxonomy FIRST
- Update index.md: Enrich the entity's index entry with new monitoring/blog info and notable work
- Update log.md: Prepend log entry (use
execute_code+ Pythonwith open()prepend to avoid---anchor trap — see wiki-graph-health skill §H5) - Commit everything together:
cd ~/ai-topics && git add config/feeds/ wiki/ && git commit -m "config+wiki: add Blog Name RSS + entity enrichment" && git push
- Check:
Pitfalls
- OPML is NOT the blogwatcher DB: Adding to OPML does NOT add to the blogwatcher SQLite database. The next full scan will auto-discover it.
- Blogwatcher-cli does NOT follow HTTP redirects: If the feed URL redirects, it will fail at scan time. Verify the final destination URL.
- Webflow sites: Often serve RSS at
/blog/rss.xmlrather than root-level paths. The HTML<head>may not include a<link type="application/rss+xml">tag even though the feed exists. - Corporate blogs: RSS at root, not under /blog/: Sierra (
sierra.ai/rss.xml) and OpenAI (developers.openai.com/rss.xml) both serve RSS from the domain root. Always check root-level paths (/rss.xml,/feed.xml) BEFORE checking blog subdirectories. - Engineering blogs on separate subdomains: Some companies host their engineering blog on a different subdomain from their marketing blog (e.g., Ramp's engineering blog is at
builders.ramp.com, notramp.com/blog). Always check the company's job postings, GitHub org, or search"company engineering blog"to discover these before concluding no blog exists. - Scrape-selector blogs: skip OPML: If a blog has no RSS feed and uses
--scrape-selector, do NOT add it to the OPML. OPML requiresxmlUrl. Track it only in the blogwatcher DB. Consider migrating to sitemap-based monitoring (see Sitemap-Based Monitoring section) for greater reliability. - Always mark initial backlog as read: After adding a new blog, run
read-all --blog "Name" --yesto prevent the daily scan from re-processing hundreds of historical articles. - Migrate scrape-selector → sitemap when possible: HTML scraping is fragile. If the site has a sitemap with blog URLs and
lastmoddates, migrate tositemap_monitor.py. Remove from blogwatcher DB, add to sitemap_monitor SOURCES list, and add to OPML Company group withtype="sitemap". - Commit only
config/feeds/: Adding a blog is a config change, not a wiki change. Do notgit add wiki/unless you also made wiki changes this session.
Corporate Tech Blog Feed Discovery
Corporate tech blogs (Anthropic Engineering, OpenAI Developers, Sierra, etc.) often have non-standard RSS setups. Follow this extended discovery workflow:
For single-blog discovery, use the workflow below. For batch discovery across dozens of URLs (e.g., when ingesting a curated company list), use the parallel curl approach in references/rss-batch-discovery.md.
Discovery Priority (updated)
- Check HTML
<link>tags forapplication/rss+xmlorapplication/atom+xml - Try root-level paths FIRST — corporate blogs often place feeds at the domain root, NOT under
/blog/:curl -sI "https://example.com/rss.xml" | head -1 # ← TRY THIS FIRST curl -sI "https://example.com/feed.xml" | head -1 curl -sI "https://example.com/atom.xml" | head -1 curl -sI "https://example.com/index.xml" | head -1 - Then try the blog subdirectory:
curl -sI "https://example.com/blog/rss.xml" | head -1 curl -sI "https://example.com/blog/feed.xml" | head -1 - Check the sitemap for blog post URLs — many corporate sites expose all blog URLs with
lastmoddates:curl -s "https://example.com/sitemap.xml" | grep -i "blog\|engineering" - Fall back to HTML scraping with
--scrape-selectorif no RSS exists.
Real-World Examples
| Blog | RSS URL | Discovery Method |
|---|---|---|
| OpenAI Developers Blog | https://developers.openai.com/rss.xml |
<link rel="alternate"> in HTML head (root-level, not /blog/rss.xml) |
| Sierra Blog | https://sierra.ai/rss.xml |
Root-level trial-and-error (not advertised in HTML) |
| Anthropic Engineering | NONE | HTML scrape-selector (article a) — sitemap at anthropic.com/sitemap.xml has all /engineering/ URLs |
HTML Scrape-Selector Fallback
When no RSS feed exists, use blogwatcher-cli's --scrape-selector:
blogwatcher-cli add "Blog Name" "https://example.com/blog" --scrape-selector "article a"
Selector guidance:
- Start broad (
article a) and verify with a scan - If the selector catches too many non-article links, refine it
- Test immediately after adding:
blogwatcher-cli scan "Blog Name" - Common selectors that work:
article a,a[href^="/blog/"],h2 a,.post-title a
Successful examples:
article a→ Anthropic Engineering (found 27 articles)article h2 a→ typical blog listing pattern
Post-Addition: Mark Initial Backlog as Read
After adding a new blog, the first scan finds ALL historical articles. Mark them as read so future scans only highlight new content:
blogwatcher-cli read-all --blog "Blog Name" --yes
Sitemap-Based Monitoring (Advanced)
For high-value corporate blogs without RSS (e.g., Anthropic Engineering), sitemap monitoring is more reliable than HTML scraping. The sitemap_monitor.py script handles this automatically.
Script: ~/ai-topics/scripts/sitemap_monitor.py (source of truth) → ~/.hermes/scripts/sitemap_monitor.py (cron execution copy)
How it works:
- Fetches the site's
sitemap.xml - Filters URLs matching a pattern (e.g.,
/engineering/) - Compares against previously seen URLs stored in
~/.hermes/processed_{name}.json - Scrapes new articles with BeautifulSoup + httpx
- Saves raw articles to
wiki/raw/articles/ - Outputs checkpoint JSON to
~/.hermes/cron/data/sitemap_monitor/latest.json
Cron integration: sitemap-monitor job (ID: c7890b6e0e19) runs daily at 0 6 * * * UTC (JST 15:00) with no_agent=True. Raw articles are later picked up by the dreaming pipeline (18:00 UTC).
Single script, multiple companies: The SOURCES list supports unlimited companies in one script. Each source gets its own state_file for independent deduplication. No per-company script needed.
SOURCE config template (each company gets one entry):
{
"name": "Company Blog", # Display name for logging
"url": "https://company.com/blog", # Blog homepage
"sitemap_url": "https://company.com/sitemap.xml", # Sitemap URL
"url_pattern": "/blog/", # URL filter (e.g. "/blog/", "/engineering/", "/news/")
"file_prefix": "company-name", # Used in raw article filenames: {date}_{prefix}_{slug}.md
"title_suffixes": [" | Company", " - Company"], # Patterns to strip from <title> tags
"state_file": os.path.expanduser("~/.hermes/processed_company_blog.json"),
}
Field notes:
file_prefix— required. Prevents filename collisions between sources.title_suffixes— optional but recommended. Many corporate blogs append" | Company Name"to<title>. Stripping these keeps downloaded article titles clean.url_pattern— must match blog post URLs exactly. If the blog is at/blog/post-slug, use/blog/. If at/news/post-slug, use/news/. Too broad a pattern (e.g.,/) matches every page on the site.state_file— one per source, unique paths. Two sources sharing one file corrupt dedup.
Adding a new sitemap-based source:
- Edit
sitemap_monitor.py— add an entry to theSOURCESlist following the template above. Verifyurl_patternby checking the sitemap. - Add to OPML Company group with
type="sitemap" - Do NOT add to blogwatcher DB (sitemap monitor is independent)
- Copy updated script to
~/.hermes/scripts/ - Commit:
cd ~/ai-topics && git add config/feeds/ scripts/ && git commit && git push - First run scrapes ALL historical articles (state file seeds the seen-URL set). Subsequent cron runs only find new posts.
Company Blog Monitoring Tiers
When ingesting a large list of companies (e.g., from Paraform Talent Density Index, YC batches, startup rankings), prioritise by tier:
Tier 1 — AI core companies (monitor with RSS or sitemap immediately): Frontier labs, AI infrastructure, coding agents, embedding/voice models. These blogs produce content directly relevant to the wiki domain.
Tier 2 — Important but lower AI density: Autonomous vehicles, defense AI, legal AI, recruiting AI, dev-tool-adjacent. Add to sitemap monitor in the next batch. Worth monitoring but yield is lower.
Tier 3 — Non-AI companies on the list: E-commerce, wholesale marketplaces, billing platforms, generic analytics. Skip or add scrape-selector if they occasionally publish AI content.
Tier X — No blog found:
Some companies simply don't have a tech blog. Mark as no_blog in the entity page and move on.
Batch ingestion workflow:
- Run
references/rss-batch-discovery.mdon all blog URLs → get RSS found/not-found split - Add RSS-found blogs to blogwatcher DB + OPML in one batch (chain
blogwatcher-cli addcommands) - For RSS-not-found, check sitemaps → add Tier 1/2 to
sitemap_monitor.pySOURCES - Add sitemap sources to OPML Company group with
type="sitemap" - Run a manual first scan to seed state files
- Commit config + scripts together
Migrating from scrape-selector to sitemap
# 1. Remove from blogwatcher DB
blogwatcher-cli remove "Anthropic Engineering" --yes
# 2. Add to sitemap_monitor.py SOURCES list
# 3. Add to OPML Company group with type="sitemap"
# 4. Run sitemap_monitor.py to seed the state file
/opt/data/.hermes/venv/bin/python sitemap_monitor.py
This avoids the fragility of CSS selectors breaking on site redesigns.
Bulk Adding Blogs from Curated Lists
When a reference article or newsletter recommends blogs to add, follow this workflow to keep both the blogwatcher DB and the OPML in sync.
Workflow Steps
- Extract blog recommendations from the source article using
web_extractorcurl. - Compare against current blogwatcher DB:
/opt/data/bin/blogwatcher-cli blogs— produce a list of missing blogs. - For each missing blog, add to blogwatcher DB:
blogwatcher-cli auto-discovers RSS at scan time, but the DB stores/opt/data/bin/blogwatcher-cli add "Blog Name" "https://example.com"feed_url=Noneuntil then. - Discover RSS/Atom feed URLs for each new blog. See
references/feed-discovery-patterns.mdfor platform-specific patterns.# Quick check: look for <link> tags curl -sL https://example.com | grep -iE '(rss|atom|feed)' | grep -oP 'href="[^"]*\.(xml|rss|atom)"' - Update the OPML at
~/ai-topics/config/feeds/blogs.opmlwith proper<outline>entries including the discoveredxmlUrl. Usepatchto insert entries before the closing</outline>tag:<outline type="rss" text="Blog Name" title="Blog Name" xmlUrl="https://example.com/feed.xml" htmlUrl="https://example.com/" /> - Verify OPML validity:
python3 -c "import xml.etree.ElementTree as ET; ET.parse('~/ai-topics/config/feeds/blogs.opml')" - Commit and push:
cd ~/ai-topics && git add config/feeds/blogs.opml && git commit -m "wiki: add N blogs from <source>" && git push
Pitfalls
- blogwatcher-cli add does NOT discover feeds immediately. It sets
feed_url=Noneuntil the nextscan. The OPML should still have the explicitxmlUrlto serve as the source of truth. - Blogger/Blogspot feeds sometimes use an ID-based path (
feeds/posts/default) rather than the blog name. The?alt=rssquery param also works. - Substack feeds accept both
https://name.substack.com/feedandhttps://name.substack.com/feed.xml. - Inactive blogs (no posts in 3+ years) should be skipped — mark them as blocked and note the reason.
- Not all blogs have RSS. For these, either skip or use
--scrape-selectorif the site has a predictable HTML structure. - Keep the OPML and blogwatcher DB in sync manually — they are independent stores. Adding to one does not update the other.
Reference
See references/feed-discovery-patterns.md for detailed platform feed URL patterns and the full ClickHouse curated list feed table.
Troubleshooting Feed Failures
Common issues when scanning feeds:
- 301/302/308 redirects: blogwatcher-cli does NOT follow redirects. Fix by removing the old entry and re-adding with the correct URL (often
www.prefix or different path). - 406 Forbidden / 404 Not Found: The feed URL in OPML may be wrong. Try discovering the actual feed URL via
curl -sL <blog_url>and looking for<link type="application/rss+xml">or<link type="application/atom+xml">tags. - Connection timeout: The site may be down or blocking the request. Consider removing from the tracking list.
Quarto on GitHub Pages: False Positive RSS Feeds
Quarto blogs with feed: true in _quarto.yml generate a <link rel="alternate" type="application/rss+xml" href="index.xml"> tag in the HTML <head>. However, the actual index.xml file may not exist on the deployed gh-pages branch — the Quarto build sometimes omits it even though it is configured.
Key insight: This is a false positive, not a regular broken feed. The HTML actively advertises an RSS feed that was never produced by the build.
Detection workflow:
- Check the feed URL:
curl -sI https://blog.example.com/index.xml | head -5— if 404, the feed does not exist despite the HTML<link>tag. - Verify the deployed branch:
curl -sL "https://api.github.com/repos/owner/repo/git/trees/gh-pages?recursive=1" | python3 -c "import sys,json; [print(t['path']) for t in json.load(sys.stdin).get('tree',[]) if 'xml' in t['path'].lower()]"— ifindex.xmlis absent from the tree, the build never produced it. - Confirm by checking the site config:
_quarto.ymllisting section withfeed: true+ URL base path.
Action: If confirmed broken, consider:
- Re-adding with
--scrape-selectorfallback if the blog has predictable HTML structure - Filing an issue with the blog author about the missing feed
- Removing from OPML if scraping is not viable
Important: OPML vs DB are independent
blogwatcher-cli maintains its own SQLite database at ~/.blogwatcher-cli/blogwatcher-cli.db. This is separate from the OPML file. Two critical implications:
- Removing a URL from the OPML does NOT remove it from the blogwatcher-cli DB. The DB still tracks and scans it independently.
- Importing an OPML adds feeds but does NOT remove feeds no longer in the OPML. Each
blogwatcher-cli importis additive.
If a feed fails in the daily RSS scan, check the blogwatcher-cli DB, not just the OPML:
# Check if a blog is in the DB
blogwatcher-cli blogs | grep "failing-blog.com"
# Remove from DB
blogwatcher-cli remove "failing-blog.com" --yes
Troubleshooting: When CLI Commands Return Empty/Unreadable Output
Sometimes blogwatcher-cli blogs or blogwatcher-cli articles return empty or garbled output (encoding issues, terminal rendering problems). Fall back to direct SQLite queries:
# List all tracked blogs with last scan date
sqlite3 ~/.blogwatcher-cli/blogwatcher-cli.db \
"SELECT name, url, feed_url, last_scanned FROM blogs ORDER BY name;"
# List articles for a specific blog (newest first)
sqlite3 ~/.blogwatcher-cli/blogwatcher-cli.db \
"SELECT a.title, a.url, a.published_date, a.is_read FROM articles a JOIN blogs b ON a.blog_id = b.id WHERE b.name = 'blogname.com' ORDER BY a.published_date DESC LIMIT 20;"
# Count articles per blog
sqlite3 ~/.blogwatcher-cli/blogwatcher-cli.db \
"SELECT b.name, COUNT(a.id) as article_count FROM blogs b LEFT JOIN articles a ON b.id = a.blog_id GROUP BY b.name ORDER BY article_count DESC;"
# Find unread article count
sqlite3 ~/.blogwatcher-cli/blogwatcher-cli.db \
"SELECT COUNT(*) FROM articles WHERE is_read = 0;"
# Check DB schema
sqlite3 ~/.blogwatcher-cli/blogwatcher-cli.db ".schema"
Tables: blogs (id, name, url, feed_url, scrape_selector, last_scanned), articles (id, blog_id, title, url, published_date, discovered_date, is_read, categories), schema_migrations.
Fix workflow for broken feeds
# 1. Remove the broken entry from the DB (not just OPML)
blogwatcher-cli remove "broken-blog.com" --yes
# 2. Verify the correct feed URL
curl -sL -o /dev/null -w "%{http_code}" "https://blog.com/feed.xml"
# 3. Re-add with correct URL
blogwatcher-cli add "broken-blog.com" "https://blog.com" --feed-url "https://blog.com/correct-feed.xml"
Excluding Blogs for Content Reasons (Off-Domain)
When a tracked blog's content is rarely or never relevant to the wiki domain (e.g., no AI/LLM content from a systems-programming blog), exclude it from RSS monitoring while preserving any wiki entity pages. This is distinct from broken-feed removal — the feed works, but the signal-to-noise ratio is too low.
Full Removal Workflow
Verify the blog is in OPML and DB:
grep -n "blog-name" ~/ai-topics/config/feeds/blogs.opmland/opt/data/bin/blogwatcher-cli blogs | grep "blog-name"Remove from OPML: Use
patchto delete the<outline>line. After removal, validate:python3 -c "import xml.etree.ElementTree as ET; ET.parse('/opt/data/ai-topics/config/feeds/blogs.opml'); print('OPML OK')"Remove from blogwatcher DB (always do both — OPML and DB are independent):
/opt/data/bin/blogwatcher-cli remove "Blog Name" --yesCheck for entity page:
search_files path=~/wiki/entities pattern="author-or-domain-name"- If entity page exists: update it with a Monitoring section marking RSS as excluded
- If no entity page: skip — nothing to update in wiki
Update entity page (if one exists):
- Remove the RSS URL from the infobox if present
- Add or update a Monitoring section:
## Monitoring | Source | Method | Status | |--------|--------|--------| | RSS Feed | — | ❌ Excluded (YYYY-MM-DD) — AI content is rare; RSS removed from blogwatcher + OPML to reduce noise | - Bump
updated:date in frontmatter
Update wiki/index.md: Append exclusion note to the entity's index entry (e.g.,
RSS excluded (YYYY-MM-DD, AI content rare))Update wiki/log.md: Prepend a log entry documenting what was removed and why. Use
python3prepend to avoid---frontmatter anchor traps:python3 -c " with open('/opt/data/wiki/log.md') as f: content = f.read() entry = '## [YYYY-MM-DD] config | RSS exclusion — Name\n\n...\n\n' with open('/opt/data/wiki/log.md', 'w') as f: f.write(entry + content) "Commit and push:
cd ~/ai-topics && git add config/feeds/blogs.opml wiki/ && git commit -m "config: exclude Blog Name from RSS monitoring (off-domain content)" && git push
Pitfalls
OPML and blogwatcher DB are independent — removing from one does NOT remove from the other. Always do both steps.
Don't delete entity pages — the exclusion is at the RSS level only. The entity page records that the person/org exists and why they're excluded.
Don't use
blogwatcher-cli read-allbefore removing — since the blog is being removed entirely, there's no need to mark old articles as read.Exclusion is reversible — if the author later starts writing AI content, re-add by following the Check-and-Add workflow. The entity page's Monitoring section documents the exclusion reason and date, making it easy to revisit.
Commit includes
wiki/when entity/index/log were updated, unlike simple OPML-only config commits.Auto-discovers RSS/Atom feeds from blog homepages when no
--feed-urlis provided.Falls back to HTML scraping if RSS fails and
--scrape-selectoris configured.Categories from RSS/Atom feeds are stored and can be used to filter articles.
Import blogs in bulk from OPML files exported by Feedly, Inoreader, NewsBlur, etc.
Database stored at
~/.blogwatcher-cli/blogwatcher-cli.dbby default (override with--dborBLOGWATCHER_DB).Use
blogwatcher-cli <command> --helpto discover all flags and options.blogwatcher-cli does NOT follow HTTP redirects — this is the #1 cause of scan failures when importing from OPML files with outdated URLs.