name: wiki-ingestion-pipelines
category: wiki
description: >-
  Umbrella for all wiki ingestion pipelines: newsletter ingest, blog ingest, active knowledge crawl, arXiv paper pipeline, OpenAI blog ingestion, dreaming knowledge consolidation, and pipeline troubleshooting.
Wiki Ingestion Pipelines (umbrella)
This umbrella skill covers all automated wiki ingestion pipelines — from external source to wiki page. Each section below covers one pipeline end-to-end, including cron configuration, checkpoint handling, and failure recovery.
All pipelines follow the same fundamental pattern:
fetch external content → checkpoint → triage → wiki-ingest (create/update pages) → commit
Section A: Newsletter Pipeline (newsletter-wiki-ingest)
Consume a pre-triaged checkpoint JSON from the newsletter-triage cron job and create/update wiki pages autonomously.
Pipeline Chain
newsletter-ingest (07:10 UTC) → newsletter-triage (07:20 UTC) → newsletter-wiki-ingest (07:40 UTC)
Input Format
Triage checkpoint JSON is injected via context_from cron chaining, or available at:
${HERMES_HOME}/cron/data/newsletter/triage_latest.json
Workflow
- Orient on wiki: read SCHEMA.md, index.md, recent log.md
- Load the checkpoint and filter to recommended_action === "take" decisions
- Detect prior batch: scan log.md for same source/newsletter title
- Process each take decision:
- ★★★★★ → New concept page
- ★★★★☆ → Update existing page
- ★★★☆☆ → Entity page update
- Create new pages first (write_file), then update existing (patch), then index.md and log.md
- Commit and push: cd ~/ai-topics && git add wiki/ && git commit -m "wiki: newsletter ingest ..." && git push
- If no take decisions, respond [SILENT]
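The filter-and-dispatch part of this workflow can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the decisions key is borrowed from the blog triage snippet later in this document (the newsletter checkpoint may use a different key, hence the key parameter), and the stars field and the action names are hypothetical.

```python
# Hypothetical mapping from star rating to the page action listed above.
STAR_ACTIONS = {
    5: "create_concept_page",
    4: "update_existing_page",
    3: "update_entity_page",
}


def take_decisions(checkpoint, key="decisions"):
    """Keep only entries whose recommended_action is "take"."""
    return [
        d for d in checkpoint.get(key, [])
        if d.get("recommended_action") == "take"
    ]


def plan_actions(checkpoint, key="decisions"):
    """Return (title, action) pairs, page creations first, then updates."""
    plan = []
    for decision in take_decisions(checkpoint, key):
        # "stars" is an assumed field name for the star rating.
        action = STAR_ACTIONS.get(decision.get("stars"))
        if action is not None:
            plan.append((decision.get("title", ""), action))
    # sorted() is stable, so the original order is kept within each group.
    return sorted(plan, key=lambda item: item[1] != "create_concept_page")
```

An empty plan corresponds to the [SILENT] response.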
Triage Failure Recovery
When checkpoint is missing or failed (most common: ok: false with output_path):
- Read the triage failure output file; it contains the embedded newsletter-ingest checkpoint with a candidates array
- Parse candidates for open.substack.com/pub/{pub}/p/{slug} canonical URLs (NOT tracking redirects)
- Call web_extract on canonical URLs to get newsletter post body with real article links
- Filter UI noise, assign star ratings, create/update wiki pages
- Log: "newsletter-triage failed to produce valid JSON; wiki-ingest performed triage directly"
See references/newsletter-wiki-ingest.md for full Substack URL resolution patterns and State A/B/C handling.
Key Pitfalls
- Detect follow-up batches before creating pages
- Subagents need explicit absolute paths (/opt/data/ai-topics/wiki/...)
- Japanese output is mandatory for cron reports
- Commit early for large batches to prevent data loss from tool call limits
See references/newsletter-wiki-ingest.md for full workflow details.
Section B: Active Knowledge Crawl (active-knowledge-crawl)
Daily cron job that proactively researches and ingests new concepts based on config/hot-topics.yaml.
Trigger
Scheduled cron job, or manual invocation by user.
Workflow
- Select Topics: Read hot-topics.yaml, extract topics with stale last_crawled (>3 days)
- Gap Discovery (optional when few stale topics): Survey major AI domains not in hot-topics.yaml
- Research: For each topic, crawl prerequisites, laterals, or deep-dives
- Create Wiki Pages: Web search → save raw source → create concept page → update index/log
- Update hot-topics.yaml: Set last_crawled: YYYY-MM-DD
- Commit: cd ~/ai-topics && git pull --rebase && git add wiki/ config/hot-topics.yaml && git commit && git push
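The topic-selection step can be sketched as follows. The shape of each topic entry (a name plus a last_crawled date) is an assumption about hot-topics.yaml; the function works on already-parsed data so it stays independent of any YAML library.

```python
from datetime import date, timedelta


def stale_topics(topics, today, max_age_days=3):
    """Return names of topics last crawled more than max_age_days ago.

    `topics` is a list of dicts like {"name": ..., "last_crawled": "YYYY-MM-DD"}
    (an assumed shape). Topics that were never crawled count as stale.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for topic in topics:
        raw = topic.get("last_crawled")
        # str() also covers date objects produced by a YAML parser.
        if raw is None or date.fromisoformat(str(raw)) < cutoff:
            stale.append(topic["name"])
    return stale
```

Calling stale_topics(parsed_topics, date.today()) would give the candidates for the Research step.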
Constraints
- Max 2 concepts per topic, max 6 total per run
- Source file in raw/articles/ REQUIRED before creating concept page
- Depth-1 only (grandchildren out of scope)
- arXiv-only (not peer-reviewed) papers FORBIDDEN as sources
- Git push may fail in cron — report status clearly
Critical Lessons
- Files may already be committed (duplicate run detection): check git ls-files first
- Verify files exist after delegate_task: explicit file existence check
- git pull --rebase fails with unstaged changes: use git stash && git pull --rebase && git stash pop
- YAML editing via str.replace is fragile: use sed with line numbers for hot-topics.yaml updates
- Avoid git add -A when sibling agents write to same repo; use selective git add
Section C: OpenAI Blog Ingestion (openai-blog-article-ingestion)
Simple workflow for ingesting openai.com/blog articles.
Workflow
- Scrape & Save: web_extract(url) → save to wiki/raw/articles/{date}-{slug}.md
- Check existing pages → patch existing concept/entity or create new
- Update index.md and log.md
- Commit: cd ~/ai-topics && git add wiki/ && git commit -m "wiki: ingest OpenAI blog article - {topic}" && git push
Pitfalls
- OpenAI blog URLs may have /index/ path prefix
- Don't create duplicate pages
- Create minimal stub entity pages for newly mentioned people/organizations
Section D: arXiv Paper Pipeline (arxiv-paper-pipeline)
Workflow for pulling arXiv papers, triaging by peer-review status, and ingesting into wiki.
Save Path
Always save to ~/wiki/raw/papers/ (NOT ~/wiki/raw/articles/).
Naming: {YYYY-MM-DD}_{arxiv_id}_{short-title}.md
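A small helper can keep the naming pattern consistent. This is only a sketch: the rule for deriving short-title (lowercase, alphanumeric words joined by hyphens, capped at six words) is an assumption, since the document specifies the pattern but not the slug rule.

```python
import re
from datetime import date


def paper_filename(saved_on, arxiv_id, title, max_words=6):
    """Build {YYYY-MM-DD}_{arxiv_id}_{short-title}.md from paper metadata."""
    # Assumed slug rule: lowercase alphanumeric words joined by hyphens.
    words = re.findall(r"[a-z0-9]+", title.lower())[:max_words]
    short_title = "-".join(words)
    return f"{saved_on.isoformat()}_{arxiv_id}_{short_title}.md"
```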
Triage Decision Matrix
| Paper Type | Action |
|---|---|
| Peer-reviewed conf/journal (NeurIPS, ICML, ICLR, ACL, CVPR, JMLR, TACL, Nature, Science) | ✅ Wiki-ingest OK |
| Tech company/industry research lab tech report (OpenAI, Meta, Google, MS, Anthropic, Huawei, Apple, Amazon, NVIDIA, and similar) | ✅ Wiki-ingest OK |
| arXiv-only (no venue) | ❌ BLOCK |
| User explicitly requests blocked paper | ✅ User override — ingest with blocked_reason note |
Peer-Review Detection
- Check abstract page for "Published in", "Accepted to"
- Search Semantic Scholar for
publicationVenue - If no venue found → mark as blocked
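The decision matrix and the venue check can be combined into one function, sketched below. The function name, the return shape, and the rule that a venue string starting with "arxiv" counts as no venue are assumptions for illustration; whether a report qualifies as an industry lab tech report is left to the caller.

```python
from typing import Optional

BLOCK_REASON = "arXiv-only (no venue)"


def normalize_venue(venue: Optional[str]) -> Optional[str]:
    """Treat empty strings and arXiv itself as 'no venue' (an assumption)."""
    if not venue:
        return None
    cleaned = venue.strip()
    if not cleaned or cleaned.lower().startswith("arxiv"):
        return None
    return cleaned


def triage_paper(venue: Optional[str], industry_report: bool = False,
                 user_override: bool = False) -> dict:
    """Apply the triage matrix: venue, then industry report, then override."""
    real_venue = normalize_venue(venue)
    if real_venue:
        return {"action": "ingest", "reason": f"venue: {real_venue}"}
    if industry_report:
        return {"action": "ingest", "reason": "industry lab tech report"}
    if user_override:
        # Ingest anyway, but keep the note explaining why it was blocked.
        return {"action": "ingest", "reason": "user override",
                "blocked_reason": BLOCK_REASON}
    return {"action": "block", "blocked_reason": BLOCK_REASON}
```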
Processing Steps
- Search arXiv API or Semantic Scholar
- For each candidate: fetch metadata → research peer-review → apply triage → save or block
- If accepted: save to papers/ → create/update wiki page
- Integrate user-provided context (tweets, discussions) alongside paper content
Name Collision Handling (RLM and similar proliferating frameworks)
When a paper's framework name collides with an existing concept page (e.g., Huawei's lambda-RLM vs Galanos's Lambda-RLM):
- Detect collision early: search_files for the framework name in existing concept slugs and content before creating pages
- Create a new concept page with a distinct, descriptive slug (e.g., typed-rlm instead of reusing the conflicting lambda-rlm)
- Add frontmatter aliases on the new page to capture the paper's name: aliases: [original-name, Y-Combinator X, etc.]
- Add a disambiguation warning to the EXISTING page's top: brief note with wikilink to the new page
- Build a comparison table on the new page showing control model, formal proofs, empirical scope, source lineage
- Update the parent concept page (e.g., rlm-recursive-language-models) to list both as named variants
- Update log.md: explain the collision, the resolution, and the comparison
See references/arxiv-paper-pipeline.md for further detail on blocked paper handling and JSON format.
Section E: Blog Pipeline Troubleshooting (blog-ingest-troubleshooting)
Debug and fix the full blog/newsletter cron pipeline chain.
Pipeline Architecture
ingest ──checkpoint──▶ triage ──checkpoint──▶ wiki-ingest
Checkpoint File Locations
| Pipeline | Ingest checkpoint | Triage checkpoint |
|---|---|---|
| Blog | ~/.hermes/cron/data/blog_ingest/latest.json | ~/.hermes/cron/data/blog_ingest/triage_latest.json |
| Newsletter | ~/.hermes/cron/data/newsletter/latest.json | ~/.hermes/cron/data/newsletter/triage_latest.json |
Most Common Failure: "Checkpoint Cascade"
- Ingest script times out → checkpoint stays stale
- Triage reads old checkpoint → sees 0 articles → nothing output
- Wiki-ingest reads empty triage → [SILENT]
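A cascade usually starts with a stale ingest checkpoint, so checking file age before running triage can surface it early. The sketch below assumes a 24-hour freshness window (the pipelines are described as daily); the function names are illustrative.

```python
import os
import time


def checkpoint_age_hours(path, now=None):
    """Return the checkpoint's age in hours, or None if the file is missing."""
    try:
        mtime = os.path.getmtime(os.path.expanduser(path))
    except OSError:
        return None
    now = time.time() if now is None else now
    return (now - mtime) / 3600.0


def is_stale(path, max_age_hours=24.0, now=None):
    """A missing checkpoint, or one older than the window, counts as stale."""
    age = checkpoint_age_hours(path, now)
    return age is None or age > max_age_hours
```

For example, is_stale("~/.hermes/cron/data/blog_ingest/latest.json") returning True suggests re-running ingest before triage.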
Re-executing a Pipeline
- Run ingest jobs first (blog + newsletter concurrently)
- Then triage jobs
- Then wiki-ingest jobs
⚠️ cronjob(run) is async — run ingest scripts directly from terminal if cron scheduler fails:
| Pipeline | Script path |
|---|---|
| Blog ingest | python3 ~/.hermes/scripts/blog_ingest.py |
| Newsletter ingest | python3 ~/scripts/process_email.py |
Stage-Specific Issues
Ingest:
- Missing daily_inbox_collect module → create stub module
- Wrong DB path → use ~/.blogwatcher/blogwatcher.db
- Pre-run script timeout → parallelize with ThreadPoolExecutor, write checkpoint before scraping
- CRITICAL: SQLite is_read dedup pattern. The blogwatcher DB (~/.blogwatcher/blogwatcher.db) has an articles table with is_read (boolean) and discovered_date (date) columns. If blog_ingest.py times out at 120s, the most likely cause is query_todays_articles() in daily_inbox_collect.py lacking a date filter, causing it to fetch ALL articles (thousands of rows) instead of just today's. The fix:
  - Ensure SQL query includes WHERE discovered_date >= date('now', '-1 day') AND is_read = 0
  - After successful scrape+save in blog_ingest.py, call mark_articles_as_read() to set is_read = 1
  - This prevents duplicate processing on subsequent cron runs
  - The daily_inbox_collect module lives at ~/.hermes/scripts/daily_inbox_collect.py
  - After fixing scripts, re-run the pipeline and cd ~/ai-topics && git add wiki/ && git commit && git push
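The is_read dedup pattern can be sketched with the standard sqlite3 module. The function names come from the text above, but their signatures, the id, title, and url columns, and passing an open connection are assumptions; the real daily_inbox_collect.py may differ.

```python
import sqlite3


def query_todays_articles(conn):
    """Fetch only unread articles discovered within the last day."""
    return conn.execute(
        "SELECT id, title, url FROM articles "
        "WHERE discovered_date >= date('now', '-1 day') AND is_read = 0"
    ).fetchall()


def mark_articles_as_read(conn, article_ids):
    """Set is_read = 1 so later cron runs skip these rows."""
    conn.executemany(
        "UPDATE articles SET is_read = 1 WHERE id = ?",
        [(article_id,) for article_id in article_ids],
    )
    conn.commit()
```

Calling mark_articles_as_read only after the scrape and save succeed keeps failed articles eligible for the next run.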
Triage: Reads from ingest checkpoint. Empty checkpoint → no output. May also produce markdown report instead of valid JSON — downstream wiki-ingest will fail with parse error. See references/blog-ingest-troubleshooting.md for recovery.
Wiki-ingest: Reads from triage checkpoint. No take decisions → [SILENT].
Script Dual Location
Scripts live in TWO locations that must be kept in sync:
| Location | Purpose | Git-tracked? |
|---|---|---|
| ~/ai-topics/scripts/ | Source of truth | ✅ Yes |
| ~/.hermes/scripts/ | Cron execution copy | ❌ No |
When fixing: edit ai-topics/scripts, then cp to .hermes/scripts.
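Drift between the two locations is easy to miss, so a quick comparison before debugging can help. The sketch below uses the standard filecmp module; the function name and the .py suffix filter are illustrative choices.

```python
import filecmp
import os


def out_of_sync(source_dir, cron_dir, suffix=".py"):
    """List scripts that are missing from cron_dir or differ from source_dir."""
    source_dir = os.path.expanduser(source_dir)
    cron_dir = os.path.expanduser(cron_dir)
    drifted = []
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        if not name.endswith(suffix) or not os.path.isfile(src):
            continue
        dst = os.path.join(cron_dir, name)
        # shallow=False compares file contents, not just size and mtime.
        if not os.path.isfile(dst) or not filecmp.cmp(src, dst, shallow=False):
            drifted.append(name)
    return drifted
```

For example, out_of_sync("~/ai-topics/scripts", "~/.hermes/scripts") lists the files that still need to be copied.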
Section F: Dreaming — Knowledge Consolidation Cycle (dreaming)
Automated consolidation process analyzing recently collected articles and folding significant findings into the wiki.
Pipeline
- Phase 1 (pre-run script): ~/ai-topics/scripts/dreaming.py collects RSS scan articles, newsletters, existing wiki pages
- Phase 2 (LLM processing): Analyzes, creates/updates wiki pages, commits
Workflow
- Duplicate Check: Review what adjacent scheduled jobs already completed (daily inbox update, active crawl, etc.)
- Light Sleep (Screening): Review articles not already processed, group by semantic themes
- REM (Flat Synthesis): Score each theme using weighted signals (relevance 0.30, frequency 0.25, query_diversity 0.15, recency 0.15, consolidation 0.10, conceptual_richness 0.05)
  - Score ≥ 0.65: Create or update wiki page
  - Score 0.45 to < 0.65: Add to existing page or log for review
- NJ Delivery Filter: Apply Newsjacking lens (0-5) to select what to deliver
  - NJ ≥ 4: Lead story; NJ = 3: Secondary; NJ ≤ 1: Omit from delivery
- Deep Sleep (Replay-safe integration): Create/update wiki pages, cross-references, index/log, commit
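The REM scoring step is a weighted sum, sketched below with the weights listed above (they sum to 1.00). It assumes each signal is already normalized to the range 0 to 1; the action labels, and the "no_action" result below 0.45, are assumptions because the workflow does not say what happens in that band.

```python
WEIGHTS = {
    "relevance": 0.30,
    "frequency": 0.25,
    "query_diversity": 0.15,
    "recency": 0.15,
    "consolidation": 0.10,
    "conceptual_richness": 0.05,
}


def theme_score(signals):
    """Weighted sum of signals; missing signals count as 0."""
    return sum(w * float(signals.get(name, 0.0)) for name, w in WEIGHTS.items())


def rem_action(score):
    """Map a theme score to the thresholds used in the REM step."""
    if score >= 0.65:
        return "create_or_update_page"
    if score >= 0.45:
        return "add_to_existing_or_log"
    return "no_action"  # assumed; not specified in the workflow
```

Per the pitfalls in this section, a score above the threshold is still not a substitute for checking existing pages first.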
Sub-Patterns
- Pattern A: Existing coverage depth check — don't auto-update, check if page already covers the insight
- Pattern B: Newsletter noise filtering (Substack UI elements, redirect chains)
- Pattern C: Batch entity discovery — create missing entity pages for recurring people/companies
- Pattern D: Duplicate detection matrix (filename, index entry, content grep, session_search)
0-Article Recovery Workflow (Shell Commands)
When the dreaming checkpoint reports collected_articles=0, raw articles may still exist that other pipelines didn't consume. Use this concrete workflow:
Step 1: Count recent raw articles
find ~/wiki/raw/articles -name "*.md" -mtime -3 -size +500c | wc -l
Step 1.5: Cross-pipeline dedup check (FIRST — saves the most time)
Before scanning raw articles, check the latest blog triage JSON. This immediately rules out the entire blog-ingest batch (typically 15-20 articles already decided as skip/reference), catching ~70% of raw articles from the blog pipeline.
# Check blog triage exists
ls -la ~/.hermes/cron/data/blog_ingest/triage_latest.json
# Also check newsletter triage
ls -la ~/.hermes/cron/data/newsletter/triage_latest.json
Read the triage JSON with a Python script (pipe_to_interpreter blocked in cron mode — use write_file to /tmp/ then terminal python3):
import json
import os

# Path to the latest blog triage checkpoint
blog_path = os.path.expanduser("~/.hermes/cron/data/blog_ingest/triage_latest.json")
with open(blog_path) as f:
    d = json.load(f)

# One line per decision: action, source, and a truncated title
for x in d.get("decisions", []):
    print(f"{x.get('recommended_action', '?')}: {x.get('source_name','')} - {x.get('title','')[:60]}")
Articles already decided in blog/newsletter triage should be marked as skip (already captured by blog pipeline) before proceeding to full analysis. This is the single most time-saving step in the recovery workflow.
Step 2: Find genuinely unprocessed articles
find ~/wiki/raw/articles -name "*.md" -size +500c -mtime -3 | while IFS= read -r f; do
  base=$(basename "$f" .md)
  count=$(grep -rl "$base" ~/ai-topics/wiki/entities/ ~/ai-topics/wiki/concepts/ ~/ai-topics/wiki/log.md 2>/dev/null | wc -l)
  if [ "$count" -eq 0 ]; then
    size=$(stat -c%s "$f")
    echo "UNPROCESSED: $base ($size bytes)"
  fi
done
This checks each article filename against entity pages, concept pages, AND log.md. An article is "unprocessed" only if zero references exist anywhere.
Step 3: Filter by AI relevance
Read each unprocessed article's first 50+ lines. Skip:
- Vintage computing, math, F1, politics, general security (non-AI)
- Event announcements, marketing promos (low wiki value)
- Link blog posts already covered by another source (check krebsonsecurity, simonwillison references)
Step 4: Check existing entity page coverage
First verify entity page exists, then check content depth:
# Quick existence check (faster than grep)
ls ~/ai-topics/wiki/entities/<entity>.md 2>/dev/null && echo "EXISTS" || echo "MISSING"
# Content depth check
grep -E "^##" ~/ai-topics/wiki/entities/<entity>.md
# Also check for article-specific keywords
grep -i "keyword-from-article" ~/ai-topics/wiki/entities/<entity>.md
If the entity page exists but lacks the article's specific content → enrichment candidate (TAKE/REFERENCE).
Step 5: Build triage JSON
Since execute_code is blocked in cron mode, use write_file to /tmp/dreaming_triage.py then terminal python3 /tmp/dreaming_triage.py. Key: use None (Python) not null (JS) for optional fields.
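A script written to /tmp/dreaming_triage.py might look like the sketch below. The decisions key and the per-decision fields mirror the blog triage snippet earlier in this section; the notes field and the output path being supplied by the caller are assumptions. The point it illustrates is that Python None is serialized as JSON null.

```python
import json


def write_triage(path, decisions):
    """Write triage decisions to path; None values become JSON null."""
    payload = {"decisions": decisions}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
    return payload


def example_decision():
    """One decision with an optional field left unset (hypothetical fields)."""
    return {
        "recommended_action": "take",
        "source_name": "example-source",
        "title": "Example title",
        "notes": None,  # use None here, never the bare word null
    }
```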
Step 6: Archive skip/reference items
After saving the triage JSON, archive skip and reference decisions for later re-evaluation:
cd ~/ai-topics && python3 scripts/archive_triage.py dreaming --keep-reference
Pitfalls
- Duplicate detection is MANDATORY
- Always check existing pages first (don't trust 0.65 threshold alone)
- Log.md corruption via patch (accidental | prefix)
- Pre-run script timeout → fallback file at /opt/data/.hermes/cron/data/dreaming/grouped_themes_latest.json
- Stale dreaming themes (2-3 days old) may already be processed by daily pipelines
- 0-article doesn't mean nothing to do: collected_articles=0 means other pipelines consumed sources, but raw articles may have arrived AFTER those pipelines ran. Always run the 0-article recovery workflow.
- Cross-pipeline dedup order matters: Check blog triage JSON FIRST (~/.hermes/cron/data/blog_ingest/triage_latest.json); it instantly rules out 70%+ of raw articles. Then check log.md, then wiki pages. Reading articles should be the LAST step, not the first.
- grep -rl with target='files' is NOT a filename lookup: search_files(target='files') searches file content with regex, not filenames. Use find + grep -rl for true filename-based discovery of unprocessed articles.
- execute_code blocked in cron mode: Write Python scripts to /tmp/ via write_file, then run with terminal python3 /tmp/script.py. Do NOT use cat file | python3 (pipe_to_interpreter blocked).
- -mtime window must match: Step 1 (count) and Step 2 (find unprocessed) must use the same -mtime value. Step 1 uses -mtime -3; Step 2 must also use -mtime -3, not -mtime -1.
Section G: Newsletter Triage (newsletter-triage)
URL Resolution Patterns (CRITICAL — raw files contain tracking URLs, not canonical)
Substack newsletters:
- Look for open.substack.com/pub/{publication}/p/{slug} (usually Link 7 or 9)
- Canonical form: https://www.{publication}.com/p/{slug} or use the open.substack URL directly with web_extract
- IGNORE: substack.com/redirect/2/... (resolves to app download), substack.com/app-link/post?... (email tracking)
- For author attribution: extract from the raw file's substack.com/@authorname links
Beehiiv newsletters:
- URLs are wrapped as link.mail.beehiiv.com/v1/c/...
- Resolution: web_search with subject line + date to find canonical article URL
- The raw beehiiv newsletter digest file is saved in wiki/raw/newsletters/ but individual links need external resolution
Cron pipeline context: The newsletter-ingest cron job saves raw digests to wiki/raw/newsletters/ — this is the source data. Each digest contains 16-20 links, most of which are tracking/redirect URLs requiring resolution.
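The URL patterns above can be turned into a small classifier, sketched below. The regular expressions follow the patterns named in this section; the category labels and the assumption that publication names and slugs contain only letters, digits, hyphens, and underscores are illustrative.

```python
import re

CANONICAL_SUBSTACK = re.compile(
    r"https?://open\.substack\.com/pub/[A-Za-z0-9_-]+/p/[A-Za-z0-9_-]+"
)
TRACKING_PATTERNS = (
    re.compile(r"https?://(?:www\.)?substack\.com/redirect/"),
    re.compile(r"https?://(?:www\.)?substack\.com/app-link/post"),
)
BEEHIIV_WRAPPED = re.compile(r"https?://link\.mail\.beehiiv\.com/v1/c/")


def classify_url(url):
    """Label a newsletter link by how it should be resolved."""
    if CANONICAL_SUBSTACK.match(url):
        return "canonical"
    if any(p.match(url) for p in TRACKING_PATTERNS):
        return "ignore"
    if BEEHIIV_WRAPPED.match(url):
        return "needs_search"
    return "other"


def canonical_substack_urls(text):
    """Extract unique open.substack.com post URLs, in order of appearance."""
    seen, urls = set(), []
    for match in CANONICAL_SUBSTACK.finditer(text):
        url = match.group(0)  # query strings and trailing brackets are excluded
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```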
Classification Criteria
| Level | Criteria | Action |
|---|---|---|
| Critical | Direct AI agent/LLM relevance, comprehensive landscape updates, major product launches | Create new concept/entity pages, major enrichments |
| High | Specific tooling/workflow coverage, industry context with wiki actionability | Enrich existing entities, create concept pages |
| Medium | Weekly roundups with 1-2 relevant items | Selective entity enrichment |
| Low | No wiki actionability | Skip |
Triage Output Format
Save JSON with: triage_timestamp, run_id, newsletters[] (each with message_id, subject, source, date, canonical_url, classification, summary, wiki_relevance, recommended_action), and summary (counts, key_themes[], recommended_wiki_updates[]).
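The output structure can be sketched as a pair of builders. The field names come from the list above; the ISO 8601 UTC timestamp, keying counts by classification level, and defaulting missing fields to None are assumptions.

```python
from datetime import datetime, timezone

NEWSLETTER_FIELDS = (
    "message_id", "subject", "source", "date", "canonical_url",
    "classification", "summary", "wiki_relevance", "recommended_action",
)


def newsletter_entry(**fields):
    """Build one newsletters[] entry; fields not supplied default to None."""
    unknown = set(fields) - set(NEWSLETTER_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return {name: fields.get(name) for name in NEWSLETTER_FIELDS}


def triage_document(run_id, newsletters, key_themes=(), recommended_wiki_updates=()):
    """Assemble the top-level triage JSON structure."""
    counts = {}
    for entry in newsletters:
        level = entry.get("classification") or "unclassified"
        counts[level] = counts.get(level, 0) + 1
    return {
        "triage_timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "newsletters": list(newsletters),
        "summary": {
            "counts": counts,
            "key_themes": list(key_themes),
            "recommended_wiki_updates": list(recommended_wiki_updates),
        },
    }
```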
Save Locations
- /opt/data/.hermes/cron/data/triage/newsletter-triage-{timestamp}.json (for downstream newsletter-wiki-ingest)
- /opt/data/ai-topics/wiki/raw/inbox/newsletter-ingest/{timestamp}.json (wiki inbox copy)
Key Pitfalls
- Raw newsletter files in wiki/raw/newsletters/ contain ONLY tracking/redirect URLs — you MUST resolve to canonical URLs before content extraction
- The beehiiv newsletter digest is saved but the source file may not appear in the raw directory listing (it IS there, just needs reading)
- Substack redirect chains: substack.com/redirect/... → app download page, NOT the article. Always use open.substack.com/pub/... pattern
- Multiple newsletters can arrive in the same batch; classify each independently
See references/newsletter-triage.md for detailed URL resolution patterns, classification criteria, and output format.
Section Z: Trending Topics Reporting (trending-topics)
See research/trending-topics-reporting skill for the end-to-end trending topics research/reporting workflow. This is NOT an ingestion pipeline — it produces a Japanese-language trending report saved to inbox/rss-scans/ — but it runs after all morning ingestion pipelines (12:00 UTC) and uses their output as input.
Quick Reference
# Run the trend detector
python3 ~/ai-topics/scripts/trending_topics.py --days 3
# Query DB for recent AI articles
python3 -c "import sqlite3; c=sqlite3.connect('/opt/data/.blogwatcher/blogwatcher.db').execute('''SELECT b.name, a.title, a.url FROM articles a JOIN blogs b ON a.blog_id=b.id WHERE DATE(a.discovered_date)>=date('now','-2 days') AND (a.title LIKE '%AI%' OR a.title LIKE '%agent%' OR a.title LIKE '%LLM%' OR a.title LIKE '%model%') ORDER BY b.name'''); [print(f' [{r[0]}] {r[1]}') for r in c.fetchall()]"
Key Pitfall: Dual Article Storage
Articles may be in EITHER /opt/data/ai-topics/wiki/raw/articles/ (canonical) OR /opt/data/.hermes/home/wiki/raw/articles/ (cron HOME). Always check both with find.
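Checking both storage locations can be done with one helper, sketched below. The two root paths are the ones named above; the function name is illustrative, and roots that do not exist are simply skipped.

```python
import os

ARTICLE_ROOTS = (
    "/opt/data/ai-topics/wiki/raw/articles/",        # canonical
    "/opt/data/.hermes/home/wiki/raw/articles/",     # cron HOME
)


def find_article(filename, roots=ARTICLE_ROOTS):
    """Return every path under the given roots whose basename is filename."""
    hits = []
    for root in roots:
        # os.walk yields nothing for a missing directory, so no check is needed.
        for dirpath, _dirnames, filenames in os.walk(os.path.expanduser(root)):
            if filename in filenames:
                hits.append(os.path.join(dirpath, filename))
    return hits
```

Two hits for the same filename indicate the article was saved in both locations.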
Section H: Daily RSS Triage (daily-rss-triage)
See references/daily-rss-triage.md for full workflow.
End-to-end pipeline for processing daily RSS scans: scan blogs → triage → ingest → commit.
Pipeline Position
Pre-run script executes blogwatcher scan, queries DB, reads newsletter, lists existing topics. The daily RSS triage is the triage + ingest stage of the blog pipeline.
Workflow
- Parse script JSON output for scan results
- Generate Japanese summary report → save to ~/ai-topics/inbox/rss-scans/daily-scan-YYYY-MM-DD.md
- If article_total == 0 AND no newsletter → [SILENT]
- Apply Newsjacking Triage Filter (0-5 score):
  - Trend Surfing, Polarizing Promise, Contrarian Insight, Pattern Interrupt, In-Group Signal
  - Score ≥ 3: Priority triage; 1-2: Standard; 0: Low priority
- For each article: check existing wiki topics, evaluate relevance, scrape content
- Create/update wiki pages, update index/log, commit
- All reports in Japanese
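The Newsjacking filter can be sketched as a count over the five named signals. Scoring one point per signal is an assumption inferred from the 0-5 range; the signal keys and priority labels are illustrative.

```python
NEWSJACKING_SIGNALS = (
    "trend_surfing",
    "polarizing_promise",
    "contrarian_insight",
    "pattern_interrupt",
    "in_group_signal",
)


def newsjacking_score(flags):
    """Count how many of the five signals are present (assumed 1 point each)."""
    return sum(1 for signal in NEWSJACKING_SIGNALS if flags.get(signal))


def triage_priority(score):
    """Map a 0-5 score to the priority bands from the workflow."""
    if score >= 3:
        return "priority"
    if score >= 1:
        return "standard"
    return "low"
```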
Key Pitfalls
- search_files unreliable for wiki directory discovery; use Python os.walk()
- RSS 429 rate limits: log failures, don't retry immediately
- Reddit URLs fail with web_extract; use browser tools as fallback
- Pre-staged files from previous runs: check git diff --staged before committing
Section H: Raw Article Curation (wiki-raw-article-curation)
See references/wiki-raw-article-curation.md for full workflow.
Systematically reduce the "unprocessed raw articles" count reported by wiki_health.py.
Series Registration Pattern
When multiple raw articles form a coherent series (e.g., slide decks from the same course, multi-part blog series), register them as a group in index.md rather than individually scattered across sections.
Workflow:
- Verify all series articles exist in raw/articles/ (they may already be saved but unregistered in index.md)
- Add a dedicated section header in index.md: ## Raw Articles — {Series Name} (N pages)
- Each entry includes: wikilink, brief description, and cross-references to companion materials (lecture transcripts, concept pages)
- If companion lecture transcripts exist in raw/transcripts/, update the "Raw Transcripts" section count and add entries
- Update the author's entity project page to link directly to raw slide articles (not just concept pages)
Example — Cheat at Search slide series:
## Raw Articles — Cheat at Search Slide Series (7 pages)
- [[raw/articles/YYYY-MM-DD_author_part-1]] — Part 1 title. Brief description. Companion: [[concepts/relevant-concept]]
- [[raw/articles/YYYY-MM-DD_author_part-2]] — Part 2 title. ...
Key pitfall: Entity project pages may link to concept pages (e.g., [[concepts/llm-search-judge]]) instead of raw slide articles. When registering a series, update entity pages to link directly to raw articles with concept pages as secondary references.
Detection
python3 ~/ai-topics/scripts/wiki_health.py | grep -A 3 "Unprocessed Raw Articles"
Mixed-Strategy Approach (< 100 unprocessed)
- "Already Consumed but Unlinked" check: Search unique phrase from article in L2 pages
- Tier 1 (High-Value): Deep-read and enrich existing wiki pages
- Tier 2 (Bulk-Associate): Add filename to existing page's References section
Association Targets
| Article Type | Best Target |
|---|---|
| Author blog | Their entity page |
| Technical concept | Relevant concept page |
| Newsletter tracking pixel | wiki/concepts/blogwatcher.md |
| Metadata-only artifacts | wiki/concepts/blogwatcher.md |
Bulk-Associate Workflow (>100 unprocessed)
- Domain analysis — group by domain/author
- Keyword-to-entity mapping
- Batch update entity pages
- Handle remaining unmatched articles
Pitfalls
- Substring matching quirk: filename stem must appear verbatim in L2 content
- Escape-drift on YAML frontmatter patches — use markdown References section instead
Section I: X Bookmarks Ingest (x-bookmarks-ingest)
Cron pipeline triggered by fetch_x_bookmarks.py that processes incoming X/Twitter bookmarks and ingests external articles into the wiki.
Pipeline Chain
fetch_x_bookmarks.py (pre-run script, every 6h) → x-bookmarks-ingest (agent cron)
Input Format
The agent cron job receives a JSON payload with new_bookmarks[] array. Each bookmark contains:
- id, author_id, created_at, text, public_metrics (bookmark_count, like_count, etc.)
- entities.urls[]: each URL has expanded_url, display_url, status (HTTP), and optionally title/description
- external_urls[]: URLs with status: 200 that are NOT X article links
- article: for X Articles (x.com/i/article/...), contains title field
Workflow
- Extract actionable URLs: Filter bookmarks for external_urls[] with status: 200. These are direct article links (OpenAI blog, Substack, arXiv, etc.). X Articles (x.com/i/article/...) with status: 500 require the fallback path.
- Scrape external articles: web_extract() each external URL. Save to wiki/raw/articles/{YYYY-MM-DD}_{source}_{slug}.md. The OpenAI blog and Meta/FAIR research blogs typically return full content.
- X Article content extraction (in priority order):
  a. Check article.plain_text FIRST. The bookmark/tweet metadata frequently contains the FULL article body in article.plain_text even when the URL returns HTTP 500. If article.plain_text has substantial content (>2KB), save it directly as the raw article and skip all API/mirror fallbacks. Both the article body AND inline code blocks (article.entities.code[]) are available. See wiki-entity-enrichment-from-article skill's references/x-article-plain-text-content.md for the full pattern, decision logic table, and content preservation notes.
  b. Try GetXAPI (if $GETXAPI_KEY is set): Structured JSON with headings and lists. Use the parent tweet's ID.
  c. Mirror search, for articles where article.plain_text is empty or too short:
    - Extract article.title from bookmark metadata
    - Run web_search with: "<article title>" 2026 (add author name or domain keywords)
    - Common mirrors: LangChain blog (blog.langchain.com/...), Substack, arXiv, personal blogs
    - If found → scrape and save. If not → mark as metadata-only, skip wiki creation.
    - Notable authors often cross-post: check blog.langchain.com, substack.com, author's personal site.
- Check for existing entity pages BEFORE creating new ones: This is the most common pitfall. Before creating any person/org entity page:
  - search_files(pattern="firstname.*lastname|@handle", path="~/wiki/entities", target="files")
  - Also check search_files(target="content") for aliases referencing the same person under different slugs (e.g., Varun Trivedy = Vivek Trivedy = @Vtrivedy10)
  - If an existing page is found (even with a slightly different slug), update THAT page; do NOT create a duplicate.
- Prioritize by engagement: Process highest-bookmark-count articles first (signal of importance).
- Create/update wiki pages: Follow wiki-entity-enrichment-from-article skill for entity/concept creation. For multi-article batches by the same author (e.g., two LangChain blog posts by Vivek Trivedy), use the Multi-Source Same-Author Sequential Enrichment pattern. For a single comprehensive article that touches many pages (entity + concept + methodology + org + anti-patterns), use the Comprehensive Article Multi-Page Cascade pattern; see references/comprehensive-article-multi-page-cascade.md.
- Update index.md and log.md: One log entry summarizing the entire batch. Update index.md entry count. Patch existing concept page descriptions if significantly changed.
- Commit and push: cd ~/ai-topics && git add wiki/ && git commit -m "wiki: X bookmarks ingest — <summary>" && git push
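The URL filtering, engagement ordering, and X Article fallback order can be sketched as below. The field names follow the input format described in this section, but treating external_urls[] entries as dicts with expanded_url and status is an assumption, as are the function names and the returned strategy labels.

```python
MIN_PLAIN_TEXT_BYTES = 2048  # the ">2KB" threshold from the workflow


def actionable_urls(bookmark):
    """Return external article URLs that responded with HTTP 200."""
    return [
        u["expanded_url"]
        for u in bookmark.get("external_urls", [])
        if u.get("status") == 200 and u.get("expanded_url")
    ]


def prioritize(bookmarks):
    """Order bookmarks by bookmark_count, highest first."""
    return sorted(
        bookmarks,
        key=lambda b: b.get("public_metrics", {}).get("bookmark_count", 0),
        reverse=True,
    )


def x_article_strategy(bookmark, getxapi_key=None):
    """Pick the extraction path for an X Article bookmark, in priority order."""
    article = bookmark.get("article") or {}
    plain = article.get("plain_text") or ""
    if len(plain.encode("utf-8")) > MIN_PLAIN_TEXT_BYTES:
        return "use_plain_text"
    if getxapi_key:
        return "try_getxapi"
    if article.get("title"):
        return "mirror_search"
    return "metadata_only"
```

In a cron run, getxapi_key would come from the GETXAPI_KEY environment variable, for example via os.environ.get("GETXAPI_KEY").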
Key Pitfalls
- Duplicate entity detection is MANDATORY: Before creating any person entity page, search for existing pages under different slugs (e.g., vtrivedy10.md vs varun-trivedy.md vs vivek-trivedy.md). Many tracked people already have pages created by build_x_wiki.py.
- X Articles behind auth wall: web_extract() on x.com/i/article/... returns JavaScript wall or login page. Check article.plain_text first; it often contains the full article body. Only use web_search for mirrors as a last resort when article.plain_text is insufficient or empty. See references/x-article-plain-text-content.md in wiki-entity-enrichment-from-article skill.
- Image-only bookmarks: Bookmarks where the only URLs are pic.x.com/... media links have no scrapable content. Skip them.
- Thread-only bookmarks: Bookmarks where the content is entirely in the tweet text with no external URL. Skip for article scraping (save as metadata-only).
- LangChain blog mirror pattern: When searching for X Article mirrors, blog.langchain.com is a common target; many agent/harness engineering articles are cross-posted there.
- Don't create duplicate Vivek Trivedy pages: He already has vtrivedy10.md (188 lines, canonical) + varun-trivedy.md (173 lines, duplicate waiting for dedup). Use [[vtrivedy10]] as the wikilink target.
Deliverable Format (cron)
The final response is auto-delivered. Report findings concisely:
- ✅ Processed articles with wiki actions
- 🆕 New pages created
- ✏️ Updated pages
- ⏭️ Skipped/auth-walled articles
- 🔍 Notable discoveries (duplicates found, new entities identified)
If nothing was scrapable (all bookmarks are image-only, thread-only, or X Articles with no mirrors found), respond [SILENT].
JS-Rendered Site Workaround (companion GitHub repo)
Many modern doc sites (Next.js SPAs) render sub-page content client-side, so web_extract() returns empty results on all pages except the SSR'd landing page.
Solution: Check the main landing page for a companion GitHub repo link (usually a "View on GitHub" badge or footer link). Clone the repo — it typically contains markdown READMEs and source code for each module/lesson.
See references/js-rendered-docs-workarounds.md for the full workflow, detection patterns, and the Braintrust Evals 101 case study.
Manual Article Ingest Patterns
When ingesting a single URL (not batch pipeline), see references/manual-article-ingest-patterns.md for:
- Author identification via secondary search when web_extract() omits the byline
- Related-concept detection: checking existing pages before creating new ones
- Author/org/product mapping: which entities to create from a single article
- MCP tool identification: extracting tool names from article data source mentions
- Pattern 6: Substack multi-part series batch discovery, checking /archive to find all parts when given part 1 only
General Pipeline Pitfalls
- Always orient first — read SCHEMA.md + index + recent log before any operation
- Detect follow-up batches — check log.md for same source before creating pages
- Escape-drift on YAML frontmatter patches: Add to markdown References section instead
- Partial-match corruption on patch: When old_string matches only a PREFIX of a target line (not the full content), the patch tool replaces only the matched portion and appends the REMAINING original text onto your new content. Fix: Always include enough trailing context in old_string to uniquely identify the ENTIRE line, preferably the full text of the line from the file. Verify by reading the file first with read_file(offset, limit) and using the exact bytes shown. After every patch on index files, immediately re-read the affected lines to check for appended garbage text. If corruption occurred, fix with a second patch that replaces the corrupted substring.
- Context compaction can mask prior work: review compaction summary for already-completed tasks
- Commit message & trap: The terminal tool interprets & as shell backgrounding. If your commit message contains & (e.g., set_to_none=True, agents & tools), use single quotes: git commit -m 'wiki: safe message here'. Double quotes fail silently. Also, && chaining triggers the tool's backgrounding detection; split into separate git add, git commit, git push calls when && chaining fails.
- Tag validation blocks commits: The pre-commit hook (~/.githooks/pre-commit-tag-validator.py) checks every YAML frontmatter tags: entry against SCHEMA.md's canonical taxonomy. If a tag isn't in SCHEMA.md, the commit is blocked with a violation message. Fix: (a) check SCHEMA.md for an existing canonical tag that matches your intent (e.g., use sandbox instead of inventing agent-sandboxing), or (b) add the new tag to SCHEMA.md before committing. Do NOT use --no-verify to bypass; the curator workflow expects tag hygiene.
- ⚠️ Content regression blocks commits (CRITICAL): The pre-commit hook .githooks/pre-commit-content-regression.py detects when entity/concept pages shrink by >50 lines AND >50%. This catches the #1 recurring wiki data-loss pattern: an ingestion pipeline overwriting a rich curated page with a skeleton/stub. 58 documented regression events across 9 destructive commits (worst: 7b69b67d with 15 pages, 383eff68 with 14 pages). Prevention: Before ANY write_file to wiki/entities/ or wiki/concepts/, read_file the existing page first. If it has >40 lines, use patch to add content; NEVER write_file to replace it. Recovery: When enriching a damaged page, always check git log for a richer historical version first: restore the richest version as base, merge any genuinely new content, then patch new info on top. See wiki-entity-enrichment-from-article skill's references/pre-write-verification.md for the full protocol, git history enrichment 4-step pattern, and references/content-regression-scanner.sh for scanning the commit history. Cron prompt enforcement: All ingestion cron jobs (raw-backlog-ingest, x-bookmarks-ingest, skeleton-enrich-daily, newsletter-wiki-ingest, blog-wiki-ingest) must include an explicit anti-overwrite warning in their prompt preamble.
- Patch tool Unicode escape-drift in cron mode: When enriching wiki pages from articles with smart quotes, em-dashes, or CJK characters, patch frequently fails with "Escape-drift detected". The cron-safe workaround is to write_file a Python script to /tmp/ and run it with terminal. This also handles multi-section insertions that would otherwise require multiple sequential patch calls. See references/comprehensive-article-multi-page-cascade.md for the Python script template.
- Total page count in index.md header must be correct
- concepts/_index.md drift: Many wikis have a separate concepts/_index.md listing concepts by category. Creating or enriching concept pages without updating this causes index drift. Always update BOTH wiki/index.md (main index) and any sub-index files (concepts/_index.md, entities/_index.md if they exist) when adding new pages.
- New files invisible to git add after auto-commit: If a cron job or auto-sync mechanism committed your new wiki files before your batch commit, git status won't show them and git add won't stage them. Verify with git ls-files | grep <new-file-name>. If the file shows as tracked but git diff HEAD --stat doesn't include it, it was already committed; just commit the remaining modifications.
- Subagents need explicit absolute paths; don't rely on HOME resolution
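The thresholds from the content regression pitfall above can be sketched as a pre-write check. This mirrors only the numbers stated in this document (more than 50 lines lost AND more than 50% lost, and the 40-line patch rule); it is not the hook's actual implementation, and the function names are illustrative.

```python
def is_content_regression(old_line_count, new_line_count,
                          min_lines_lost=50, min_fraction_lost=0.5):
    """True when a page shrinks by more than 50 lines AND more than 50%."""
    lost = old_line_count - new_line_count
    if lost <= 0 or old_line_count <= 0:
        return False
    return lost > min_lines_lost and (lost / old_line_count) > min_fraction_lost


def safe_write_mode(existing_line_count, patch_threshold=40):
    """Choose 'patch' for pages with more than 40 lines, else 'write_file'."""
    return "patch" if existing_line_count > patch_threshold else "write_file"
```

Running the check on the line counts before and after a planned write gives an early warning before the pre-commit hook rejects the commit.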