name: llm-wiki description: "Karpathy's LLM Wiki: build/query interlinked markdown KB." version: 2.1.0 author: Hermes Agent license: MIT platforms: [linux, macos, windows] metadata: hermes: tags: [wiki, knowledge-base, research, notes, markdown, rag-alternative] category: research related_skills: [obsidian, arxiv]
Karpathy's LLM Wiki
Build and maintain a persistent, compounding knowledge base as interlinked markdown files. Based on Andrej Karpathy's LLM Wiki pattern.
Unlike traditional RAG (which rediscovers knowledge from scratch per query), the wiki compiles knowledge once and keeps it current. Cross-references are already there. Contradictions have already been flagged. Synthesis reflects everything ingested.
Division of labor: The human curates sources and directs analysis. The agent summarizes, cross-references, files, and maintains consistency.
When This Skill Activates
Use this skill when the user:
- Asks to create, build, or start a wiki or knowledge base
- Asks to ingest, add, or process a source into their wiki
- Asks a question and an existing wiki is present at the configured path
- Asks to write a blog post, essay, or analysis drawing from their wiki knowledge
- Asks to lint, audit, or health-check their wiki
- References their wiki, knowledge base, or "notes" in a research context
- Runs as a scheduled active-crawl cron job to discover and fill knowledge gaps
Wiki Location
Location: Set via WIKI_PATH environment variable (e.g. in ~/.hermes/.env).
If unset, defaults to ~/wiki.
WIKI="${WIKI_PATH:-$HOME/wiki}"
In this environment ~/wiki should resolve to ~/ai-topics/wiki. Do not silently
switch to any alternate location.
The wiki is just a directory of markdown files — open it in Obsidian, VS Code, or any editor. No database, no special tooling required.
Architecture: Three Layers
wiki/
├── SCHEMA.md # Conventions, structure rules, domain config
├── index.md # Sectioned content catalog with one-line summaries
├── log.md # Chronological action log (append-only, rotated yearly)
├── raw/ # Layer 1: Immutable source material
│ ├── articles/ # Web articles, clippings
│ ├── papers/ # PDFs, arxiv papers
│ ├── transcripts/ # Meeting notes, interviews
│ └── assets/ # Images, diagrams referenced by sources
├── entities/ # Layer 2: Entity pages (people, orgs, products, models)
├── concepts/ # Layer 2: Concept/topic pages
├── comparisons/ # Layer 2: Side-by-side analyses
├── queries/ # Layer 2: Filed query results worth keeping
└── events/ # Layer 2: Time-bounded events (launches, incidents, announcements)
Layer 1 — Raw Sources: Immutable. The agent reads but never modifies these. Layer 2 — The Wiki: Agent-owned markdown files. Created, updated, and cross-referenced by the agent.
Subdirectory Pattern for Complex Topics
When a topic grows beyond 5-10 concept pages, organize into a subdirectory with a synthesis index:
wiki/concepts/
├── agentic-engineering.md # Parent overview page (brief)
├── agentic-engineering/ # Subdirectory for deep dives
│ ├── _index.md # Cross-leader synthesis hub
│ ├── red-green-tdd.md
│ ├── first-run-the-tests.md
│ └── using-git-with-agents.md
└── other-concept.md
The _index.md synthesizes perspectives from multiple opinion leaders on the shared topic,
mapping convergences, divergences, and unresolved questions. Use the cross-leader-synthesis
skill for this pattern.
Rules:
_index.mdis the entry point — always read it first for a topic overview- Parent page (
agentic-engineering.md) stays as a brief definition + links to subdirectory - Each concept page in the subdirectory links back to
_index.md _index.mdlinks to each leader's entity page via[[wikilinks]]Layer 3 — The Schema:SCHEMA.mddefines structure, conventions, and tag taxonomy.
Resuming an Existing Wiki (CRITICAL — do this every session)
When the user has an existing wiki, always orient yourself before doing anything:
① Read SCHEMA.md — understand the domain, conventions, and tag taxonomy.
② Read index.md — learn what pages exist and their summaries.
③ Scan recent log.md — read the last 20-30 entries to understand recent activity.
WIKI="${WIKI_PATH:-$HOME/wiki}"
# Orientation reads at session start
read_file "$WIKI/SCHEMA.md"
read_file "$WIKI/index.md"
read_file "$WIKI/log.md" offset=<last 30 lines>
Only after orientation should you ingest, query, or lint. This prevents:
- Creating duplicate pages for entities that already exist
- Missing cross-references to existing content
- Contradicting the schema's conventions
- Repeating work already logged
For large wikis (100+ pages), also run a quick search_files for the topic
at hand before creating anything new.
Initializing a New Wiki
When the user asks to create or start a wiki:
- Determine the wiki path (from
$WIKI_PATHenv var, or ask the user; default~/wiki) - Create the directory structure above
- Ask the user what domain the wiki covers — be specific
- Write
SCHEMA.mdcustomized to the domain (see template below) - Write initial
index.mdwith sectioned header - Write initial
log.mdwith creation entry - Confirm the wiki is ready and suggest first sources to ingest
SCHEMA.md Template
Adapt to the user's domain. The schema constrains agent behavior and ensures consistency:
# Wiki Schema
## Domain
[What this wiki covers — e.g., "AI/ML research", "personal health", "startup intelligence"]
## Conventions
- File names: lowercase, hyphens, no spaces (e.g., `transformer-architecture.md`)
- Every wiki page starts with YAML frontmatter (see below)
- Use `[[wikilinks]]` to link between pages (minimum 2 outbound links per page)
- When updating a page, always bump the `updated` date
- Every new page must be added to `index.md` under the correct section
- Every action must be appended to `log.md`
- **Provenance markers:** On pages that synthesize 3+ sources, append `^[raw/articles/source-file.md]`
at the end of paragraphs whose claims come from a specific source. This lets a reader trace each
claim back without re-reading the whole raw file. Optional on single-source pages where the
`sources:` frontmatter is enough.
## Frontmatter
```yaml
---
title: Page Title
created: YYYY-MM-DD
updated: YYYY-MM-DD
type: entity | concept | comparison | query | summary
tags: [from taxonomy below]
sources: [raw/articles/source-name.md]
# Optional quality signals:
confidence: high | medium | low # how well-supported the claims are
contested: true # set when the page has unresolved contradictions
contradictions: [other-page-slug] # pages this one conflicts with
---
Required fields: title, created, updated, type, tags, sources. Omitting sources is the most common frontmatter gap at scale — if you have no source to cite, use sources: [].
confidence and contested are optional but recommended for opinion-heavy or fast-moving
topics. Lint surfaces contested: true and confidence: low pages for review so weak claims
don't silently harden into accepted wiki fact.
raw/ Frontmatter
Raw sources ALSO get a small frontmatter block so re-ingests can detect drift:
---
source_url: https://example.com/article # original URL, if applicable
ingested: YYYY-MM-DD
sha256: <hex digest of the raw content below the frontmatter>
---
The sha256: lets a future re-ingest of the same URL skip processing when content is unchanged,
and flag drift when it has changed. Compute over the body only (everything after the closing
---), not the frontmatter itself.
Tag Taxonomy
[Define 10-20 top-level tags for the domain. Add new tags here BEFORE using them.]
Example for AI/ML:
- Models: model, architecture, benchmark, training
- People/Orgs: person, company, lab, open-source
- Techniques: optimization, fine-tuning, inference, alignment, data
- Meta: comparison, timeline, controversy, prediction
Rule: every tag on a page must appear in this taxonomy. If a new tag is needed, add it here first, then use it. This prevents tag sprawl.
Page Thresholds
- Create a page when an entity/concept appears in 2+ sources OR is central to one source
- Add to existing page when a source mentions something already covered
- DON'T create a page for passing mentions, minor details, or things outside the domain
- Split a page when it exceeds ~200 lines — break into sub-topics with cross-links
- Archive a page when its content is fully superseded — move to
_archive/, remove from index
Entity Pages
One page per notable entity. Include:
- Overview / what it is
- Key facts and dates
- Relationships to other entities ([[wikilinks]])
- Source references
Concept Pages
One page per concept or topic. Include:
- Definition / explanation
- Current state of knowledge
- Open questions or debates
- Related concepts ([[wikilinks]])
Comparison Pages
Side-by-side analyses. Include:
- What is being compared and why
- Dimensions of comparison (table format preferred)
- Verdict or synthesis
- Sources
Update Policy
When new information conflicts with existing content:
- Check the dates — newer sources generally supersede older ones
- If genuinely contradictory, note both positions with dates and sources
- Mark the contradiction in frontmatter:
contradictions: [page-name] - Flag for user review in the lint report
### index.md Template
The index is sectioned by type. Each entry is one line: wikilink + summary.
```markdown
# Wiki Index
> Content catalog. Every wiki page listed under its type with a one-line summary.
> Read this first to find relevant pages for any query.
> Last updated: YYYY-MM-DD | Total pages: N
## Entities
<!-- Alphabetical within section -->
## Concepts
## Comparisons
## Queries
## Events
Events section: Time-bounded happenings (product launches, security incidents, funding announcements). Use type: event in frontmatter. These are distinct from concept pages which capture enduring knowledge.
Scaling rule: When any section exceeds 50 entries, split it into sub-sections
by first letter or sub-domain. When the index exceeds 200 entries total, create
a _meta/topic-map.md that groups pages by theme for faster navigation.
log.md Template
# Wiki Log
> Chronological record of all wiki actions. Append-only.
> Format: `## [YYYY-MM-DD] action | subject`
> Actions: ingest, update, query, lint, create, translate, archive, delete
> When this file exceeds 500 entries, rotate: rename to log-YYYY.md, start fresh.
## [YYYY-MM-DD] create | Wiki initialized
- Domain: [domain]
- Structure created with SCHEMA.md, index.md, log.md
Log rotation: When log.md exceeds 500 lines, archive it as log-YYYY.md and create a fresh file with the header + rotation entry. See wiki-graph-health skill's references/log-rotation.md for the full procedure.
Core Operations
1. Ingest
When the user provides a source (URL, file, paste), integrate it into the wiki:
① Capture the raw source:
- URL → use
web_extractto get markdown, save toraw/articles/ - PDF → use
web_extract(handles PDFs), save toraw/papers/ - Pasted text → save to appropriate
raw/subdirectory - Name the file descriptively:
raw/articles/karpathy-llm-wiki-2026.md - Add raw frontmatter (
source_url,ingested,sha256of the body). On re-ingest of the same URL: recompute the sha256, compare to the stored value — skip if identical, flag drift and update if different. This is cheap enough to do on every re-ingest and catches silent source changes. - Single-page research collections: When a research listing page shows post
summaries but all individual post URLs return 404, extract the listing as a
single source and supplement with GitHub repos, HuggingFace blogs, and X posts.
See
references/single-page-research-collections.mdfor full procedure.
② Discuss takeaways with the user — what's interesting, what matters for the domain. (Skip this in automated/cron contexts — proceed directly.)
③ Check what already exists — search index.md and use search_files to find
existing pages for mentioned entities/concepts. This is the difference between
a growing wiki and a pile of duplicates.
④ Write or update wiki pages:
- New entities/concepts: Create pages only if they meet the Page Thresholds in SCHEMA.md (2+ source mentions, or central to one source)
- Existing pages: Add new information, update facts, bump
updateddate. When new info contradicts existing content, follow the Update Policy. - Sources: ⚠️ SOURCES GATE — Every page MUST have a
sources:frontmatter field. For raw article ingestion:sources: [raw/articles/<filename>]. For web extraction: list the URLs. If unsure, usesources: []— never omit the field entirely. - Cross-reference: Every new or updated page must link to at least 2 other
pages via
[[wikilinks]]. Check that existing pages link back. - Tags: ⚠️ TAG GATE — Before writing any page, read
wiki/SCHEMA.mdand ensure ALL tags come from its taxonomy. If you need a new tag, add it to SCHEMA.md FIRST. Never use ad-hoc or composite kebab-case tags. Violations are blocked by pre-commit hook. - Provenance: On pages synthesizing 3+ sources, append
^[raw/articles/source.md]markers to paragraphs whose claims trace to a specific source. - Confidence: For opinion-heavy, fast-moving, or single-source claims, set
confidence: mediumorlowin frontmatter. Don't markhighunless the claim is well-supported across multiple sources.
⑤ Update navigation:
- Add new pages to
index.mdunder the correct section, alphabetically - Update the "Total pages" count and "Last updated" date in index header
- Append to
log.md:## [YYYY-MM-DD] ingest | Source Title - List every file created or updated in the log entry
⑥ Report what changed — list every file created or updated to the user.
A single source can trigger updates across 5-15 wiki pages. This is normal and desired — it's the compounding effect.
6. JP→EN Translation Batch (Bilingual Wiki Maintenance)
When the wiki has Japanese-language content that needs systematic English translation,
run the batch workflow documented in references/jp-en-translation-batch.md:
Scan and rank — walk all non-raw
.mdfiles, count JP chars in body (after frontmatter)Select top N — recommend 8 files per batch (~30K JP chars). Exclude
log.md/log-2026.md.Choose translation method by file type — Three translation approaches, each suited to a different file structure:
File Type Best Approach When It Works When It Fails Fully-JP concept/entity (<200 lines, >90% JP in body) Full body rewrite via write_fileorexecute_code(Python triple-quoted strings)Coherent JP paragraphs that translate cleanly to one block Multi-section files where EN/JP boundaries are ambiguous Scattered-JP (mostly EN, JP in 10+ locations: headers, table cells, inline text) delegate_tasksubagent with context-aware translation instructionsFiles with scattered JP where context matters for accurate translation (log files, index files) Fully-JP files >500 lines where subagent hits token limits Light scattered-JP (JP in <10 specific locations) Per-file str.replace()dict viaexecute_codeKnown replacement strings in fixed positions (section headers, table column headers) Files where replacement strings have whitespace mismatches or Unicode-variant characters ⚠️ Chinese vs Japanese CJK characters: When scanning for Japanese content, the regex
[\u4E00-\u9FFF]captures both Chinese characters (漢字) and Japanese kanji. Many wiki files contain Chinese proper nouns (person names like 姚顺雨, organization names like 智谱AI, 月之暗面, product names). These are NOT Japanese and should be preserved. Use the hiragana/katakana-only regex[\u3040-\u309F\u30A0-\u30FF]to identify translatable Japanese text. Seereferences/chinese-proper-noun-handling.mdfor the verification script and detailed handling procedure. When a scan reports "CJK characters," verify whether they're Chinese proper nouns before attempting translation.⚠️ str.replace() has fundamental limits for scattered JP — A replacement dict with 200+ entries may only remove 15-30% of JP chars per pass on log files with scattered JP. The silent non-match rate is high because JP particles (の, を, へ) appear in many different contexts. After 3 aggressive passes with 400+ total replacement entries, ~60% of scattered JP remained. Delegate_task subagents are dramatically more effective for this case — each subagent can understand context and properly translate scattered JP in a single pass. Tested on a 1732-line log file with 4,015 JP chars: str.replace() removed 670 over 3 passes; one subagent pass removed the remaining 3,345 with context-aware translation.
When to use which approach in a batch: Split the 8 files into groups:
- Concept/entity files (small, coherent JP body) →
execute_codewith full body rewrites - Log files or large files with scattered JP →
delegate_task(1 subagent per file, up to 3 in parallel) - This prevents str.replace() frustration on scattered-JP files while keeping concept files fast
- Proven in production: The 2026-05-27 batch (8 files, ~7,362 JP chars) used this split: 5 concept files via
execute_code(full rewrite), 3 log files viadelegate_task(scattered-JP translation). All 8 achieved 0 residual JP.
⚠️ Fullwidth colon trap: After translation, 6 remaining "JP chars" in a 1732-line file turned out to be fullwidth colons
:(U+FF1A) — CJK punctuation, not Japanese language content. These appear in log entries, table headers, and inline descriptions. Always include acontent.replace('\uff1a', ':')step in the final cleanup pass. Also check for fullwidth semicolons;(U+FF1B).⚠️ Never use
assert new_jp == 0in batch scripts — an assertion on file #4 kills the entire batch, losing progress on files #1-3 and requiring differential recovery for files #5-8. Use print+continue instead (seereferences/jp-en-translation-batch.md→ Principle 1). ⚠️ Split fully-JP files into a separate batch from mixed JP/EN files — they have fundamentally different failure modes (full rewrite vs per-string match). Each batch type gets isolated debugging (seereferences/jp-en-translation-batch.md→ Principle 2). ⚠️ Quote-escaping trap:str.replace()values containing apostrophes ('spossessives) break single-quoted Python strings. Writing the script to a temp file (instead ofterminal("python3 -c ...")) only bypasses bash backtick substitution — Python single-quote conflicts remain. Two fixes:a. Use
\u2019(Unicode right single quotation mark) for apostrophes:'NVIDIA\u2019s inference'avoids the conflict entirely. b. Use triple-quoted Python strings ("""...""") forstr.replace()values so single quotes inside are parsed as literal characters. c. Escape with\'inside single-quoted strings:'NVIDIA\\'s inference'— works but requires backslash-escaped quotes which hurt readability.Commit both approaches as habits: write complex scripts to
/tmp/viawrite_file, AND use\u2019or triple-quoted strings for any line containing apostrophes. ⚠️ Multi-line replace trap: Astr.replace()with\n\nbetween heading and body can silently fail due to subtle whitespace differences. Use shorter, line-by-linestr.replace()calls instead of multi-line spans. Debug withrepr()on the target content to find exact characters. See pitfalls inreferences/jp-en-translation-batch.mdfor details.- Concept/entity files (small, coherent JP body) →
Process simplest to most complex — fewer JP chars first, index pages last.
Log via
execute_codewith header/chrono splitting (neverf.write(new_entry + content)).
See references/jp-en-translation-batch.md for the complete workflow, scan script, and pitfalls.
2. Query
When the user asks a question about the wiki's domain:
① Read index.md to identify relevant pages.
② For wikis with 100+ pages, also search_files across all .md files
for key terms — the index alone may miss relevant content.
③ Read the relevant pages using read_file.
④ Synthesize an answer from the compiled knowledge. Cite the wiki pages
you drew from: "Based on [[page-a]] and [[page-b]]..."
⑤ File valuable answers back — if the answer is a substantial comparison,
deep dive, or novel synthesis, create a page in queries/ or comparisons/.
Don't file trivial lookups — only answers that would be painful to re-derive.
Karpathy's insight: "Good answers can be filed back into the wiki as new pages. A comparison you asked for, an analysis, a connection you discovered — these are valuable and shouldn't disappear into chat history. This way your explorations compound in the knowledge base just like ingested sources do." This is the philosophical foundation for why queries belong in the wiki, not just in chat history.
Query page format — see references/query-page-format.md for the full template. Key elements:
- Question metadata block (question, questioner, date, channel, answer summary)
- Structured answer with tables and cross-references to wiki pages
- Conclusion with concrete recommendations
- Unresolved questions / open issues section
- Cross-reference to the wiki-wide concept the answer draws from (e.g., Karpathy's insight)
Creating the first query page: If wiki/queries/ doesn't exist yet, create it with mkdir -p. The index.md's Queries section will say > No queries filed yet. — replace this with the first entry. Subsequent queries get inserted alphabetically.
⑥ Update log.md with the query and whether it was filed.
3. Lint
When the user asks to lint, health-check, or audit the wiki:
① Orphan pages: Find pages with no inbound [[wikilinks]] from other pages.
# Use execute_code for this — programmatic scan across all wiki pages
import os, re
from collections import defaultdict
wiki = "<WIKI_PATH>"
# Scan all .md files in entities/, concepts/, comparisons/, queries/
# Extract all [[wikilinks]] — build inbound link map
# Pages with zero inbound links are orphans
② Broken wikilinks: Find [[links]] that point to pages that don't exist.
③ Index completeness: Every wiki page should appear in index.md. Compare
the filesystem against index entries.
④ Frontmatter validation: Every wiki page must have all required fields (title, created, updated, type, tags, sources). Tags must be in the taxonomy.
⑤ Stale content: Pages whose updated date is >90 days older than the most
recent source that mentions the same entities.
⑥ Contradictions: Pages on the same topic with conflicting claims. Look for
pages that share tags/entities but state different facts. Surface all pages
with contested: true or contradictions: frontmatter for user review.
⑦ Quality signals: List pages with confidence: low and any page that cites
only a single source but has no confidence field set — these are candidates
for either finding corroboration or demoting to confidence: medium.
⑧ Source drift: For each file in raw/ with a sha256: frontmatter, recompute
the hash and flag mismatches. Mismatches indicate the raw file was edited
(shouldn't happen — raw/ is immutable) or ingested from a URL that has since
changed. Not a hard error, but worth reporting.
⑨ Page size: Flag pages over 200 lines — candidates for splitting.
⑩ Tag audit: List all tags in use, flag any not in the SCHEMA.md taxonomy.
⑪ Log rotation: If log.md exceeds 500 entries, rotate it.
⑫ Report findings with specific file paths and suggested actions, grouped by severity (broken links > orphans > source drift > contested pages > stale content > style issues).
⑬ Append to log.md: ## [YYYY-MM-DD] lint | N issues found
Article Scanning & Prioritization Pipeline
When facing a large backlog of articles (500+) to potentially ingest:
- Scan all articles using
blogwatcher-cli articlesor equivalent - Parse into structured data (blog, title, URL, date) using execute_code
- Score by priority:
- Priority blogs (simonwillison.net, antirez.com, paulgraham.com, etc.): +10
- AI-relevant keywords in title (llm, agent, coding, model, inference, etc.): +1 per match
- Recency bonus: +1 per day within last 5 days
- Filter threshold: score >= 10 for direct Wiki creation, score 5-9 for reference, score < 5 skip
- Process in batches of 5 articles per delegate_task with max_iterations >= 20
- CRITICAL: verify files exist after delegate_task completes before claiming success:
search_files "new-page-name.md" path="$WIKI" target="files" - Update index.md and log.md in the same session after verification
5. Synthesize (Original Content Creation)
When the user asks to write original content (blog posts, essays, analyses, opinion pieces) sourced primarily from the wiki knowledge base:
① Search the wiki thoroughly — use search_files across concepts/, entities/,
comparisons/, and raw/articles/ for the topic and related terms. Cast a wide net
initially; narrow after you see what's available.
② Read the key pages in depth — prioritize concept pages with high cross-reference density and entity pages with documented philosophies. Don't skim — the best synthesis comes from deeply understanding a few sources, not shallowly scanning many.
③ Supplement with web search — only when wiki knowledge is thin or you need a specific external claim to bolster an argument. Prefer authoritative sources (arXiv papers, company blogs, recognized thought leaders). Cite these alongside wiki sources.
④ Compose the essay — write in the user's language. Match the requested tone (lively/活き活き, academic, conversational, provocative). Structure as a narrative journey: thesis → evidence → counterpoint → synthesis. Quote primary sources directly where they're vivid or iconic.
⑤ Cite everything — every major claim needs a source. Wiki pages get [[wikilinks]],
external URLs get inline links. Distinguish between the author's words and your analysis.
For a wiki-sourced essay, the ratio should be ~70% wiki + ~30% external/web.
⑥ Save as a raw article — use wiki/raw/articles/ with the pattern
YYYY-MM-DD_hermes_{topic-slug}.md for original content. Include YAML frontmatter
with title, date, author: Hermes (kzinmr's AI Topics), tags, and sources
(listing both wiki pages and external URLs).
⑦ Update log.md with the new article, key sources referenced, and a brief summary. Commit and push.
Quality guidelines:
- Depth over breadth: 3-4 well-explored arguments > 10 shallow mentions
- Concrete examples: Abstract claims need specific instances (model names, papers, experiments)
- Narrative arc: Open with the core tension, build through evidence, close with synthesis
- Quotable quotes: Direct quotes from Gwern, Chung, Sutton, etc. make the essay vivid
- User language: Write in the language the user used to make the request
When NOT to synthesize:
- The topic is peripheral to the wiki domain → redirect the user
- The wiki has < 3 relevant pages → web search first, note thin coverage
- The user wants a factual report, not an essay → use Query mode instead
4. Active Crawl (Proactive Gap Discovery)
When running as a scheduled cron job to discover and fill knowledge gaps (no user-provided sources):
① Research trending topics — use web_search to find recent AI/ML announcements,
arXiv papers (peer-reviewed only), and official blog posts from major labs.
Target 3-5 topics the wiki likely hasn't covered.
② Cross-reference against the wiki — use execute_code to check which
topics already have pages. Run the full pattern from references/cross-reference-check.md
(candidate list → os.walk() scan → sorted FOUND/MISSING report with counts and paths).
Additional check: after the filesystem scan, grep index.md for near-matches of
each MISSING candidate's core slug (e.g., search nvidia-nemotron for candidate
nvidia-nemotron-3). The index may list a canonical full name like
nvidia-nemotron-3-nano-omni that your simplified candidate slug won't match on
os.path.exists(). This prevents creating duplicate pages with different slugs
for the same entity.
③ Extract original sources — use web_extract on official sources (company blogs,
arXiv abstracts, tech reports). Pass all URLs in a single web_extract call to batch-fetch.
Skip aggregator sites and second-hand summaries. If web_extract returns truncated content
with "LLM summarization timed out", use browser_navigate on those URLs individually
as a fallback — it produces full text snapshots including all paragraphs, lists, and quotes.
Note: paywalled sources (Fortune, beehiiv, some NYT) may return sparse content —
cross-reference with alternative sources.
④ Save raw articles first — write_file to wiki/raw/articles/ with descriptive
names (YYYY-MM-DD_source-topic.md).
⑤ Create wiki pages — batch the creations. Each page must have: frontmatter,
4-6 cross-references, tags from the SCHEMA.md taxonomy.
⚠️ Cross-reference pre-flight: After writing all pages, verify ALL [[wikilinks]]
in new pages resolve to actual files BEFORE committing. Run the verification script
from references/cross-reference-prefight.md and patch any broken links immediately.
This avoids the common "commit → lint finds broken links → patch cycle" that costs
3-4 extra tool calls per broken link.
⑥ Update navigation — use patch to insert new index entries alphabetically
(see "Patch-based index insertion" below). Update section counts and total pages.
⑦ Validate tags before commit — Before git commit, verify ALL tags in ALL new/updated pages exist in SCHEMA.md. The pre-commit hook will catch violations, but catching them proactively avoids a blocked commit and retry cycle. Use this checklist:
Predict the violations (active-crawl specific): The #1 cause of tag violations in active-crawl is new company/product names used as tags. When you create entity pages for newly discovered companies/products (Cohere, Stability AI, Antigravity, etc.), those names MUST be in SCHEMA.md's People/Orgs or Products categories. Before writing any page, scan your planned tags for company/product identifiers and check whether SCHEMA.md already lists them. If not, add them to the relevant category line FIRST, then write the page with those tags.
General checklist:
- After writing all pages, scan each page's
tags:line with a mental check against SCHEMA.md categories - Company/product tags (cohere, stability-ai, antigravity, kubernetes, etc.) → add to People/Orgs, Products, or Infrastructure BEFORE committing
- Domain-neutral tags you assumed existed (e.g.,
music,typescript) → check SCHEMA.md; if missing, either add or replace with existing near-equivalent (audio-generation,developer-tooling) - Batch all SCHEMA.md tag additions with
patchcalls on the relevant category lines - Then
git add wiki/and commit. If the hook still blocks, check the error output for ALL offending files (not just yours — pre-existing staged violations from other pipelines also block)
⑧ Log and commit — single log entry covering the batch, commit + push.
Patch-Based Index Insertion
For large index.md files (500+ lines), use patch with old_string/new_string
to insert new entries alphabetically rather than reading the entire file.
Alternative: Bulk str.replace() via execute_code (10+ changes)
When making 10+ index changes (insertions + description updates), a single execute_code call with str.replace() across all updates can be more efficient than issuing 10+ individual patch calls:
with open(index_path) as f:
content = f.read()
# Read exact full lines BEFORE writing code
old = '- [[entities/daytona-io]] — old summary...'
new = '- [[entities/daytona-io]] — new summary with podcast details...'
content = content.replace(old, new)
# For insertions, use a 2-line anchor (line before + line after)
old_anchor = '- [[entities/evanhahn-com]] — **Blog** | evanhahn.com |\n- [[entities/fabiensanglard-net]] — **Blog** | fabiensanglard.net |'
new_with_insert = '- [[entities/evanhahn-com]] — **Blog** | evanhahn.com |\n- [[entities/exa]] — New entry summary...\n- [[entities/fabiensanglard-net]] — **Blog** | fabiensanglard.net |'
content = content.replace(old_anchor, new_with_insert)
with open(index_path, 'w') as f:
f.write(content)
CRITICAL - Anchor string accuracy: Every old_string anchor MUST be the exact full line as it appears in the file — no truncation, no abbreviation. A common failure mode is using a shortened anchor (e.g., stopping at the first sentence of a multi-sentence line), which silently fails to match. Read the exact lines with read_file(offset=N, limit=5) before writing the str.replace() code. If an anchor doesn't match:
- The str.replace silently does nothing for that entry
- Verify with
if old in content: print("OK") else: print("FAIL")for each replacement - Re-read the target lines to get the exact content and retry
When to use which approach:
- 1-4 index changes: Use individual
patchcalls (atomic, explicit, easy to verify) - 5+ index changes: Use
execute_codewithstr.replace()(fewer tool calls, batch-verifiable) - Both approaches require reading exact lines first — never guess what the anchor text looks like
CRITICAL: Do NOT compute insertion points programmatically. The index at scale
is NOT strictly alphabetical — entries drift over time as different agents and
pipelines add content. A programmatic "find first alphabetical successor" algorithm
will almost always place entries in the wrong position, creating duplicates that
require cleanup. Instead, use read_file with the target offset to visually
confirm the 2-3 surrounding lines, then use those exact lines as the old_string
anchor.
Discovery step (do this first): Read the relevant section of index.md with
read_file(offset=N, limit=20) to find the exact anchor lines. Do NOT use
execute_code with os.walk() or regex-based successor computation — it will
place entries incorrectly in non-alphabetical sections. See
references/index-insertion-discovery.md for the full pattern and failure cases.
# Instead of read_file + edit + write_file:
patch(
old_string="- [[entities/gemini]] — summary...\n- [[entities/gemma-4]] — summary...",
new_string="- [[entities/gemini]] — summary...\n- [[entities/gemini-enterprise-agent-platform]] — ...\n- [[entities/gemma-4]] — summary...",
path="~/wiki/index.md"
)
Use 2-3 surrounding lines as the old_string anchor to ensure uniqueness.
Also update the header counts and section counts with separate patch calls.
Advantages over read+write:
- No line-number prefixes to accidentally paste into the file
- Minimal token cost (only the changed lines)
- Atomic — either the match succeeds or nothing changes
- Works for 1500+ line files without loading the whole thing
Subagent Batch Management
When using delegate_task for wiki operations:
- Set max_iterations >= 20 per subagent for creating 5+ pages
- Each page creation costs ~3-4 tool calls (read reference + write + verify)
- Always include index.md and log.md updates in the same batch
- Never claim completion without verifying files exist after the subagent returns
- If a subagent hits max_iterations, check what was actually written and complete remaining work
CRITICAL: Path Resolution (Both Main Agent and Subagents)
Use the canonical profile paths:
- Repo root:
~/ai-topics - Wiki root:
~/wiki - Hermes scripts:
~/.hermes/scripts
~/wiki is the source of truth and is expected to resolve to the git-managed
wiki under ~/ai-topics/wiki. Do not infer or write to alternate paths such as
/opt/data/.hermes/home/wiki, /opt/data/home/wiki, or historical
~/.hermes/hermes-agent/wiki locations.
If a delegated context has an isolated HOME, it must still target the canonical
paths above. Treat any ~/.hermes/home/... path that appears during execution as
a runtime artifact, not a second wiki.
for f in os.listdir(src_dir):
if f.endswith('.md'):
src_path = os.path.join(src_dir, f)
dst_path = os.path.join(dst_dir, f)
if not os.path.exists(dst_path):
shutil.copy2(src_path, dst_path)
print(f"COPIED: {f}")
elif os.path.getsize(src_path) > os.path.getsize(dst_path):
shutil.copy2(src_path, dst_path)
print(f"UPDATED: {f}")
**Prevention strategy:** When writing subagent instructions, explicitly state the full absolute path:
- Use `/opt/data/ai-topics/wiki/concepts/` (NOT `~/wiki/concepts/`)
- Or include the copy script in the subagent's instructions as a final step
**Verification is still required** — even with absolute paths, subagents may ignore them.
## Git Merge Corruption Recovery
When a merge or automated edit corrupts wiki files (especially `log.md`), follow this recovery procedure:
### Symptoms to look for
1. **Line number prefixes in content:** Lines like ` 1|## 2026-04-15` indicate `read_file` output was accidentally pasted into the file
2. **Missing sections:** A recent commit's entry disappeared from `log.md` after a merge
3. **Orphaned headers:** `### Updated Entity Pages` without its parent `## YYYY-MM-DD — Title`
4. **Missing `---` separators:** Sections run together without dividers
### Recovery procedure
① **Check for git conflict markers first:**
```bash
grep -n "<<<<<<\|======\|>>>>>>" wiki/log.md
If found, this is a standard merge conflict — resolve with standard git workflow.
② If no conflict markers but content is missing, find the lost content in git history:
# Find the commit that originally added the missing section
git log --all --oneline -- wiki/log.md | grep -i "<keyword>"
# Recover the content from that commit
git show <commit-hash>:wiki/log.md
③ Fix line number corruption: Remove lines matching the pattern ^\s*\d+| that were accidentally embedded:
import re
with open("wiki/log.md") as f:
content = f.read()
# Remove corrupted line-number-prefixed lines
lines = content.split('\n')
cleaned = [line for line in lines if not re.match(r'^\s*\d+|##', line)]
④ Restore missing section headers: If a section lost its ## YYYY-MM-DD — Title header (common when line-number corruption removes it), add it back before the orphaned ### subsections.
⑤ Add missing --- separators: Ensure sections are properly delimited.
⑥ Verify the fix:
# Check no more corruption patterns
grep -c "^\s*\d+|" wiki/log.md
# Verify section count makes sense
grep -c "^## 2026" wiki/log.md
⑦ Commit and push:
cd ~/ai-topics && git add wiki/log.md && git commit -m "fix: restore <section> and remove line number corruption" && git push
Prevention
- Never paste
read_fileoutput directly into wiki files — the line number prefixes (1|) corrupt the markdown - Do NOT use
read_file().contentprogrammatically for file modification — the returnedcontentfield already contains line-number prefixes.result = read_file(path); lines = result["content"].split("\\n")and writing those back embeds prefixes in the file. Instead, always use direct Python file I/O:with open(path) as f: content = f.read()orlines = f.readlines(). This is the same reasonpatchis preferred overread_file+write_filefor edits — see "Patch-Based Index Insertion" for the correct approach. - After merge operations, always verify
log.mdstructure — merges can silently drop sections - Use
git show <commit>:pathto recover lost content rather than reconstructing from memory
Working with the Wiki
Searching
# Find pages by content
search_files "transformer" path="$WIKI" file_glob="*.md"
# Find pages by filename
search_files "*.md" target="files" path="$WIKI"
# Find pages by tag
search_files "tags:.*alignment" path="$WIKI" file_glob="*.md"
# Recent activity
read_file "$WIKI/log.md" offset=<last 20 lines>
Bulk Ingest
When ingesting multiple sources at once, batch the updates:
- Read all sources first
- Identify all entities and concepts across all sources
- Check existing pages for all of them (one search pass, not N)
- Create/update pages in one pass (avoids redundant updates)
- Update index.md once at the end
- Write a single log entry covering the batch
Batch page creation via execute_code: When creating 3+ wiki pages in a session,
consolidate the file writes into a single execute_code invocation that writes all pages
(raw articles, entity pages, concept pages) in one Python script. This is more token-efficient
than N separate write_file calls. Use Python's os.makedirs(exist_ok=True) for directory
creation and open(path, 'w') for each file. After the batch, verify all files exist with
os.path.exists() checks in the same script. Similarly, batch index insertions by issuing
multiple patch calls in parallel (they target different sections and don't conflict).
Batch enrichment via execute_code (preferred over N × write_file): When enriching
N existing entity/concept pages with new sections (from raw articles), use a SINGLE
execute_code call with Python's with open(path) as f: content = f.read() for reading
each file, apply modifications (string replacements, section insertions, updated date bumps),
then with open(path, 'w') as f: f.write(content). This pattern:
- Avoids N separate
write_filecalls (token overhead) - Lets you verify all files in one
print()summary at the end - Supports cross-file consistency checks (same
updateddate, consistent source references) - Safety: always read first to check existing content and avoid duplicate sections.
Use
if 'target section header' not in content:guards to prevent idempotency errors. - After writing, print file sizes and line counts to verify no truncation occurred.
- Works for any number of files — in production tested with 5 simultaneous enrichments (Harvey SOC, Petra Donka Buzz, Glean MCP eval, Hex repos, Decagon agent eng).
⚠️ Split strategy when monolithic execute_code fails: When the enrichment script grows large (10+ files, nested strings with quotes), Python syntax errors from complex .replace() calls with escaped quotes are common. Also fails on large dict literals with multi-line f-string values — the parser loses track of closing braces, producing SyntaxError: '{' was never closed. This is a distinct failure mode from quote escaping; the fix is the same split strategy below. Recovery pattern:
- New pages first: Write brand-new entity/concept pages via parallel
write_filecalls — these are simple, independent, and never trigger quote-escape issues. - Enrichments second: Handle existing-page enrichments in a separate
execute_codecall that reads + modifies + writes each file using Python'swith open()pattern. - Why this works:
write_filebypasses the Python-internal quoting problem entirely for new content, whileexecute_codemodifications avoid it by never placing the full content string inside Python string literals — they read from disk instead. - After both batches complete, verify all files exist and update index.md + log.md in a final step.
⚠️ Markdown table pipe prefix mismatch (silent str.replace failure): Before using str.replace() to modify a markdown table row (e.g., adding a feature to a feature table), check whether the file uses single | or double || pipe prefixes for table rows. Files created by different pipelines may differ: some use | Feature | Detail | (single pipe), others || Feature | Detail || (double pipe). A str.replace() targeting || Claude Opus 4.8 | ... || will silently do nothing on a file using single-pipe format. Detection after failure: read_file(offset=N, limit=3) shows the actual prefix. Fix: When the first str.replace fails, use repr() on the target lines to discover the exact pipe prefix. Prevention: Always read the exact table row via read_file before writing the replacement strings — match the actual pipe count, don't assume ||. This was discovered in a 9-file batch enrichment where 3 replacements silently failed due to single-pipe vs double-pipe mismatch.
⚠️ Frontmatter updated field may be entirely absent: Some older entity pages (especially those created by the blog-wiki pipeline or early agent sessions) have a created field but NO updated field. A str.replace() targeting "updated: 2026-05-11" will silently fail — there's nothing to replace. Detection: Before running the batch, check whether "updated:" exists in each target file's frontmatter (first 15 lines). Fix: If updated: is absent, add it after created: using content.replace(f"created: {created_date}\n", f"created: {created_date}\nupdated: {today}\n"). If the roles: or education: keys appear between created and where updated should go, the frontmatter has non-standard ordering — handle by inserting after created: regardless. Prevention: Include a pre-flight check in every batch enrichment: read the first 15 lines of each target file and assert the presence of "updated:" — or add it if missing.
⚠️ Section insertion anchor mismatch — heading vs list item: When inserting new content before an existing section, verify the anchor text is actually a heading (### Section Title) and not a list item (- **Section Title**:). These look identical at a glance but have different match behavior. Common in entity pages where feature descriptions use - **Feature Name**: Description format rather than ### Feature Name. A str.replace() targeting ### Mid-conversation system messages: will silently fail against the actual - **Mid-conversation system messages**:. Detection after failure: read_file(offset=N, limit=3) reveals the actual line prefix. Prevention: Read the exact 3 lines around the insertion point before writing the replacement string; copy the exact prefix (including dash, stars, indent level) for the anchor.
Archiving
When content is fully superseded or the domain scope changes:
- Create
_archive/directory if it doesn't exist - Move the page to
_archive/with its original path (e.g.,_archive/entities/old-page.md) - Remove from
index.md - Update any pages that linked to it — replace wikilink with plain text + "(archived)"
- Log the archive action
Obsidian Integration
The wiki directory works as an Obsidian vault out of the box:
[[wikilinks]]render as clickable links- Graph View visualizes the knowledge network
- YAML frontmatter powers Dataview queries
- The
raw/assets/folder holds images referenced via![[image.png]]
For best results:
- Set Obsidian's attachment folder to
raw/assets/ - Enable "Wikilinks" in Obsidian settings (usually on by default)
- Install Dataview plugin for queries like
TABLE tags FROM "entities" WHERE contains(tags, "company")
If using the Obsidian skill alongside this one, set OBSIDIAN_VAULT_PATH to the
same directory as the wiki path.
Obsidian Headless (servers and headless machines)
On machines without a display, use obsidian-headless instead of the desktop app.
It syncs vaults via Obsidian Sync without a GUI — perfect for agents running on
servers that write to the wiki while Obsidian desktop reads it on another device.
Setup:
# Requires Node.js 22+
npm install -g obsidian-headless
# Login (requires Obsidian account with Sync subscription)
ob login --email <email> --password '<password>'
# Create a remote vault for the wiki
ob sync-create-remote --name "LLM Wiki"
# Connect the wiki directory to the vault
cd ~/wiki
ob sync-setup --vault "<vault-id>"
# Initial sync
ob sync
# Continuous sync (foreground — use systemd for background)
ob sync --continuous
Continuous background sync via systemd:
# ~/.config/systemd/user/obsidian-wiki-sync.service
[Unit]
Description=Obsidian LLM Wiki Sync
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=/path/to/ob sync --continuous
WorkingDirectory=/home/user/wiki
Restart=on-failure
RestartSec=10
[Install]
WantedBy=default.target
systemctl --user daemon-reload
systemctl --user enable --now obsidian-wiki-sync
# Enable linger so sync survives logout:
sudo loginctl enable-linger $USER
This lets the agent write to ~/wiki on a server while you browse the same
vault in Obsidian on your laptop/phone — changes appear within seconds.
Pitfalls
TAG GATE (HARD RULE — TRIGGERS ON EVERY WIKI WRITE) — When creating or updating ANY wiki page (entity, concept, comparison, query), every tag in the frontmatter MUST come from
wiki/SCHEMA.md's Tag Taxonomy section. Before writing a page: (1) read SCHEMA.md's taxonomy, (2) ensure every tag you plan to use exists there. If you need a genuinely new tag category, add it to SCHEMA.md FIRST, then use it. NEVER use ad-hoc tags. Composite kebab-case tags (5+ hyphen-joined words likecognition-devin-memory-tool-claude-code) are ALWAYS errors — decompose them into individual canonical tags. A pre-commit hook blocks commits with non-SCHEMA tags; if you see aTAG TAXONOMY VIOLATIONblock, fix the tags before committing.|pipe prefix corruption in log.md from patch — When usingpatchto append entries tolog.md, if theold_stringornew_stringcontains a trailing|character (from markdown table formatting or clipboard artifacts), that|gets written into the file as a standalone line. This produces entries like:- Total pages: 1834 | ## [2026-05-13] dreaming | ...Detection:
grep -n '^|$' wiki/log.mdfinds standalone pipe lines. Fix:sed -i '/^|$/d' wiki/log.mdor Pythoncontent = re.sub(r'\n\|\n', '\n', content). After fixing, verify withgrep -c '^|$' wiki/log.mdreturns 0. To prevent, never include bare|at the end ofnew_stringin patch calls — the|suffix from line-number-prefixed read_file output is the most common source.In-page content duplication (repeated sections in same file) — Entity pages can develop repeated sections when multiple ingest pipelines append the same topic (e.g., "The Mismeasure of Open Source" appearing 3 times in
entities/andrew-nesbitt.md). This happens when different pipelines enrich the same page without checking whether the section already exists. Detection:grep -c '^###'on a page to flag unusually high section counts for its subject breadth. For specific topics, search for repeated header strings. Fix: Use Python to find duplicate section boundaries:import re with open(path) as f: content = f.read() # Find duplicate sections by header text headers = re.findall(r'^### .+', content, re.MULTILINE) seen = {} for h in headers: if h in seen: print(f"DUPLICATE: {h}") seen[h] = True # Remove all but first occurrence of each duplicate section # Find the start of each duplicate (after first) and remove to next headerPrevention pipeline: When enriching an existing page with blog-post-specific content, always search the page for the article's title or key phrases first (
grep -ci "exact article title") before adding a new section. This applies to both automated cron pipelines and manual enrichments.SOURCES GATE (HARD RULE — TRIGGERS ON EVERY WIKI WRITE) — Every wiki page MUST have a
sources:field in its frontmatter, even if empty (sources: []). When creating a page from a raw article, setsources: [raw/articles/<filename>]. When creating from web extraction or synthesis, list the URLs:sources: [https://example.com/article]. Never omit the sources field — this was the single largest frontmatter gap discovered in 2026-05-13 audit (770+ of 810 broken pages were missingsources). If unsure, usesources: []rather than omitting it entirely.Never modify files in
raw/— sources are immutable. Corrections go in wiki pages.Always orient first — read SCHEMA + index + recent log before any operation in a new session. Skipping this causes duplicates and missed cross-references.
Always update index.md and log.md — skipping this makes the wiki degrade. These are the navigational backbone.
Don't create pages for passing mentions — follow the Page Thresholds in SCHEMA.md. A name appearing once in a footnote doesn't warrant an entity page.
Don't create pages without cross-references — isolated pages are invisible. Every page must link to at least 2 other pages.
Frontmatter is required — it enables search, filtering, and staleness detection.
Tags must come from the taxonomy — freeform tags decay into noise. Add new tags to SCHEMA.md first, then use them.
Keep pages scannable — a wiki page should be readable in 30 seconds. Split pages over 200 lines. Move detailed analysis to dedicated deep-dive pages.
Ask before mass-updating — if an ingest would touch 10+ existing pages, confirm the scope with the user first.
Rotate the log — when log.md exceeds 500 entries, rename it
log-YYYY.mdand start fresh. The agent should check log size during lint.Handle contradictions explicitly — don't silently overwrite. Note both claims with dates, mark in frontmatter, flag for user review.
execute_code file writes are sandboxed — use write_file for wiki pages:
execute_codewith Python'sopen(path, 'w')creates files in a temporary sandbox, NOT on the real filesystem. The script will report "CREATED: entities/foo.md" andos.path.exists()returns True, but the file won't exist when you trygit addorlsfrom terminal. For creating or modifying wiki pages, always usewrite_filewith absolute canonical paths (/opt/data/ai-topics/wiki/...). execute_code is still fine for reading, processing, analysis, and cross-referencing — just not for writing wiki files. This also applies to raw article files underwiki/raw/articles/. Recovery: after discovering the files are missing, usewrite_filewith the same content to recreate them on the real filesystem.Verify before claiming completion — always check that files actually exist after delegate_task or any async operation before telling the user work is done. Use
search_filesto confirm.Don't over-promise on article counts — when analyzing a large backlog (2,000+ articles), report the filtered count and processing plan first, don't claim pages are created until they're verified.
Subagent iteration budgets are real — max_iterations=20 allows ~5-7 page creations. For larger batches, split across multiple subagents or increase max_iterations.
search_files is unreliable for wiki directory discovery — it may return 0 results for files that definitely exist (e.g., entity pages like
simon-willison.md). Use Pythonos.walk()oros.path.exists()via execute_code instead when locating specific wiki pages.Entity file-to-identity check — slugs are ambiguous:
os.path.exists("pi.md")returning true does NOT mean Pinecone has an entity page — it could be the Pi coding agent. Similarly,contextarena.mdis a benchmark, not Arena the company. When checking whether an entity already has a page, always read the page's title/overview (not just check file existence) to confirm the slug maps to the expected entity. File-name collisions between unrelated entities sharing similar names are common in large wikis.Slug near-matches during cross-reference: when checking for duplicates via
os.path.exists(), a candidate slug likenvidia-nemotron-3won't match the canonicalnvidia-nemotron-3-nano-omni. The cross-reference script must also scanindex.mdfor near-match entries — grep for the candidate's core slug (e.g.,nvidia-nemotron) across the index before creating a new page. The index often has the canonical full name while your simplified candidate slug misses it. Usesearch_fileson the index with a substring of the candidate name as a secondary check.|-pipe table syntax corruption — automated tools or scripts can renderindex.mdentries as markdown table rows (|- [[entities/foo]] — summary). This corrupts ~200+ lines at scale. Detection:grep -c '^|- \[\[' wiki/index.md. Fix with Python (one-shot, all lines):import re with open("wiki/index.md") as f: content = f.read() fixed = re.sub(r'^\|-\\s+\[\[(?:entities|concepts|comparisons|queries)/', lambda m: '- [[' + m.group(0)[4:], content, flags=re.MULTILINE) with open("wiki/index.md", 'w') as f: f.write(fixed)Post-fix verification: commit must say
✓ wiki/index.md clean (N lines)and NOT flag anypipe-prefixed list itemwarnings.Index header count math is unreliable — "Total pages: N" frequently diverges from actual filesystem count (discrepancies of 600+ observed at 1,754 pages; Concepts section showed 459 vs actual 1,064). "Full entries" + "Stubs" may not sum to "Total". Always verify with
os.walk()orls ~/wiki/concepts/*.md | wc -lon the filesystem, never trust the header alone. Severity at scale: At 5,000+ pages, discrepancies compound further.Extension-less wikilinks to
_index.md— links like[[concepts/_index]]or[[agentic-engineering/_index]]will NOT resolve on case-sensitive filesystems. The target file is_index.md, not_index. Fix: append.mdor use[[agentic-engineering/_index|_index.md]]syntax.Log prepending via
patchis dangerous — two failure modes:① Header swallowing (this session): When prepending a new log entry before an existing
## [YYYY-MM-DD]entry, using that entry's header line as theold_stringanchor will silently delete the anchor header. Thepatchreplaces the matched text withnew_string, so the existing entry's header disappears and its sub-sections (### Pages Created,### Pages Updated) become orphaned under your new entry. Example of what happens: the file goes from## [2026-05-14] enrich | new entry\n\n### Pages Updated\n...to having the original## [2026-05-14] ingest | existing entryline gone entirely, with### Pages Createdand### Pages Updatednow misattributed to your enrich entry.② Duplication: When
new_stringaccidentally includes a second copy of the anchor line, both copies are written (already documented).Correct approach for log prepending: Do NOT use
patch— useexecute_codewith Python to insert lines at a known position. Pattern:with open(path) as f: lines = f.readlines() # Insert new entry after the header block # ⚠️ CRITICAL: Find where the header ends, do NOT hardcode line 7. # The # Wiki Log header + metadata lines are followed by a blank line, # then the first ## [YYYY-MM-DD] chronological entry. # Find the first ## [ entry to split header from chrono content: chrono_start = None for i, line in enumerate(lines): # Handle both ## [YYYY-MM-DD] and ## YYYY-MM-DD formats (mixed format) import re if line.startswith('## ['): chrono_start = i break if line.startswith('## ') and re.match(r'## \d{4}-\d{2}-\d{2}', line): chrono_start = i break # If still None, fall back to first ## line (edge case: no date-entries yet) if chrono_start is None: for i, line in enumerate(lines): if line.startswith('## '): chrono_start = i break # ⚠️ CRITICAL: Do NOT filter chrono_start by date. The first `## [` in the file is # the correct split point, even if it shares today's date. Adding `and not # line.startswith('## [2026-05-24]')` would skip same-day entries and place YOUR # new entry BELOW them instead of at the top of the chronological section, breaking # newest-first ordering. The chrono_start loop must be a simple first-match search # with no date-based exclusions. # header_block = lines[:chrono_start], chrono_entries = lines[chrono_start:] # ⚠️ AFTER writing, ALWAYS verify the header survived: # with open(path) as f: first_line = f.readline() # assert first_line.strip() == '# Wiki Log', f"Header buried! Got: {first_line.strip()}" header_block = lines[:chrono_start] new_entry = ["## [YYYY-MM-DD] enrich | subject\n", "\n", "### Pages Updated\n", ...] chrono_entries = lines[chrono_start:] lines = header_block + new_entry + ["\n", "---\n", "\n"] + chrono_entries with open(path, 'w') as f: f.writelines(lines)⚠️ WRITELINES TRAP:
f.readlines()vscontent.split('\n')— The code above usesf.readlines()which preserves trailing\non each line. If you instead usecontent.split('\n')to read the file, trailing newlines are stripped. Passing that stripped list tof.writelines()produces a corrupted file where lines merge together (e.g.# Wiki Log> Chronological record...Append-only.## [YYYY-MM-DD]). Usef.readlines()for readability in log prepending OR use'\n'.join()+ a singlef.write()call when working fromsplit('\n')output. The post-write assertion (assert f.readline().strip() == '# Wiki Log') catches this corruption immediately.Recovery from writelines/split-newline corruption: When the header merges into a single line, try one of these fix patterns:
# Light fix for simple merged-header cases content = content.replace('# Wiki Log>', '# Wiki Log\n>') content = content.replace('Append-only.##', 'Append-only.\n\n##') # Full reconstruction from git (for complex corruption) import subprocess result = subprocess.run(["git", "show", "HEAD:wiki/log.md"], capture_output=True, text=True, timeout=10)After fixing, verify with
head -4 wiki/log.mdthat# Wiki Log, the metadata line, a blank line, and## [YYYY-MM-DD]are on separate lines.⚠️ FAILURE MODE: DO NOT use
f.write(new_entry + content)— this simple prepend pattern buries the# Wiki Logheader becausecontentstarts with the header, and prependingnew_entrybefore it pushes the header to line 3+. The resulting file starts with## [YYYY-MM-DD]instead of# Wiki Log, violating the canonical structure. This happened in the 2026-05-20 health-fix session: the header ended up at line 37 and required a multi-step recovery (rescue header from middle of file, reconstruct with correct ordering). Always insert AFTER the header block using header/chrono splitting as shown above.⚠️ CRITICAL sub-pitfall — do NOT use
content.index()for position finding. When usingre.search()to find a match position, do NOT callcontent.index(match.group(2), match.start())to compute the replacement range.content.index()returns the FIRST occurrence of the substring from the given start position — but ifmatch.group(2)is a common pattern like## [YYYY-MM-DD], it will match the very first entry in the file (not the one adjacent to your search match), causing wholesale truncation of everything below the first entry. This was demonstrated in the 2026-05-20 watchdog session: attempting to remove an orphan###line usingcontent.index('## [', match.start())truncated 1,100 lines, requiringgit checkoutrecovery. Always extract the exact position from the match object itself: usematch.end(2) - len(match.group(2))or store the position in a variable from the regex match directly.⚠️ POST-WRITE VERIFICATION: After writing log.md, ALWAYS verify:
with open(path) as f: first_line = f.readline() assert first_line.strip() == '# Wiki Log', f"Header buried! First line: {first_line.strip()}" # Also verify: grep -c '^# Wiki Log' = 1, first ## [ entry follows header⚠️ SUB-PITFALL: Backtick-wrapped
## [in the Format line is achrono_startfalse positive. The> Format: \## [YYYY-MM-DD] action | subject`metadata line contains the literal string## [inside backtick code. When findingchrono_startviacontent.find("## [")orcontent.index("## [", start), this metadata match fires BEFORE any real chronological entry, causing the split to happen inside the backtick span. The result: the Format line is truncated to> Format: `(everything after the first backtick is lost). **Detection**:head -4 wiki/log.mdshows> Format: `instead of> Format: `## [YYYY-MM-DD] action | subject`. **Fix**: Use a regex that excludes backtick-wrapped matches, e.g.,re.search(r'(?<!`)## [', content)or split on the first## [that appears on its own line (after a\n, not after backtick on the Format line). **Prevention**: In thechrono_startdetection loop, skip lines where## [appears inside a backtick code span by checking the line starts with> ` (blockquote prefix used for metadata). Pattern:if line.startswith('## [') and not line.startswith('> '): # Skip > Format: `## [... chrono_start = i breakAfter any log.md write, always verify with
head -4 ~/wiki/log.mdthat the Format line is complete, not truncated.If you must use
patchinstead of execute_code, use a multi-line anchor spanning the gap between entries (e.g., the---separator + next entry's full header + first sub-section line) to ensure uniqueness and prevent swallowing. After any log edit, verify the first ~30 lines withread_file(limit=30)to confirm entries are properly separated.Verify log.md structure after rotation — log rotation scripts can produce a duplicate header section: the first
# Wiki Logheader + rotate entry +---separator, followed by a second# Wiki Logheader + duplicate rotate entry. Detection:grep -c "^# Wiki Log" wiki/log.mdmust return exactly 1. Fix: remove everything from the second# Wiki Logoccurrence to the beginning of the next legitimate## [YYYY-MM-DD]entry. Seewiki-graph-healthskill'sreferences/log-rotation.mdfor the full procedure and Python fix script.Missing
# Wiki Logheader (silent loss) — The entire# Wiki Logheader block (title + metadata lines) can go missing fromlog.mddue to priorexecute_codelog-prepending operations that use wrong line offsets. The file then starts directly with## [YYYY-MM-DD]entries. Detection:head -1 wiki/log.mdshould return# Wiki Log, not a date-stamped header. Recovery: use a largepatchthat inserts the header block + your new entry simultaneously before the first existing entry. Pattern:old_string= the first## [YYYY-MM-DD]entry's header line + first subsection,new_string= full# Wiki Logheader block + your new entry + the existing first entry. This fixes both the missing header and inserts your entry in a single atomic operation. Prevention: never assume line 7 is the insert point forexecute_codelog prepending — read the first 5 lines first to confirm the header is present.Watch for index entry drops when patching — when inserting into
index.mdviapatch, never guess what's between two anchor lines. Theold_stringmust match the file EXACTLY. If you assumerecursive-self-improvementimmediately follows the section header but the file actually hasacceleratein between,patchwill silently dropaccelerateand possibly create duplicates. Useread_filewith the exact offset to confirm the 2-3 lines forming yourold_stringanchor before callingpatch. After every indexpatch, verify the section withread_filebefore proceeding. Seereferences/index-insertion-discovery.mdfor the full pattern and failure case.Triple bracket corruption (
[[[): A variant discovered 2026-05-10 where index entries gain a third opening bracket:[[[concepts/foo]]instead of[[concepts/foo]]. This renders wikilinks unparseable by Obsidian/wiki tools. Detection:grep -c '[[\[' wiki/index.md. Fix:content.replace('[[[', '[[')— always safe since triple brackets are never intentional.Index entry points to wrong directory: Entry says
[[concepts/slug]]but file lives inentities/(or vice versa). Causes entry to appear "missing" from filesystem. Detection: For each[[dir/slug]], checkos.path.exists(wiki/dir/slug.md). If false, check the other namespace. Fix: Replace wrong namespace prefix with correct one. How this happens during active-crawl creation: The agent batch-creates pages across namespaces (e.g.,concepts/gemini-spark.mdandentities/qwen-3-7-max.md), then inserts each index entry next to alphabetically-adjacent entries. Ifgemini-sparkis alphabetically closest to entries in the Entities section (e.g.,gemini-3-5-flash,gemma-4), the agent naturally inserts it there — but writes[[entities/gemini-spark]]instead of[[concepts/gemini-spark]]. Prevention: After batch-creating pages, verify EACH index entry's namespace prefix matches the file's actual directory. In theexecute_codebatch, cross-check:for entry in new_entries: assert entry.startswith(f"[[{actual_dir}/")."Found 2 matches" patch failures due to list-marker inconsistency — when
patchreturns"Found 2 matches for old_string"butsearch_filesshows each line appearing only once, the culprit is often a mix of-(dash space) and|-(pipe dash space) list markers in index.md. Thepatchtool sees both patterns as valid anchor candidates even though they differ. Fix: (1) read the target section withread_file(offset=N, limit=6)to see the exact content characters, (2) use a 4-lineold_string(including one line before and after the target pair) to make the match unique, or (3) run the pipe-table fix recipe first (see|-pipe table syntax corruption pitfall) to normalize all list markers, then retry the patch.Newsletter triage JSON parsing failure — when newsletter-wiki-ingest receives
failed to parse JSON response from newsletter-triage output, the triage output is embedded in a markdown-wrapped cron output file. Fallback: read${HERMES_HOME}/cron/data/newsletter/triage_latest.jsondirectly — this is pure JSON saved by the triage job. If that's missing, grep the.mdoutput file for the JSON block after the## Responseheading. Same pattern for all pipelines — theblog-wiki-ingestanddreaming-wiki-ingestjobs face the identical issue. For blog:${HERMES_HOME}/cron/data/blog_ingest/triage_latest.json. For dreaming:${HERMES_HOME}/cron/data/dreaming/triage_latest.json. The upstream triage job saves JSON to the checkpoint path before the wiki-ingest runs; always read from the checkpoint path, not from the cron output file.web_extractblocked on public URLs (private network error) — some hosting platforms (Quartz, certain CDNs) trigger this. Use the decision tree inreferences/content-extraction-fallbacks.md: search for alternative mirrors → curl raw HTML → try.mdendpoint →browser_navigatelast resort.web_extractreturns only navigation chrome (silent failure) —web_extractreports success but returns only menus/footers/cards with zero article body. Common on JS-rendered official blogs (Google DeepMind, OpenAI, Anthropic). On headless servers without Chrome, useweb_searchtriangulation: query 2-3 angles to find third-party tech news coverage (MarkTechPost, The Register, The Decoder) that reproduces key quotes and data. Seereferences/content-extraction-fallbacks.mdstep 4 for the full pattern and example.Paywalled content is still usable — beehiiv/substack paywalled articles with free previews often contain enough technical claims (model names, chip specs, utilization numbers) to justify wiki updates. Cross-reference with non-paywalled sources. Mark uncertain claims with qualifiers ("reports suggest", "rumored").
web_extractLLM summarization truncation — long articles (especially news articles and Substacks) frequently trigger:LLM summarization timed out. To fix: increase auxiliary.web_extract.timeout in config.yaml, or use a faster auxiliary model. Use browser_navigate for the full page.When this happens, the first ~5,000 characters are still usable for extracting key facts. For richer content, usebrowser_navigateon the URL (produces a full text snapshot including all paragraphs, lists, and quotes) — but note thatbrowser_navigaterequires Chrome/Chromium installed (agent-browser install). On headless servers where Chrome isn't available, fall back to targetedweb_searchqueries for the specific model/product name to supplement the truncated content with specifications, benchmarks, and architecture details from alternative sources. Treatweb_extractas a fast initial pass andbrowser_navigate/web_searchas the fallback for pages that time out. For large arXiv papers (100K+ chars) whereweb_extracttimes out on the HTML tab: download the arXiv HTML version withcurl, extract section headings to understand structure, then use Python regex keyword search to find new/changed content. Full procedure inreferences/large-paper-extraction.md.updateddate bump frequently forgotten during execute_code enrichment — When enriching N existing entity/concept pages viaexecute_codePython string replacements (not viawrite_fileorpatch), theupdateddate in YAML frontmatter is easy to skip becausestr.replace()or ad-hoc section insertion doesn't naturally touch the frontmatter block. Verification stepupdated=Falsein the post-commit checks is the tell: if the entity page'supdatedfield wasn't bumped, the enrichment isn't recorded in the wiki's recency tracking. Fix: include an explicit date bump step in the sameexecute_codeblock — read the frontmatter, find theupdated:line, and replace it with today's date. Also add the bumped line to the file's sources field for proper source tracking. After writing, verifyhas_updated = "updated: YYYY-MM-DD" in contentin the same script. This applies to ALL entity/concept enrichments, not just execute_code — verify theupdatedfield for every page touched before committing.Tag taxonomy drift is inevitable at scale truncated content.
str.replace()failures due to file write-back —execute_codecalls are independent processes. Eachexecute_codecall starts a fresh Python process. If you modifycontentin one call (e.g., viastr.replace()), the next call reads the original file. Always write to disk in the SAMEexecute_codecall:with open(path) as f: content = f.read() content = content.replace(old, new) with open(path, 'w') as f: f.write(content)After writing, verify in the same call:
print(f"Wrote {len(content)} chars").For complex Python scripts with nested quotes, write to a temp file. When the script has multiple
str.replace()calls, f-strings with'''or""", or backtick-wrapped literals (URLs, raw article paths), inlineexecute_codeorterminal("python3 -c ...")can fail withSyntaxErroror bash backtick substitution. Write the script to/tmp/and run viaterminal():write_file(path="/tmp/fix_index.py", content=python_script) terminal("python3 /tmp/fix_index.py")Proven on 20-entry index batch (2026-05-27). kebab-case tags (e.g.,
cognition-devin-memory-tool-claude-code-competitive-analysis) as errors — these should be decomposed into separate tags. Add frequently-used legitimate tags to the taxonomy during lint, don't let them proliferate silently.Tag extraction from SCHEMA.md requires multi-pattern matching — the taxonomy section uses mixed formats: backtick-wrapped inline lists (
model, multimodal, text-generation), bullet points with comma-separated tags, and multi-line continuations. A simple regex like`([a-z-]+)`captures only ~60% of canonical tags. For accurate audits, parse each category header line, extract all backtick-wrapped AND comma-separated tokens, and handle line continuations. If your audit shows 150+ "violations" but SCHEMA.md clearly contains those tags, your extraction logic is too narrow.Case-sensitivity wikilink fix procedure — when
[[concepts/mooncake]]points to a file namedMooncake.mdon a case-sensitive filesystem:- Rename the file:
mv wiki/concepts/Mooncake.md wiki/concepts/mooncase.md - Stage both the deletion and addition:
git add wiki/concepts/Mooncake.md wiki/concepts/mooncake.md(Git needs to see both sides of the rename to track it properly) - Commit with descriptive message:
git commit -m "fix: rename Mooncake.md -> mooncase.md" - Fix all inbound wikilinks:
search_files '\[\[concepts/Mooncake\]\]'→ patch to[[concepts/mooncake]] - Push and verify in Obsidian/wiki tools that links resolve correctly
- Rename the file:
Index header count decays over time —
index.mdheader's "Total pages: N" quickly becomes stale as pages are created. During lint, always verify:len(os.listdir(entities/) + os.listdir(concepts/) + ...) == reported_count. Discrepancies of 100+ pages are common at scale and should be auto-corrected.Case sensitivity in wikilinks —
[[concepts/mooncake]]will NOT resolve toconcepts/Mooncake.mdon case-sensitive filesystems (Linux servers). During broken link checks, try case-insensitive basename matching as a secondary resolution strategy._index.md files are special — subdirectory
_index.mdfiles (e.g.,concepts/_index.md) serve as synthesis hubs and may legitimately lack full frontmatter. During lint, exclude files named_index.mdfrom frontmatter-required checks, or give them a minimal frontmatter withtype: indexandstatus: synthesis.TODO-stub duplicate cleanup — the dreaming pipeline and active-crawl jobs create index entries as TODO placeholders (
> **TODO**: Enrich this page). When real content is later written, the index accumulates duplicate entries: one rich + one TODO stub for the same page. During lint/dedup: (1) group index entries by wikilink target, (2) keep the rich entry (first occurrence), (3) remove all TODO-stub duplicates. Exclude_indexentries — they are legitimate per-directory. Automate with Python: scan for**- TODO**orEnrich this pagein index lines, match by wikilink target, remove duplicates bottom-up to preserve line numbers.Triage candidate_wiki_path may point to wrong namespace or slug — The triage pipeline generates
candidate_wiki_pathsuggestions based on topic keywords, not entity/concept namespace awareness or existing slug conventions. Three failure modes:- Wrong namespace: triage suggests
concepts/Xbutentities/Xexists (or vice versa). Checked by existing pitfall below. - Wrong slug (same namespace): triage suggests
nvidia-ai-q-deep-researchbut the existing page isnvidia-ai-q. Theos.path.exists()check returns False for the triage slug, so you'd create a duplicate. - Combined: triage suggests
concepts/nvidia-ai-q-deep-researchbut bothconcepts/nvidia-ai-q.mdANDentities/nvidia-ai-q.mdexist with different content breadth. Before creating a page from a triage suggestion: - Search the SAME namespace for near-matches: grep
index.mdfor the candidate's core slug (e.g., searchnvidia-aifor candidatenvidia-ai-q-deep-research). The index may have a canonical entry with a shorter or differently-structured slug. - Also search the OTHER namespace: if the suggestion is
concepts/X, search forentities/X(and vice versa). - Read any near-match pages to confirm whether they cover the topic substantively.
- If a near-match exists, enrich the existing page — merge triage content there and delete your duplicate. Never create a second page for the same entity under a different slug.
- Wrong namespace: triage suggests
Entity vs Concept overlap in index — some topics exist as both
entities/Xandconcepts/X(e.g.,entities/dspy+concepts/dspy). These are NOT duplicates if both files exist with different content. During dedup, only remove entries where the wikilink target is exactly the same string.[[entities/foo]]and[[concepts/foo]]are distinct.Non-alphabetical index drift causes programmatic insertion to fail — at scale (500+ pages), the index is NOT guaranteed to be in strict alphabetical order. Entries like
mit-encompassmay appear aftermoltbook-breach-2026because different agents and pipelines inserted them at different times. Any algorithm that scans for "first alphabetically-greater entry" will find the wrong insertion point in a non-alphabetical section, placing new entries in wildly incorrect positions (e.g., dumpingmodel-spec-*entries right afteraltman-three-observationsinstead of nearmcp). This creates duplicates that require manual cleanup. Always usepatchwith visually-confirmed anchor lines — read the target section withread_file(offset=N, limit=20), identify the exact 2-3 surrounding lines, and use those as theold_string. Never compute insertion points viaexecute_coderegex scanning.Comparison pages go in
comparisons/, notconcepts/— when ingesting a survey or comparison article that covers multiple items (e.g., "10 RL libraries compared"), the synthesis/comparison page belongs incomparisons/. Individual item pages (one per library/framework) go inconcepts/orentities/. Thetypefrontmatter field must match:type: comparisonfor comparison pages. If the user corrects placement, move the file and update ALL wikilinks across the wiki (search_filesto find references →patch(replace_all=true)), then adjust both section counts inindex.md.Stub enrichment cascade — when creating a new concept or entity page that links to an existing stub page (e.g.,
mixture-of-experts.mdwithstatus: stub), proactively enrich the stub's "Related Pages" section with backlinks to the new pages. This keeps the wiki graph connected even before stubs are fully enriched. Checksearch_filesforstatus: stubon targets before finalizing your batch.Tag violation commit block recovery — when the pre-commit hook blocks with
TAG TAXONOMY VIOLATIONS, the commit output shows the offending files and tags. Recovery procedure:- For each offending tag, check if there's a canonical equivalent in SCHEMA.md's taxonomy (e.g.,
enterprise→company,cost→economics,routing→optimization) - If a canonical equivalent exists: use
patchon each affected file to replace the non-canonical tag - If the tag is genuinely missing from SCHEMA.md (legitimate new tag used by an existing entity/concept): add it to the correct category in SCHEMA.md FIRST, then retry the commit. When patching SCHEMA.md, read the target section with
read_file(offset=N, limit=5)to get the exact characters — a common pitfall is turning- **Cat**:into|- **Cat**:by accidentally prepending a pipe. Verify the line starts with-not|-. - NEVER use
--no-verifyto bypass — this creates silent debt that compounds with every subsequent edit - After fixing all violations,
git addthe patched files and retry the commit
- For each offending tag, check if there's a canonical equivalent in SCHEMA.md's taxonomy (e.g.,
SCHEMA.md pipe prefix corruption from patch fuzzy matching — When using
patchto add tags to SCHEMA.md category lines, thepatchtool's fuzzy matching can convert- **Category**:to| **Category**:(dash-space → pipe-space) if theold_stringuses|(from read_file output) but the file actually has-. The result: category lines become pipe-prefixed markdown table rows instead of list items. This causes the pre-commit tag validator to misparse SCHEMA.md (reporting 298 tags instead of 460+, flagging legitimate tags as violations). Detection:grep -n '^| \*\*' wiki/SCHEMA.mdfinds corrupted lines. Fix:re.sub(r'^\| (\*\*)', r'- \1', content, flags=re.MULTILINE)via execute_code. Prevention: Always copy the exact line prefix from the file usingread_file(offset=N, limit=5)— never reconstruct the prefix from memory. The line-prefix character is-(dash space) for markdown list items, never|(pipe space). After any SCHEMA.md patch, verify no pipe-prefixed category lines withgrep -c '^| \*\*'. This is a distinct failure mode from the|-index.md pipe-table corruption — it affects list markup, not table syntax. Common mapping table for tags that tend to be missing from SCHEMA.md:Offending tag Canonical replacement claudeanthropic(product name, not a tag category)codexopenaienterprisecompanyorgovernancesandboxingsandboxtelemetryobservabilitycompliancegovernanceevalsevaluationcosteconomicsoroptimizationidentitysecurityor remove (overly broad)authorizationsecuritygptopenairoutingoptimizationorinferenceagent-orchestrationmulti-agentororchestrationstreamingstreaming(add to SCHEMA.md Techniques if missing)aiai(add to SCHEMA.md Meta if missing)web-developmentweb-development(add to SCHEMA.md Engineering if missing)developer-experiencedeveloper-experience(add to SCHEMA.md Engineering if missing)developer-toolsdeveloper-toolingagents-sdkai-agentsdeploymentdevopsorinfrastructuremodel-servinginferenceself-hostinginfrastructureorsecurityaudio-generationmultimodalorvoice-aicreativeremove (too generic; use existing domain tag) togethercompanysearch-apisearchorapiweb-searchsearchweb-monitoringsearchorobservabilitygoogle-alertsgoogleorsearchbing-apiapiortoolprogrammatic-monitoringautomationorobservabilitybrand-monitoringsearchai-agent-infrastructureai-infrastructuresoc2securityorgovernancezdrremove (unknown acronym, not a valid tag) gemmagoogle(product name, not a tag category)growtheconomicsor remove (too generic; use domain tag)governmentgovernanceorpolicycompetitorremove (not a useful category; use ecosystemif needed)speech-to-textvoice-aiormultimodaldecision-theoryrationalityoragent-foundationsproductivityautomationor remove (too generic)- Duplicate tag creation during normalization: when replacing a non-canonical tag with a canonical one (e.g.,
agents-sdk→ai-agents), check whether the canonical tag already exists in the page's tag list. Apatchthat replacesagents-sdkwithai-agentswhenai-agentsis already present will create a duplicate. Before patching, read the file's frontmatter to see which tags are already there; if the canonical replacement is present, remove the offending tag line entirely rather than replacing it.
- Duplicate tag creation during normalization: when replacing a non-canonical tag with a canonical one (e.g.,
Tag format mismatch — inline bracket vs YAML list: Wiki pages use two different tag formats in frontmatter. Inline bracket:
tags: [concept, mcp, agent-tooling, protocol]. YAML list with dashes:tags:\n - concept\n - mcp\n - agent-tooling\n. Astr.replace()targeting" - agent-tooling\n"will silently fail on a file using inline bracket format — the replacement string doesn't match, the file is unchanged, and the commit is blocked again with the same error. Always read the file's tag line first to determine which format it uses. For inline format, replace"agent-tooling, "with"developer-tooling, ". For YAML list format, replace" - agent-tooling\n"with" - developer-tooling\n". After fixing, verify withsearch_filesthat the offending tag no longer appears.- Pre-existing staged-file violations also block commit — when the pre-commit hook blocks with
TAG TAXONOMY VIOLATIONS, the offending files often include pages from OTHER pipelines that were already staged (e.g., blog-ingest or parallel pipelines). Do NOT fix only your own pages and retry — scan ALL files listed in the violation output and fix every one. The hook scans the entire staging area, not just your changes. Missing even one violation will cause another block on retry. Pattern:git status --shortafter staging to understand the full scope, then fix all tagged files before recommitting.
- Pre-existing staged-file violations also block commit — when the pre-commit hook blocks with
Patch silently drops lines not in old_string: when using
patchto update frontmatter or any multi-line block, theold_stringmust include EVERY line between the first and last anchor lines, not just the lines you intend to change. Ifold_stringskips a line (e.g., thealiasesline betweentypeandcreatedin frontmatter), that line is silently dropped from the file. Always read the target block first and include the complete stretch of lines inold_string. After patching frontmatter, verify the first ~10 lines withread_file(offset=1, limit=10)to ensure no fields were lost.Patch can create duplicate frontmatter fields — the inverse of the above pitfall. When using
patchto add fields (e.g., newtags:entries orsources:) to a frontmatter block, check whether those fields already exist above the matched block. If the file hastags:beforecreated:and yourold_stringmatchesupdated:throughsources:, addingtags:innew_stringwill create a secondtags:block belowsources:— YAML parsers may handle duplicate keys unpredictably and the pre-commit hook may flag it. Fix: read the full frontmatter first (not just the targeted area), confirm no field you plan to add already exists above, and if it does, expand yourold_stringto include the existing block sopatchcan replace it in-place. Always verify withread_file(limit=15)after any frontmatterpatch.Patch silently removes blank line between list item and heading — When using
patchto replace the LAST item in a bullet list, if the next line is a heading (## References) with no blank separator,patchdoes NOT add the missing blank line. The heading gets visually attached:- last item\n## Referencesinstead of- last item\n\n## References. Detection:read_fileshows a heading immediately after a list item. Fix: include\nbefore the heading innew_string:"- [[entities/x]] — description\n\n## References". Prevention: whenold_stringincludes the last list item + the heading, always insert\nbetween them innew_string. After anypatchon list boundaries, verify withread_file.
Lint Diagnostic Tools
The skill includes comprehensive diagnostic scripts at references/wiki-lint-diagnostics.md that:
- Detects broken wikilinks and orphan pages
- Identifies pipe table corruption in index.md
- Detects line-number prefix corruption in index.md (CRITICAL at scale)
- Validates frontmatter completeness
- Tracks tag taxonomy compliance
- Provides repair scripts for common issues
- Verifies index section completeness (Entities/Concepts/Comparisons/Queries)
Run these diagnostics during every lint session to maintain wiki health.
Lint Priority Order (from 2026-05-08 findings)
When running lint checks, execute in this order for maximum impact:
- Index corruption scan — check for
^\s*\d+|pattern in index.md. At 5,000+ pages, 80%+ of lines may be corrupted. - Section completeness — verify all 4 sections (Entities, Concepts, Comparisons, Queries) exist with proper headers.
- File count reconciliation —
os.walk()counts vs index header numbers. - Broken wikilink analysis — separate missing files from case mismatches from naming variations.
- Tag taxonomy audit — flag one-off tags, composite kebab tags, and tags not in SCHEMA.md.
- Frontmatter validation — check for missing required fields.
- Page size analysis — flag pages over 200 lines.
- Orphan page detection — pages with no inbound wikilinks.
- Index-to-filesystem gap — files on disk not listed in index.md (see pitfall below).
Index-to-Filesystem Reconciliation (CRITICAL at scale)
At 1,851 L2 pages, the gap between files on disk and index entries grows rapidly.
The wiki_health.py --json script reports orphan counts, but does NOT report files
missing from index.md. Run this reconciliation during every graph analysis:
import os, re
wiki = '/opt/data/wiki'
with open(os.path.join(wiki, 'index.md')) as f:
index_content = f.read()
# Extract all wikilinks from index
index_entries = re.findall(r'\[\[(entities|concepts|comparisons|queries|events)/([^|\]]+)', index_content)
indexed = set(f"{cat}/{slug.strip()}" for cat, slug in index_entries)
# Find all files on disk
all_files = set()
for cat in ['entities', 'concepts', 'comparisons', 'queries', 'events']:
d = os.path.join(wiki, cat)
if os.path.isdir(d):
for f in os.listdir(d):
if f.endswith('.md') and not f.startswith('_index'):
all_files.add(f"{cat}/{f.replace('.md','')}")
# Report gaps
not_indexed = all_files - indexed
not_on_disk = indexed - all_files
print(f"Files not in index: {len(not_indexed)}")
print(f"Index entries with no file: {len(not_on_disk)}")
Typical findings at scale (May 2026 baseline):
- 881 files not indexed (47% of L2) — mostly from subagent batches that skipped index update
- Events section often has files on disk but zero index entries — pipeline gap
- Concepts section has the largest gap (875 unindexed out of 1,242)
Fix strategy:
- For date-stamped news concepts (e.g.,
2026-04-24-gpt-5-5-chatgpt-images-2-0): batch archive to_archive/— these aren't durable concepts - For legitimate concepts: add index entries alphabetically per section
- For Events: ensure the section header exists and entries are added
- Never add all 875 at once — batch in groups of 50-100 and verify after each
Entity Duplicate Detection via Name Normalization
Entity duplicates accumulate from different pipelines using different slug conventions (hyphenated vs. not, blog URL vs. person name). Detect them with hyphen-stripping normalization:
from collections import defaultdict
duplicates = defaultdict(list)
for f in os.listdir(os.path.join(wiki, 'entities')):
if f.endswith('.md'):
normalized = f.replace('.md', '').lower().replace('-', '').replace('_', '')
duplicates[normalized].append(f.replace('.md', ''))
for key, files in sorted(duplicates.items()):
if len(files) > 1:
print(f"DUPLICATE: '{key}' → {files}")
Confirmed duplicates (May 2026):
deliberate-coder/deliberatecodereugene-yan/eugeneyanlilian-weng/lilianwengsamuel-colvin/samuelcolvin
Merge procedure:
- Pick canonical slug (prefer hyphenated form:
eugene-yan) - Read content from non-canonical page, merge into canonical
- Delete non-canonical file:
git rm wiki/entities/duplicate.md - Update all inbound wikilinks:
search_files '\[\[entities/duplicate\]\]'→ patch to canonical - Remove duplicate from index.md
- Commit and push
Mid-ingest variant: When you discover duplicates WHILE ingesting an article (not during a lint pass), follow the workflow in references/entity-consolidation-during-ingest.md. The key differences from lint-driven dedup are: (1) you already have the new article content to merge in, (2) you need to delete the new half-written page along with old duplicates, (3) git rename detection may conflate the delete+create with a rename — verify file history is preserved.
Index Deduplication
When maintaining a wiki at scale (1000+ pages), the index accumulates duplicate entries. Common causes: dreaming pipeline creating TODO stubs before pages exist, multiple ingest jobs creating entries for the same topic, or active-crawl jobs not checking existing coverage.
Automated Dedup Script
# Run via execute_code to identify and remove duplicates
import os, re
from collections import defaultdict
index_path = "/opt/data/ai-topics/wiki/index.md"
with open(index_path) as f:
lines = f.readlines()
# Group entries by wikilink target
entries = defaultdict(list)
for i, line in enumerate(lines):
match = re.match(r'- \[\[(.*?)(?:\||\])', line.strip())
if match:
key = match.group(1)
if key.endswith('/_index'):
continue # Legitimate per-directory entries
entries[key].append(i)
# Identify duplicates and remove TODO stubs
lines_to_remove = set()
for key, indices in entries.items():
if len(indices) > 1:
todo_lines = [i for i in indices if '**TODO**' in lines[i] or 'Enrich this page' in lines[i]]
rich_lines = [i for i in indices if i not in todo_lines]
# Keep first rich entry, remove all others
if rich_lines:
lines_to_remove.update(rich_lines[1:])
lines_to_remove.update(todo_lines)
# Remove bottom-up to preserve line numbers
for idx in sorted(lines_to_remove, reverse=True):
lines.pop(idx)
with open(index_path, 'w') as f:
f.writelines(lines)
print(f"Removed {len(lines_to_remove)} duplicate entries")
Dedup Rules
- Entity vs Concept overlap:
[[entities/foo]]and[[concepts/foo]]are NOT duplicates if both files exist with different content. Only remove entries where the wikilink target is exactly the same string. - _index exclusion: Subdirectory
_indexentries are legitimate and should never be removed during dedup. - Bottom-up removal: Always remove lines from highest index to lowest to preserve line numbers for remaining entries.
- Post-dedup verification: After dedup, verify the remaining entries point to actual files using
os.path.exists().
False-Positive Trap: Inline Cross-References in Descriptions
⚠️ Warning: The automated dedup script above is correct (it uses re.match() which captures only the primary wikilink at the start of each index line). However, if you write custom duplicate-detection code using re.findall() or re.finditer() across all wikilinks in each line, inline cross-references within description text will be falsely counted as duplicates.
How this happens: An entity page entry like - [[entities/andrew-chen]] — Andrew Chen — ... [[concepts/local-ai]] has TWO wikilinks. The primary is entities/andrew-chen, the inline [[concepts/local-ai]] is a cross-reference. A naive re.findall() scan would count entities/andrew-chen at this line AND also at line - [[concepts/local-ai]] — ... [[entities/andrew-chen]] (where it appears as an inline cross-ref in the local-ai entry). This produces 29+ false "duplicates" across a 1,300-line index.
Prevention — always filter by PRIMARY wikilink only:
# WRONG — catches ALL wikilinks in each line (inline cross-refs included)
all_links = re.findall(r'\[\[(entities|concepts)/([^|\]]+)\]\]', content)
link_counts = Counter(f"{cat}/{slug}" for cat, slug in all_links)
# 29 "duplicates" found — most are false positives
# CORRECT — only count the PRIMARY link (first wikilink in a line-starting entry)
for line in lines:
m = re.match(r'- \[\[(entities|concepts|comparisons|queries|events)/([^|\]]+)', line.strip())
if m:
key = f"{m.group(1)}/{m.group(2)}"
entry_positions[key].append((i, line))
# 0 duplicates found — all were inline cross-references
Verification: If your dedup scan says 25+ duplicates exist but the index looks well-maintained, investigate 3-5 examples. If they all follow the pattern of slug-A appearing in slug-B's description AND vice versa, the "duplicates" are legitimate cross-references. Open 3-5 paired entries and read the actual lines to confirm before accepting the count.
Scale impact: Observed in action on a 1,300-line wiki with 1,068 unique index entries. The re.findall approach flagged 29 duplicates. Investigation revealed ALL 29 were inline cross-references — zero true duplicates existed. Without this check, the dedup script would have removed legitimate cross-reference lines, breaking bidirectional wikilinks between related entity/concept pages.
Related Tools
llm-wiki-compiler is a Node.js CLI that compiles sources into a concept wiki with the same Karpathy inspiration. It's Obsidian-compatible, so users who want a scheduled/CLI-driven compile pipeline can point it at the same vault this skill maintains. Trade-offs: it owns page generation (replaces the agent's judgment on page creation) and is tuned for small corpora. Use this skill when you want agent-in-the-loop curation; use llmwiki when you want batch compile of a source directory.