name: raw-article-filename-policy
description: Naming convention for raw article files in wiki/raw/articles/; always use the actual publication date, not the date of ingestion.
trigger: When saving a raw article to wiki/raw/articles/. Before writing any article file, determine the real publication date and construct the filename accordingly.
Raw Article Filename Policy
Academic Papers (wiki/raw/papers/)
Filename = {YYYY-MM-DD}_{arxiv-id}_{short-title}.md
Where:
- YYYY-MM-DD = the publication date (or first arXiv submission date if exact date unknown)
- arxiv-id = arXiv ID without version suffix (e.g., 2512.24601, NOT 2512.24601v2)
- short-title = 2-5 hyphenated keywords (lowercase, 40 chars max)
For non-arXiv papers (conference proceedings, tech reports without arXiv IDs):
- Filename = {YYYY-MM-DD}_{source}_{short-title}.md
- Where source = acl, emnlp, neurips, icml, meta, deepseek, etc.
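The paper naming rules above can be sketched as a small helper. This is a minimal illustration, not part of the repository: the function names (strip_arxiv_version, short_title, paper_filename) are hypothetical, and the way a long title is shortened (dropping trailing words until it fits) is an assumption, since the policy only states the limits.

```python
import re

ARXIV_ID_RE = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def strip_arxiv_version(identifier):
    """Drop a trailing version suffix from an arXiv ID; leave other sources untouched."""
    match = ARXIV_ID_RE.fullmatch(identifier.strip())
    return match.group(1) if match else identifier.strip()


def short_title(title, max_words=5, max_chars=40):
    """Lowercase, hyphenate, and trim a title to the stated word and length limits."""
    words = re.findall(r"[a-z0-9]+", title.lower())[:max_words]
    slug = "-".join(words)
    # Assumption: shorten by dropping trailing words rather than cutting mid-word.
    while len(slug) > max_chars and len(words) > 1:
        words.pop()
        slug = "-".join(words)
    return slug[:max_chars]


def paper_filename(pub_date, identifier, title):
    """Build {YYYY-MM-DD}_{arxiv-id or source}_{short-title}.md."""
    return f"{pub_date}_{strip_arxiv_version(identifier)}_{short_title(title)}.md"
```

For example, paper_filename("2025-12-31", "2512.24601v2", "Recursive Language Models") yields 2025-12-31_2512.24601_recursive-language-models.md, matching the first row of the examples table.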
Dedup Check (MANDATORY before saving)
Before saving ANY new paper to wiki/raw/papers/, run:
python3 scripts/papers_index.py --check <arxiv-url-or-id>
If it returns DUPLICATE, DO NOT create a new file — update the existing one instead.
If you create a new file, register it: python3 scripts/papers_index.py --add <filename> <url>
The index is at wiki/raw/papers/.papers_index.json and maps identifiers to canonical filenames.
Full pipeline reference: references/papers_index_workflow.md — step-by-step from date discovery through commit/push.
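If the dedup check is driven from Python rather than typed by hand, a thin wrapper along these lines may help. It is a sketch under assumptions: the source only says the script "returns DUPLICATE", so treating that as a substring of stdout is a guess, and the names is_duplicate and register are hypothetical. The run parameter exists only so the wrapper can be exercised without the real script.

```python
import subprocess

INDEX_SCRIPT = "scripts/papers_index.py"  # path taken from the commands above


def is_duplicate(identifier, run=subprocess.run):
    """Return True if the --check output mentions DUPLICATE (assumed stdout format)."""
    result = run(
        ["python3", INDEX_SCRIPT, "--check", identifier],
        capture_output=True,
        text=True,
    )
    return "DUPLICATE" in result.stdout


def register(filename, url, run=subprocess.run):
    """Register a newly created paper file in the index; raises if the script fails."""
    run(["python3", INDEX_SCRIPT, "--add", filename, url], check=True)
```

A caller would update the existing file when is_duplicate(...) is true, and otherwise create the file and call register(...).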
Examples
| Paper | Source Date | Filename |
|---|---|---|
| Recursive Language Models (arXiv:2512.24601) | 2025-12-31 | 2025-12-31_2512.24601_recursive-language-models.md |
| DeepSeek-V3 (arXiv:2412.19437) | 2024-12-27 | 2024-12-27_2412.19437_deepseek-v3-technical-report.md |
| GEPA (arXiv:2507.19457) | 2025-07-25 | 2025-07-25_2507.19457_gepa-reflective-prompt-evolution.md |
| DeepSeek-V4 (HuggingFace) | 2026-04-xx | 2026-04-xx_deepseek-v4-technical-report.md |
Lecture Transcripts (wiki/raw/transcripts/)
Filename = {YYYY-MM-DD}_{source-slug}_{content-slug}-lecture.md
Same date and source-slug rules as articles. Add -lecture suffix to distinguish from the companion slides article.
- Frontmatter: type: transcript, include date (lecture date) and date_ingested
- Always include related_article: pointing to the companion slides article wikilink path
- Example: 2026-06-02_softwaredoug_cheat-at-search-long-running-search-lecture.md
  - Companion: articles/2026-06-02_softwaredoug_cheat-at-search-long-running-search.md
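Because the lecture filename differs from its companion only by the -lecture suffix, the companion path can be derived mechanically. The sketch below assumes exactly that relationship (as in the example above); the function names are hypothetical, and the exact value format expected for related_article is not specified in this policy, so the plain path used here is an assumption.

```python
LECTURE_SUFFIX = "-lecture.md"


def companion_article_path(lecture_filename):
    """Map a lecture transcript filename to its companion slides article path."""
    if not lecture_filename.endswith(LECTURE_SUFFIX):
        raise ValueError(f"not a lecture transcript filename: {lecture_filename}")
    return "articles/" + lecture_filename[: -len(LECTURE_SUFFIX)] + ".md"


def transcript_frontmatter(lecture_filename, lecture_date, date_ingested):
    """Render the frontmatter fields required for a lecture transcript."""
    lines = [
        "---",
        "type: transcript",
        f"date: {lecture_date}",
        f"date_ingested: {date_ingested}",
        # Assumption: the companion is referenced by its plain relative path.
        f"related_article: {companion_article_path(lecture_filename)}",
        "---",
    ]
    return "\n".join(lines)
```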
Original wiki/raw/articles/ Policy
Filename = {YYYY-MM-DD}_{source-slug}_{content-slug}.md
Where:
- YYYY-MM-DD = the ACTUAL publication date of the article (verified from the source, NOT today's date)
- source-slug = abbreviated domain or source name (lowercase, no dots: interconnects, pelayoarbues, simonwillison, arxiv, anthropic)
- content-slug = 2-5 word descriptive title (lowercase, hyphenated, 30 chars max)
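A filename can be checked against this pattern before the file is written. The validator below is a hypothetical sketch: check_article_filename is not an existing script, and it only covers the mechanical rules (shape, valid calendar date, slug length and word count). It cannot tell whether the date is the real publication date.

```python
import re
from datetime import date

FILENAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_([a-z0-9-]+)_([a-z0-9-]+)\.md$")


def check_article_filename(filename):
    """Return a list of rule violations; an empty list means the name looks valid."""
    match = FILENAME_RE.match(filename)
    if not match:
        return ["does not match {YYYY-MM-DD}_{source-slug}_{content-slug}.md"]
    day, _source_slug, content_slug = match.groups()
    problems = []
    try:
        date.fromisoformat(day)
    except ValueError:
        problems.append(f"invalid calendar date: {day}")
    if len(content_slug) > 30:
        problems.append(f"content-slug is {len(content_slug)} chars (max 30)")
    words = [word for word in content_slug.split("-") if word]
    if not 2 <= len(words) <= 5:
        problems.append(f"content-slug has {len(words)} words (expected 2-5)")
    return problems
```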
Transcripts (wiki/raw/transcripts/)
Filename = {YYYY-MM-DD}_{source-slug}_{content-slug}.md (same pattern as articles)
Where:
- YYYY-MM-DD = the lecture/presentation date (not ingestion date)
- source-slug = author handle (e.g., softwaredoug)
- content-slug = descriptive slug
- type: transcript in frontmatter
- See wiki-entity-enrichment-from-article → references/transcript-ingestion.md for full workflow
Examples
| Article | Source Date | Filename |
|---|---|---|
| Nathan Lambert on synthetic data (interconnects.ai) | 2023-11-29 | 2023-11-29_interconnects-llm-synthetic-data.md |
| Pelayo Arbués on dataset engineers (pelayoarbues.com) | 2025-01-16 | 2025-01-16_pelayoarbues-dataset-engineer.md |
| Megadocs synthetic pretraining (arXiv) | 2026-03-19 | 2026-03-19_megadocs-synthetic-pretraining.md |
| Khairallah context engineering course (X Article) | 2026-05-10 | 2026-05-10_engkhairallah_context-engineering-master-course.md |
X Article / X Note Tweet Naming
For X Article and X Note Tweet raw articles:
- source-slug = the X handle without @ (e.g., eng_khairallah1 → engkhairallah1), underscores stripped
- date = created_at from the parent tweet's API response
- Frontmatter must include type: x_article or type: x_note_tweet
- Example: 2026-05-10_engkhairallah_context-engineering-master-course.md
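The handle and date rules for X content can be expressed as two small functions. This is a sketch with hypothetical names. It assumes created_at is an ISO 8601 timestamp whose first ten characters are the calendar date; if the API response uses another format, the date parsing would need to change. Lowercasing follows the general source-slug rule.

```python
from datetime import date


def x_source_slug(handle):
    """Strip the leading @ and any underscores from an X handle, lowercased."""
    return handle.lstrip("@").replace("_", "").lower()


def x_filename_date(created_at):
    """Take the calendar date from an ISO 8601 created_at value (assumed format)."""
    # date.fromisoformat validates that the prefix is a real calendar date.
    return date.fromisoformat(created_at[:10]).isoformat()
```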
How to Find the Real Publication Date
Priority Order
- Meta tags in HTML: <meta property="article:published_time">, <time datetime="...">, JSON-LD datePublished
- Page text: Look for "Published", "Posted", "Updated" dates in article header/footer
- URL patterns: Some blogs embed dates in URL (e.g., /2024/01/15/title)
- RSS feed: <pubDate> or <dc:date> in the feed XML
- Search engine snippet: Google/Bing search results often show publication dates
- Social media cross-post: Check the X/Twitter or LinkedIn post announcing the article. For LinkedIn posts specifically, extract datePublished from JSON-LD SocialMediaPosting; see references/linkedin-date-extraction.md.
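One way to apply the priority order is to collect whatever dates each source yields and take the first one available. The sketch below is illustrative only: the dictionary keys (meta_tags, page_text, and so on) and both function names are hypothetical labels for the six sources listed above, and the URL pattern covers only the /YYYY/MM/DD/ layout shown in the example.

```python
import re
from datetime import date

# Hypothetical labels for the six sources, highest priority first.
PRIORITY = ["meta_tags", "page_text", "url_pattern", "rss_feed", "search_snippet", "social_post"]


def date_from_url(url):
    """Extract a /YYYY/MM/DD/ date embedded in a URL path, or None if absent or invalid."""
    match = re.search(r"/(\d{4})/(\d{2})/(\d{2})(?:/|$)", url)
    if not match:
        return None
    try:
        return date(int(match[1]), int(match[2]), int(match[3])).isoformat()
    except ValueError:
        return None  # digits matched but do not form a real calendar date


def pick_publication_date(candidates):
    """Return (source, date) for the highest-priority source that produced a date."""
    for source in PRIORITY:
        value = candidates.get(source)
        if value:
            return source, value
    return None
```

Returning the source alongside the date makes it easy to record how the date was verified.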
Verification Steps
import re
import urllib.request

url = "https://example.com/article"
# Fetch the page; tolerate pages that are not valid UTF-8
with urllib.request.urlopen(url, timeout=10) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# Check meta tags
for m in re.finditer(r'<meta[^>]*property=["\']article:published_time["\'][^>]*content=["\']([^"\']+)["\']', html):
    print("Meta published:", m.group(1))  # 2023-11-29T...

# Check time tags
for m in re.finditer(r'<time[^>]*datetime=["\']([^"\']+)["\']', html):
    print("Datetime attr:", m.group(1))

# Check JSON-LD
for m in re.finditer(r'"datePublished"\s*:\s*"([^"]+)"', html):
    print("JSON-LD datePublished:", m.group(1))
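The meta-tag regular expression above only matches when the property attribute appears before content. As an alternative, the standard library's html.parser can read the attributes in any order. This is a sketch rather than a replacement for the snippet above; the class and function names are hypothetical, and it does not cover JSON-LD.

```python
from html.parser import HTMLParser


class DateTagParser(HTMLParser):
    """Collect article:published_time meta values and <time datetime> values."""

    def __init__(self):
        super().__init__()
        self.meta_published = []
        self.time_datetimes = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)  # attribute order no longer matters
        if tag == "meta" and attr.get("property") == "article:published_time":
            if attr.get("content"):
                self.meta_published.append(attr["content"])
        elif tag == "time" and attr.get("datetime"):
            self.time_datetimes.append(attr["datetime"])


def find_date_tags(html):
    """Return the date strings found in meta and time tags of an HTML document."""
    parser = DateTagParser()
    parser.feed(html)
    return {"meta": parser.meta_published, "time": parser.time_datetimes}
```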
Raw Article Rename Workflow
When a raw article filename needs correction (wrong date, typo, etc.), follow this reference-update pattern. See references/raw-article-rename-workflow.md for the full checklist.
Quick version:
- search_files for the old filename across wiki/ to find ALL references
- terminal mv to rename the file
- Update frontmatter (date, etc.) via patch
- Batch-update all references in index.md, log.md, entities/, concepts/, raw/transcripts/
- Leave archived triage JSONs untouched (historical snapshots)
- Append a rename log entry to log.md
- Commit + push
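The rename and reference-update steps could be scripted roughly as follows. This is a sketch under several assumptions: rename_raw_article is a hypothetical helper, references are assumed to contain the filename stem (with or without .md), and archived triage JSONs are assumed to be recognizable by their .json suffix. Frontmatter edits, the log entry, and the commit are left out.

```python
import re
from pathlib import Path


def rename_raw_article(wiki_root, old_rel, new_rel, skip_suffixes=(".json",)):
    """Rename a file under wiki_root and rewrite references to its old stem.

    Returns the relative paths of the files whose contents were updated.
    """
    wiki_root = Path(wiki_root)
    old_path, new_path = wiki_root / old_rel, wiki_root / new_rel
    old_stem, new_stem = old_path.stem, new_path.stem
    # The lookahead avoids touching longer names such as "<stem>-lecture".
    pattern = re.compile(re.escape(old_stem) + r"(?![\w-])")
    old_path.rename(new_path)

    touched = []
    for path in sorted(wiki_root.rglob("*")):
        if not path.is_file() or path.suffix in skip_suffixes:
            continue  # assumption: triage snapshots are .json files
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip binary files
        updated = pattern.sub(lambda _match: new_stem, text)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            touched.append(path.relative_to(wiki_root).as_posix())
    return touched
```

The returned list can be compared against the initial search results to confirm that no reference was missed.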
User preference: When correcting dates or metadata, the user expects a thorough reference update — not just the frontmatter, but all wikilinks, index entries, log entries, and cross-references in related pages.
Pitfalls
- DO NOT use today's date: the filename must reflect when the content was published, not when you processed it
- Blogwatcher URLs may have incorrect slugs: the URL from blogwatcher DB may 404 (e.g., "with-open-source-models" vs "without-regrets"). Before saving a raw article, fetch the URL to verify it resolves. If 404, check the source's RSS feed for the correct URL. Do NOT save with an unverified blogwatcher URL.
- Source-slug from blogwatcher: When using blogwatcher-discovered URLs, the source-slug should come from the verified domain (together-ai, google-developers), not from the potentially inaccurate blogwatcher URL path.
- Slides/presentations: Google Slides, Canva, Figma decks, and similar presentation formats rarely embed a creation date in metadata. Use the ingestion date as the filename date, but add date_ingested: YYYY-MM-DD in the YAML frontmatter (not date or date_published) to distinguish it from a verified publication date.
- Newsletters: If the article is from a newsletter digest, the newsletter date ≠ article date. Find the article's original publication date on the source site.
- Republished content: If an article was republished on a different platform (e.g., cross-posted to Substack and Medium), use the EARLIEST publication date.
- Updated articles: Some articles show "Last updated" instead of "Published". Use the original published date; optionally add an _updated-{YYYY-MM-DD} suffix if the update was substantial.
- No date found: Search the article title + author on Google/Bing. If still no date, use YYYY-MM-DD_unknown_{slug}.md and add a date_unknown: true frontmatter flag.
- Archive/wayback date: As last resort, use the Internet Archive's first capture date.
- Blog posts vs. raw articles: wiki/raw/articles/ is exclusively for externally-sourced finalized article scrapes ingested from newsletters, RSS, web, etc. Original blog posts (Hermes-authored content, user-requested essays) go to blog/ at the repo root (~/ai-topics/blog/). Do NOT write original blog posts into wiki/raw/articles/. The blog/ directory uses a simpler naming convention: {YYYY-MM-DD}_{author}_{short-slug}.md.
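For the blogwatcher pitfall, the "fetch the URL to verify it resolves" step might look like the sketch below. The function name url_resolves is hypothetical, and the opener parameter is there only so the check can be exercised without network access. Treating any 2xx or 3xx status as "resolves" is an assumption, as is treating every network error as a failure.

```python
import urllib.error
import urllib.request


def url_resolves(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if fetching the URL succeeds, False on HTTP or network errors."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        # HTTPError (e.g., 404) is a subclass of URLError, so it lands here too.
        return False
```

When this returns False for a blogwatcher URL, the next step in the policy is to look up the correct URL in the source's RSS feed before saving anything.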