raw-article-filename-policy

name: raw-article-filename-policy
description: Naming convention for raw article files in wiki/raw/articles/. Always use the actual publication date, not the date of ingestion.
trigger: When saving a raw article to wiki/raw/articles/. Before writing any article file, determine the real publication date and construct the filename accordingly.

Raw Article Filename Policy

Academic Papers (`wiki/raw/papers/`)

Filename = {YYYY-MM-DD}_{arxiv-id}_{short-title}.md

Where:

YYYY-MM-DD = the publication date (or the first arXiv submission date if the exact publication date is unknown)
arxiv-id = arXiv ID without version suffix (e.g., 2512.24601, NOT 2512.24601v2)
short-title = 2-5 hyphenated keywords (lowercase, 40 chars max)

For non-arXiv papers (conference proceedings, tech reports without arXiv IDs):

Filename = {YYYY-MM-DD}_{source}_{short-title}.md
Where source = acl, emnlp, neurips, icml, meta, deepseek, etc.
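
The two patterns above can be sketched as a small helper. This is a minimal illustration, not part of the existing tooling: the function names are hypothetical, and it assumes modern arXiv IDs of the form NNNN.NNNNN and that the caller has already chosen the keywords for the short title.

```python
import re

# Matches a modern arXiv ID with an optional version suffix, e.g. 2512.24601v2
ARXIV_ID = re.compile(r"^(\d{4}\.\d{4,5})(?:v\d+)?$")


def normalize_identifier(identifier: str) -> str:
    """Strip the version suffix from an arXiv ID; lowercase any other source."""
    identifier = identifier.strip()
    match = ARXIV_ID.match(identifier)
    return match.group(1) if match else identifier.lower()


def short_title(keywords: str, max_words: int = 5, max_len: int = 40) -> str:
    """Join up to max_words lowercase keywords with hyphens, within max_len chars."""
    slug = ""
    for word in re.findall(r"[a-z0-9]+", keywords.lower())[:max_words]:
        candidate = f"{slug}-{word}" if slug else word
        if len(candidate) > max_len:
            break  # stop at a word boundary rather than cutting a word in half
        slug = candidate
    return slug


def paper_filename(date: str, identifier: str, keywords: str) -> str:
    """Build {YYYY-MM-DD}_{arxiv-id or source}_{short-title}.md."""
    return f"{date}_{normalize_identifier(identifier)}_{short_title(keywords)}.md"


print(paper_filename("2025-12-31", "2512.24601v2", "Recursive Language Models"))
# 2025-12-31_2512.24601_recursive-language-models.md
```

The date is passed through unchanged, so it still has to be verified before calling the helper.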

Dedup Check (MANDATORY before saving)

Before saving ANY new paper to wiki/raw/papers/, run:

python3 scripts/papers_index.py --check <arxiv-url-or-id>

If it returns DUPLICATE, DO NOT create a new file; update the existing one instead. Otherwise, after creating the new file, register it: python3 scripts/papers_index.py --add <filename> <url>

The index is at wiki/raw/papers/.papers_index.json and maps identifiers to canonical filenames.
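
To illustrate what a lookup of this kind involves, here is a rough sketch. It is not the implementation of scripts/papers_index.py: the function names are hypothetical, and it assumes the index is a flat JSON object mapping bare arXiv IDs to filenames, which the real index may not be. Use the script itself for the actual check.

```python
import json
import re
from pathlib import Path


def arxiv_key(url_or_id: str) -> str:
    """Reduce an arXiv URL or ID to a bare ID without a version suffix."""
    match = re.search(r"(\d{4}\.\d{4,5})(?:v\d+)?", url_or_id)
    return match.group(1) if match else url_or_id.strip()


def load_index(path: str = "wiki/raw/papers/.papers_index.json") -> dict:
    """Load the index; a missing file is treated as an empty index."""
    index_path = Path(path)
    if not index_path.exists():
        return {}
    return json.loads(index_path.read_text(encoding="utf-8"))


def find_existing(index: dict, url_or_id: str):
    """Return the canonical filename for a known paper, or None if it is new."""
    return index.get(arxiv_key(url_or_id))


# Assumed index shape: {"<arxiv-id>": "<canonical filename>"}
index = {"2512.24601": "2025-12-31_2512.24601_recursive-language-models.md"}
print(find_existing(index, "https://arxiv.org/abs/2512.24601v2"))
```

Normalizing the identifier first is what lets an abs URL, a PDF URL, and a versioned ID all resolve to the same entry.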

Full pipeline reference: references/papers_index_workflow.md — step-by-step from date discovery through commit/push.

Examples

| Paper | Source Date | Filename |
|---|---|---|
| Recursive Language Models (arXiv:2512.24601) | 2025-12-31 | `2025-12-31_2512.24601_recursive-language-models.md` |
| DeepSeek-V3 (arXiv:2412.19437) | 2024-12-27 | `2024-12-27_2412.19437_deepseek-v3-technical-report.md` |
| GEPA (arXiv:2507.19457) | 2025-07-25 | `2025-07-25_2507.19457_gepa-reflective-prompt-evolution.md` |
| DeepSeek-V4 (HuggingFace) | 2026-04-xx | `2026-04-xx_deepseek-v4-technical-report.md` |

Lecture Transcripts (`wiki/raw/transcripts/`)

Filename = {YYYY-MM-DD}_{source-slug}_{content-slug}-lecture.md

Same date and source-slug rules as articles. Add -lecture suffix to distinguish from the companion slides article.

Frontmatter: set type: transcript, and include date (the lecture date) and date_ingested
Always include related_article: pointing to the companion slides article wikilink path
Example: 2026-06-02_softwaredoug_cheat-at-search-long-running-search-lecture.md
- Companion: articles/2026-06-02_softwaredoug_cheat-at-search-long-running-search.md
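
The relationship between a slides article and its lecture transcript can be sketched as follows. The helper names are hypothetical, the date_ingested value in the usage line is made up for illustration, and the exact form of the related_article value (plain path versus wikilink syntax) is an assumption based on the companion path shown above.

```python
def lecture_transcript_name(article_filename: str) -> str:
    """Derive the transcript filename by adding the -lecture suffix."""
    if not article_filename.endswith(".md"):
        raise ValueError("expected a .md article filename")
    return article_filename[:-3] + "-lecture.md"


def transcript_frontmatter(article_filename: str, lecture_date: str, date_ingested: str) -> str:
    """Return the frontmatter fields the policy requires for a transcript."""
    lines = [
        "type: transcript",
        f"date: {lecture_date}",
        f"date_ingested: {date_ingested}",
        # Assumption: the companion is referenced by its path under articles/
        f"related_article: articles/{article_filename}",
    ]
    return "\n".join(lines)


article = "2026-06-02_softwaredoug_cheat-at-search-long-running-search.md"
print(lecture_transcript_name(article))
print(transcript_frontmatter(article, "2026-06-02", "2026-06-05"))
```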

Original `wiki/raw/articles/` Policy

Filename = {YYYY-MM-DD}_{source-slug}_{content-slug}.md

Where:

YYYY-MM-DD = the ACTUAL publication date of the article (verified from the source, NOT today's date)
source-slug = abbreviated domain or source name (lowercase, no dots: interconnects, pelayoarbues, simonwillison, arxiv, anthropic)
content-slug = 2-5 word descriptive title (lowercase, hyphenated, 30 chars max)
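
A minimal sketch of the article pattern, assuming the caller supplies an already verified publication date and an already abbreviated source name. The function name and the example title and date are hypothetical.

```python
import re
from datetime import date


def article_filename(published: str, source: str, title: str) -> str:
    """Build {YYYY-MM-DD}_{source-slug}_{content-slug}.md from verified inputs."""
    # Reject placeholders and malformed dates such as "2026-04-xx" or "2026-13-01"
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", published):
        raise ValueError(f"not a YYYY-MM-DD date: {published!r}")
    date.fromisoformat(published)  # raises ValueError for impossible dates

    # Lowercase the source and drop dots and any other non-slug characters
    source_slug = re.sub(r"[^a-z0-9-]", "", source.lower())

    # At most 5 words and 30 characters, cut on a word boundary
    content_slug = ""
    for word in re.findall(r"[a-z0-9]+", title.lower())[:5]:
        candidate = f"{content_slug}-{word}" if content_slug else word
        if len(candidate) > 30:
            break
        content_slug = candidate

    return f"{published}_{source_slug}_{content_slug}.md"


print(article_filename("2025-01-16", "simonwillison", "Notes on Prompt Caching"))
# 2025-01-16_simonwillison_notes-on-prompt-caching.md
```

The helper only checks that the date is well formed; confirming that it is the real publication date remains a manual step.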

Transcripts (`wiki/raw/transcripts/`)

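Filename = {YYYY-MM-DD}_{source-slug}_{content-slug}.md (same pattern as articles; a lecture transcript with a companion slides article also takes the -lecture suffix described under Lecture Transcripts)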

Where:

YYYY-MM-DD = the lecture/presentation date (not ingestion date)
source-slug = author handle (e.g., softwaredoug)
content-slug = descriptive slug
type: transcript in frontmatter
See wiki-entity-enrichment-from-article → references/transcript-ingestion.md for full workflow

Examples

| Article | Source Date | Filename |
|---|---|---|
| Nathan Lambert on synthetic data (interconnects.ai) | 2023-11-29 | `2023-11-29_interconnects-llm-synthetic-data.md` |
| Pelayo Arbués on dataset engineers (pelayoarbues.com) | 2025-01-16 | `2025-01-16_pelayoarbues-dataset-engineer.md` |
| Megadocs synthetic pretraining (arXiv) | 2026-03-19 | `2026-03-19_megadocs-synthetic-pretraining.md` |
| Khairallah context engineering course (X Article) | 2026-05-10 | `2026-05-10_engkhairallah_context-engineering-master-course.md` |

X Article / X Note Tweet Naming

For X Article and X Note Tweet raw articles:

source-slug = the X handle without @ and with underscores stripped (e.g., eng_khairallah1 → engkhairallah1)
date = created_at from the parent tweet's API response
Frontmatter must include type: x_article or type: x_note_tweet
Example: 2026-05-10_engkhairallah_context-engineering-master-course.md
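
The handle and date rules above could look like this in code. This is a sketch only: the function names are hypothetical, and it assumes created_at arrives as an ISO 8601 UTC timestamp (for example "2026-05-10T14:03:21.000Z"); other API response formats would need a different parser.

```python
from datetime import datetime


def x_source_slug(handle: str) -> str:
    """Drop the leading @, strip underscores, and lowercase the handle."""
    return handle.strip().lstrip("@").replace("_", "").lower()


def x_article_date(created_at: str) -> str:
    """Convert an ISO 8601 created_at timestamp to YYYY-MM-DD (UTC date)."""
    # Older Python versions do not accept a trailing "Z", so spell out the offset
    parsed = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return parsed.date().isoformat()


print(x_source_slug("@eng_khairallah1"))           # engkhairallah1
print(x_article_date("2026-05-10T14:03:21.000Z"))  # 2026-05-10
```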

How to Find the Real Publication Date

Priority Order

Meta tags in HTML — <meta property="article:published_time">, <time datetime="...">, JSON-LD datePublished
Page text — Look for "Published", "Posted", "Updated" dates in article header/footer
URL patterns — Some blogs embed dates in URL (e.g., /2024/01/15/title)
RSS feed — <pubDate> or <dc:date> in the feed XML
Search engine snippet — Google/Bing search results often show publication dates
Social media cross-post — Check the X/Twitter or LinkedIn post announcing the article. For LinkedIn posts specifically, extract datePublished from JSON-LD SocialMediaPosting — see references/linkedin-date-extraction.md.
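
One way to apply this ordering is to collect whatever dates each source yields and take the first one available in priority order. The sketch below also covers the URL-pattern case. The source labels and function names are hypothetical, chosen only for this illustration.

```python
import re
from datetime import date

# Source labels in the priority order listed above (labels are illustrative)
PRIORITY = ["meta", "page_text", "url", "rss", "search_snippet", "social_post"]


def date_from_url(url: str):
    """Extract a /YYYY/MM/DD/ path segment as YYYY-MM-DD, or None."""
    match = re.search(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)", url)
    if not match:
        return None
    try:
        year, month, day = (int(part) for part in match.groups())
        return date(year, month, day).isoformat()
    except ValueError:  # e.g. /2024/13/45/ is not a real date
        return None


def pick_date(candidates: dict):
    """Return (source, date) for the highest-priority source that has a date."""
    for source in PRIORITY:
        value = candidates.get(source)
        if value:
            return source, value
    return None


found = {"url": date_from_url("https://example.com/2024/01/15/title"), "rss": "2024-01-16"}
print(pick_date(found))  # ('url', '2024-01-15')
```

When two sources disagree, as in the usage line, the higher-priority source wins, but a mismatch is still worth a manual look.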

Verification Steps

import re
import urllib.request

url = "https://example.com/article"  # replace with the article URL

# Fetch the page; undecodable bytes are replaced rather than raising
with urllib.request.urlopen(url, timeout=10) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# Check meta tags (this pattern expects property= to appear before content=)
for m in re.finditer(r'<meta[^>]*property=["\']article:published_time["\'][^>]*content=["\']([^"\']+)["\']', html):
    print("Meta published:", m.group(1))  # 2023-11-29T...

# Check time tags
for m in re.finditer(r'<time[^>]*datetime=["\']([^"\']+)["\']', html):
    print("Datetime attr:", m.group(1))

# Check JSON-LD
for m in re.finditer(r'"datePublished"\s*:\s*"([^"]+)"', html):
    print("JSON-LD datePublished:", m.group(1))
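
The values printed by these checks are usually full timestamps, while the filename needs a plain YYYY-MM-DD. A small sketch of that last step follows. The helper names are hypothetical, and it keeps the date as written in the source rather than converting time zones.

```python
import re
from datetime import date


def to_filename_date(raw: str):
    """Return the leading YYYY-MM-DD of a timestamp, or None if absent or invalid."""
    match = re.match(r"\s*(\d{4}-\d{2}-\d{2})", raw)
    if not match:
        return None
    try:
        return date.fromisoformat(match.group(1)).isoformat()
    except ValueError:
        return None


def agreed_date(raw_values):
    """Return the date only when every parseable value agrees, else None."""
    dates = {d for d in map(to_filename_date, raw_values) if d}
    return dates.pop() if len(dates) == 1 else None


print(to_filename_date("2023-11-29T08:00:00+00:00"))       # 2023-11-29
print(agreed_date(["2023-11-29T08:00:00Z", "2023-11-29"]))  # 2023-11-29
```

A None result from agreed_date means the sources conflict or nothing parsed, which is the signal to fall back to the priority order and check manually.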

Raw Article Rename Workflow

When a raw article filename needs correction (wrong date, typo, etc.), follow this reference-update pattern. See references/raw-article-rename-workflow.md for the full checklist.

Quick version:

search_files for the old filename across wiki/ to find ALL references
terminal mv to rename the file
Update frontmatter (date, etc.) via patch
Batch-update all references in index.md, log.md, entities/, concepts/, raw/transcripts/
Leave archived triage JSONs untouched (historical snapshots)
Append a rename log entry to log.md
Commit + push
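
The search and batch-update steps above can be approximated with a short script. This is a sketch under stated assumptions, not the checklist in references/raw-article-rename-workflow.md: the function name is hypothetical, it matches on the filename stem so links with and without the .md extension are both caught, and it scans only .md files, which is how it leaves archived triage JSONs untouched. The file rename itself, the frontmatter patch, and the log entry are still separate steps.

```python
from pathlib import Path


def update_references(wiki_root, old_stem: str, new_stem: str, dry_run: bool = True):
    """Replace old_stem with new_stem in every .md file under wiki_root.

    Returns the files that contain a reference. With dry_run=True nothing is
    written, so the list can be reviewed first.
    """
    touched = []
    for path in sorted(Path(wiki_root).rglob("*.md")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        if old_stem not in text:
            continue
        if not dry_run:
            path.write_text(text.replace(old_stem, new_stem), encoding="utf-8")
        touched.append(path)
    return touched


if __name__ == "__main__":
    # Hypothetical stems; review the dry-run list before rerunning with dry_run=False
    for hit in update_references("wiki", "2026-01-01_src_some-slug", "2025-12-30_src_some-slug"):
        print(hit)
```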

User preference: When correcting dates or metadata, the user expects a thorough reference update — not just the frontmatter, but all wikilinks, index entries, log entries, and cross-references in related pages.

Pitfalls

DO NOT use today's date — the filename must reflect when the content was published, not when you processed it
Blogwatcher URLs may have incorrect slugs — the URL from blogwatcher DB may 404 (e.g., "with-open-source-models" vs "without-regrets"). Before saving a raw article, fetch the URL to verify it resolves. If 404, check the source's RSS feed for the correct URL. Do NOT save with an unverified blogwatcher URL.
Source-slug from blogwatcher: When using blogwatcher-discovered URLs, the source-slug should come from the verified domain (together-ai, google-developers), not from the potentially inaccurate blogwatcher URL path.
Slides/presentations — Google Slides, Canva, Figma decks, and similar presentation formats rarely embed a creation date in metadata. Use the ingestion date as the filename date, but add date_ingested: YYYY-MM-DD in the YAML frontmatter (not date or date_published) to distinguish it from a verified publication date.
Newsletters — If the article is from a newsletter digest, the newsletter date ≠ article date. Find the article's original publication date on the source site.
Republished content — If an article was republished on a different platform (e.g., cross-posted to Substack and Medium), use the EARLIEST publication date.
Updated articles — Some articles show "Last updated" instead of "Published". Use the original published date; optionally add _updated-{YYYY-MM-DD} suffix if the update was substantial.
No date found — Search the article title + author on Google/Bing. If still no date, use YYYY-MM-DD_unknown_{slug}.md and add a date_unknown: true frontmatter flag.
Archive/wayback date — As last resort, use the Internet Archive's first capture date.
Blog posts vs. raw articles — wiki/raw/articles/ is exclusively for externally-sourced finalized article scrapes ingested from newsletters, RSS, web, etc. Original blog posts (Hermes-authored content, user-requested essays) go to blog/ at the repo root (~/ai-topics/blog/). Do NOT write original blog posts into wiki/raw/articles/. The blog/ directory uses a simpler naming convention: {YYYY-MM-DD}_{author}_{short-slug}.md.
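
Two of the rules in this list, republished content and updated articles, reduce to simple date handling. A minimal sketch, with hypothetical helper names and made-up dates in the usage lines:

```python
from datetime import date


def earliest_publication(dates):
    """Pick the earliest of several YYYY-MM-DD publication dates."""
    return min(date.fromisoformat(d) for d in dates).isoformat()


def with_updated_suffix(filename: str, updated: str) -> str:
    """Append the optional _updated-{YYYY-MM-DD} suffix before the extension."""
    stem = filename[:-3] if filename.endswith(".md") else filename
    return f"{stem}_updated-{updated}.md"


# Cross-posted on two platforms: the earlier date goes in the filename
print(earliest_publication(["2024-03-05", "2024-03-01"]))  # 2024-03-01
print(with_updated_suffix("2024-03-01_src_some-slug.md", "2024-06-10"))
# 2024-03-01_src_some-slug_updated-2024-06-10.md
```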