name: sweep-ai-safety
description: Sweep recent AI safety research from curated sources (Anthropic alignment science / red team, OpenAI, GDM, Apollo, Redwood, METR, FAR AI, Truthful AI, alphaxiv, arXiv) and surface items matching tracked topic terms (inoculation prompting, reward hacking, exploration hacking, metagaming, eval gaming, OOCR, scheming, alignment faking, sandbagging, etc.). Use when asked to "sweep AI safety", "what's new in alignment", "any recent papers on X", "weekly safety digest", or for staying current on AI safety literature.
Sweep AI Safety Research
A hybrid skill: a curated source registry + topic glossary Claude can consult during research, and a Python script that fetches feeds and produces a dated markdown digest of the last 7 days.
When to invoke
User asks: "what's new in AI safety", "sweep alignment research", "any recent papers on X", "weekly safety digest", "what did Anthropic/Apollo/Redwood post recently"
Before research planning — to check whether a question is already addressed by recent work
When a user mentions an unfamiliar term (inoculation prompting, OOCR, exploration hacking, metagaming) — consult terms.md
Periodic — schedule via /loop 7d /sweep-ai-safety or a cron routine
Quick start
# Default: last 7 days, all sources, markdown to stdout
uv run ~/.claude/skills/sweep-ai-safety/sweep.py
# Save to a dated file
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --output digest.md
# Wider window
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --since 30d
# Filter by topic term (matches in title/summary)
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --term "reward hacking"
# Single source
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --source anthropic-alignment
# arXiv keyword search (uses term registry by default)
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --arxiv-only
Architecture
sweep-ai-safety/
├── SKILL.md # this file
├── sources.yaml # source registry (orgs, blogs, feed URLs)
├── terms.md # topic-term glossary with short definitions
└── sweep.py # fetcher (PEP 723, uv run directly)
Source registry (sources.yaml)
One entry per source. Fields:
- key: anthropic-alignment # short id used in --source
  org: Anthropic
  name: Alignment Science blog
  url: https://alignment.anthropic.com/
  rss: https://alignment.anthropic.com/rss.xml # null if no feed
  lane: rss | scrape # rss != null <=> lane: rss
  kind: blog | papers | aggregator | researcher
  verified: false # flip to true after first successful fetch
  notes: ...
The script does NOT trust rss: blindly — on first run, it reports which feeds resolved and which need manual URL correction. Edit sources.yaml to fix.
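The lane invariant noted in the field comments (rss != null <=> lane: rss) can be checked mechanically before a sweep. The following is a minimal sketch, assuming the entries have already been loaded from sources.yaml into plain dicts. `check_entry` is a hypothetical helper for illustration, not a function that sweep.py is known to contain.

```python
VALID_LANES = {"rss", "scrape"}
VALID_KINDS = {"blog", "papers", "aggregator", "researcher"}
REQUIRED_FIELDS = ("key", "org", "name", "url", "lane", "kind")


def check_entry(entry: dict) -> list:
    """Return a list of problems found in one registry entry (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            problems.append(f"missing field: {field}")

    lane = entry.get("lane")
    if lane is not None and lane not in VALID_LANES:
        problems.append(f"unknown lane: {lane}")

    kind = entry.get("kind")
    if kind is not None and kind not in VALID_KINDS:
        problems.append(f"unknown kind: {kind}")

    # Invariant from the field comments: rss is set exactly when lane is "rss".
    has_rss = entry.get("rss") is not None
    if lane in VALID_LANES and has_rss != (lane == "rss"):
        problems.append("lane/rss mismatch: rss must be set iff lane is 'rss'")
    return problems
```

An entry with lane set to rss but rss left as null is reported as a mismatch, which is the case most likely to appear after a feed URL is removed by hand.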
Two-lane reality (RSS vs scrape)
Most curated orgs do not expose a working RSS feed. Live-verified state:
RSS lane (have a real feed, fetched by the feed reader): openai-blog,
openai-alignment, metr, redwood. These are the only sources that produce
items from a plain sweep.py run.
Scrape lane (rss: null, no usable feed — fetched via WebFetch on the landing
url): anthropic-alignment, anthropic-redteam, anthropic-research, openai-safety,
apollo, transluce, deepmind-safety, far-ai, truthful-ai, owain-evans,
alphaxiv-safety. Anthropic (both blogs) and Apollo are SPAs/404 — their /rss.xml
serves HTML, not XML.
arXiv (arxiv-terms): scrape-lane by the rss-null convention, but fetched by the
script's dedicated arXiv export-API path, not the generic WebFetch scrape step.
A scrape-lane source returning 0 items is EXPECTED from a plain sweep.py run —
the script only reads RSS feeds. Those sources surface only when the scrape step runs
(WebFetch the url). Treat "0 items from a scrape-lane source" as "not fetched", not
"nothing published". The scrape-lane fetch implementation lands in a later phase; until
then, scrape-lane sources are checked manually via WebFetch (see reference-mode workflow).
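The "0 items means not fetched" rule can be made explicit when reading a digest. This is a small sketch under stated assumptions: the function name, the `scrape_ran` flag, and the status labels are all hypothetical and chosen for illustration only.

```python
def interpret_count(lane: str, item_count: int, scrape_ran: bool = False) -> str:
    """Label a per-source item count so absence is not misread as silence."""
    if item_count > 0:
        return "ok"
    if lane == "scrape" and not scrape_ran:
        # A plain sweep.py run only reads RSS feeds, so this source was skipped.
        return "not fetched"
    # The source was actually queried and returned nothing: either nothing
    # was published in the window or the fetch is broken.
    return "empty or broken"
```

Only the "empty or broken" label needs a manual follow-up; "not fetched" is the expected state for scrape-lane sources until the scrape step runs.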
Zotero (dedup + sink)
Zotero is not a fetch source. It's the dedup reference (what's already been
collected) and the storage sink (where surfaced items land). It's handled by the
downstream pipeline, not by the sweep fetch itself — sweep.py neither reads from
nor writes to Zotero.
Topic glossary (terms.md)
A flat list of tracked terms with one-sentence definitions and (where useful) seminal paper anchors. Used:
By Claude as a reference when a user mentions a term — read the entry and the linked paper
By sweep.py for cross-reference: items mentioning any registered term are tagged in the digest
Add a new term: append to terms.md. Optionally add it to the arxiv_search_terms: list in sources.yaml so the script searches arXiv for it.
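Term tagging of this kind can be sketched with one compiled pattern per term, built from its aliases. This is an illustration under assumptions, not the script's actual matcher: the alias table below is hypothetical (the real terms live in terms.md), and the helper names are invented for the example.

```python
import re

# Hypothetical alias table for illustration only.
TERM_ALIASES = {
    "OOCR": ["OOCR", "out-of-context reasoning"],
    "reward hacking": ["reward hacking"],
}


def alias_pattern(alias: str) -> str:
    """Build a pattern in which spaces and hyphens are interchangeable."""
    parts = re.split(r"[\s-]+", alias.strip())
    return r"[\s-]+".join(re.escape(part) for part in parts)


def compile_terms(aliases: dict) -> dict:
    """Compile one case-insensitive, word-bounded pattern per term."""
    compiled = {}
    for term, names in aliases.items():
        body = "|".join(alias_pattern(name) for name in names)
        compiled[term] = re.compile(rf"\b(?:{body})\b", re.IGNORECASE)
    return compiled


def tag_terms(text: str, compiled: dict) -> list:
    """Return the sorted list of terms whose pattern occurs in text."""
    return sorted(term for term, rx in compiled.items() if rx.search(text))
```

Treating spaces and hyphens as the same separator means a single alias covers both "out-of-context reasoning" and "out of context reasoning", so only genuinely different spellings need their own alias.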
Reference-mode workflow (no script)
When the user mentions a term or asks about recent work without wanting a full sweep:
Open terms.md — does the term have a glossary entry? Read it
Open sources.yaml — identify the likely source(s) (e.g., scheming → Apollo; sandbagging → Anthropic alignment / METR; inoculation prompting → Truthful AI / Owain Evans)
Use WebSearch or WebFetch on the relevant source's blog / publications page
If a paper title is mentioned, look it up via arXiv API:
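A title lookup against arXiv's public export API can be sketched as follows. The endpoint and the `ti:` field prefix come from the arXiv API; the function names are hypothetical, and nothing here is claimed to match what sweep.py does internally. URL construction and Atom parsing are kept separate from the network call so they can be checked offline.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "https://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"


def title_query_url(title: str, max_results: int = 5) -> str:
    """Build a query URL that searches the title field for an exact phrase."""
    params = {
        "search_query": f'ti:"{title}"',
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_entries(atom_xml) -> list:
    """Extract id, title, and published date from an Atom response (str or bytes)."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall(f"{ATOM}entry"):
        entries.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Titles are often wrapped across lines; collapse the whitespace.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "published": entry.findtext(f"{ATOM}published", default="").strip(),
        })
    return entries


def lookup_title(title: str) -> list:
    """Fetch and parse matches for a title (requires network access)."""
    with urllib.request.urlopen(title_query_url(title), timeout=30) as resp:
        return parse_entries(resp.read())
```

Calling `lookup_title` with the paper's title returns a list of candidate entries; the `id` field still needs to be opened and read before the paper is cited.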
For a full sweep, run the script and save a dated digest:
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --output "$HOME/scratch/safety-digest-$(utc_date).md"
Review the digest. Flag any failed fetches — they indicate URLs that drifted
Update sources.yaml if URLs need correction; mark verified: true for ones that worked
Read the high-signal items (matched-term or known-author hits) in full via WebFetch
Failure modes & gotchas
| Issue | Why | Fix |
|---|---|---|
| Most sources return 0 items | RSS URL drifted or site has no feed | Open the org's blog page, find the actual feed URL, update sources.yaml. If no feed exists, the org has to be checked manually via WebFetch |
| arXiv requests get throttled | arXiv rate-limits at ~5 req/3s; sticky penalty 30-60s if exceeded | Script already batches arXiv term searches. If still throttled, wait 60s |
| Same paper appears under multiple sources | A paper can be on arXiv + an org blog + alphaxiv | Script dedupes by arXiv ID and by normalized title (case + punctuation collapsed). Subtitles or site-specific suffixes will still slip through; flag duplicates manually |
| Term doesn't match because of variant spelling | "OOCR" vs "out-of-context reasoning" vs "out of context reasoning" | Add aliases to the term entry in terms.md and the regex in sources.yaml |
| Sandbox blocks external HTTP | Most non-allowlisted hosts return connection error from Claude Code's sandbox | Run with dangerouslyDisableSandbox: true, or run from a normal shell |
| Item is in the right time window but old content | Some blogs republish/redate posts | Cross-check the canonical URL date; trust arXiv submittedDate over blog dates |
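The two-key dedup described for papers that appear under multiple sources (arXiv ID first, then normalized title) can be sketched like this. It is an illustration, not the script's code: the item shape (dicts with `title` and `url`) and the helper names are assumptions.

```python
import re
import string

# Matches new-style arXiv IDs in abs/pdf URLs and ignores a version suffix.
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", re.IGNORECASE)
PUNCT = str.maketrans("", "", string.punctuation)


def normalize_title(title: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(title.lower().translate(PUNCT).split())


def dedupe(items: list) -> list:
    """Keep the first item seen for each arXiv ID or normalized title."""
    seen_ids, seen_titles, unique = set(), set(), []
    for item in items:
        match = ARXIV_ID.search(item.get("url", ""))
        arxiv_id = match.group(1) if match else None
        title_key = normalize_title(item.get("title", ""))
        if (arxiv_id and arxiv_id in seen_ids) or (title_key and title_key in seen_titles):
            continue
        if arxiv_id:
            seen_ids.add(arxiv_id)
        if title_key:
            seen_titles.add(title_key)
        unique.append(item)
    return unique
```

A title that carries a site-specific suffix normalizes to a different key and therefore survives, which is the reason duplicates of that kind still need manual flagging.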
Verification policy (research integrity)
This skill surfaces candidates. Always verify before acting on a finding:
Don't cite a paper from the digest without WebFetching the actual source
Don't claim a term is "from paper X" without checking — the script's glossary is a starting point, not ground truth
If the digest reports no items from a source in the last 7d, either nothing was published or the feed isn't working; distinguish the two before relying on the absence
Adapting
Add a source: append to sources.yaml, set verified: false, run sweep, update if it works
Add a tracked term: append to terms.md with a one-line definition; optionally add it to arxiv_search_terms: in sources.yaml for arXiv inclusion
Different cadence: pass --since 14d / --since 30d; or schedule via /loop 14d /sweep-ai-safety (or as a routine via /schedule)
JSON output for programmatic use: pass --json (emits one item per line, NDJSON)
Conference papers: the weekly sweep relies on arXiv catching venue cross-posts —
tag an item with its venue when the arXiv comment field names one (e.g. "Accepted at
NeurIPS 2025"). A separate episodic --conference <venue> roundup mode (scraping
accepted-paper lists plus safety workshops like SoLaR) is planned for when proceedings
drop — NeurIPS/ICLR/ICML land ~3x/year, not weekly, so it doesn't belong in the weekly cadence.
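Venue tagging from the arXiv comment field can be sketched with a single pattern. This is a minimal illustration under assumptions: the venue list covers only the three conferences named above, the function name is hypothetical, and a comment is treated as a match only when it contains the word "accepted" before the venue name.

```python
import re

VENUES = ("NeurIPS", "ICLR", "ICML")
VENUE_RE = re.compile(
    r"\baccepted\b[^.;]*?\b(" + "|".join(VENUES) + r")\b(?:\s+(\d{4}))?",
    re.IGNORECASE,
)


def venue_tag(comment):
    """Return e.g. 'NeurIPS 2025' if the comment reports acceptance, else None."""
    if not comment:
        return None
    match = VENUE_RE.search(comment)
    if not match:
        return None
    # Map the matched text back to the canonical capitalization.
    canonical = next(v for v in VENUES if v.lower() == match.group(1).lower())
    year = match.group(2)
    return f"{canonical} {year}" if year else canonical
```

Requiring "accepted" keeps comments such as "Submitted to ICML 2025" from being tagged, at the cost of missing phrasings like "To appear at"; those would need extra patterns.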