sweep-ai-safety

name: sweep-ai-safety description: Sweep recent AI safety research from curated sources (Anthropic alignment science / red team, OpenAI, GDM, Apollo, Redwood, METR, FAR AI, Truthful AI, alphaxiv, arXiv) and surface items matching tracked topic terms (inoculation prompting, reward hacking, exploration hacking, metagaming, eval gaming, OOCR, scheming, alignment faking, sandbagging, etc.). Use when asked to "sweep AI safety", "what's new in alignment", "any recent papers on X", "weekly safety digest", or for staying current on AI safety literature.

Sweep AI Safety Research

A hybrid skill: a curated source registry + topic glossary Claude can consult during research, and a Python script that fetches feeds and produces a dated markdown digest of the last 7 days.

When to invoke

User asks: "what's new in AI safety", "sweep alignment research", "any recent papers on ", "weekly safety digest", "what did Anthropic/Apollo/Redwood post recently"
Before research planning — to check whether a question is already addressed by recent work
When a user mentions an unfamiliar term (inoculation prompting, OOCR, exploration hacking, metagaming) — consult terms.md
Periodic — schedule via /loop 7d /sweep-ai-safety or a cron routine

Quick start

# Default: last 7 days, all sources, markdown to stdout
uv run ~/.claude/skills/sweep-ai-safety/sweep.py

# Save to a dated file
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --output digest.md

# Wider window
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --since 30d

# Filter by topic term (matches in title/summary)
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --term "reward hacking"

# Single source
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --source anthropic-alignment

# arXiv keyword search (uses term registry by default)
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --arxiv-only

Architecture

sweep-ai-safety/
├── SKILL.md          # this file
├── sources.yaml      # source registry (orgs, blogs, feed URLs)
├── terms.md          # topic-term glossary with short definitions
└── sweep.py          # fetcher (PEP 723, uv run directly)

Source registry (`sources.yaml`)

One entry per source. Fields:

- key: anthropic-alignment        # short id used in --source
  org: Anthropic
  name: Alignment Science blog
  url: https://alignment.anthropic.com/
  rss: https://alignment.anthropic.com/rss.xml   # null if no feed
  lane: rss | scrape               # rss != null <=> lane: rss
  kind: blog | papers | aggregator | researcher
  verified: false                  # flip to true after first successful fetch
  notes: ...

The script does NOT trust rss: blindly — on first run, it reports which feeds resolved and which need manual URL correction. Edit sources.yaml to fix.

Two-lane reality (RSS vs scrape)

Most curated orgs do not expose a working RSS feed. Live-verified state:

RSS lane (have a real feed, fetched by the feed reader): openai-blog, openai-alignment, metr, redwood. These are the only sources that produce items from a plain sweep.py run.
Scrape lane (rss: null, no usable feed — fetched via WebFetch on the landing url): anthropic-alignment, anthropic-redteam, anthropic-research, openai-safety, apollo, transluce, deepmind-safety, far-ai, truthful-ai, owain-evans, alphaxiv-safety. Anthropic (both blogs) and Apollo are SPAs/404 — their /rss.xml serves HTML, not XML.
arXiv (arxiv-terms): scrape-lane by the rss-null convention, but fetched by the script's dedicated arXiv export-API path, not the generic WebFetch scrape step.

A scrape-lane source returning 0 items is EXPECTED from a plain sweep.py run — the script only reads RSS feeds. Those sources surface only when the scrape step runs (WebFetch the url). Treat "0 items from a scrape-lane source" as "not fetched", not "nothing published". The scrape-lane fetch implementation lands in a later phase; until then, scrape-lane sources are checked manually via WebFetch (see reference-mode workflow).

Zotero (dedup + sink)

Zotero is not a fetch source. It's the dedup reference (what's already been collected) and the storage sink (where surfaced items land). It's handled by the downstream pipeline, not by the sweep fetch itself — sweep.py neither reads from nor writes to Zotero.

Topic glossary (`terms.md`)

A flat list of tracked terms with one-sentence definitions and (where useful) seminal paper anchors. Used:

By Claude as a reference when a user mentions a term — read the entry and the linked paper
By sweep.py for cross-reference: items mentioning any registered term are tagged in the digest

Add a new term: append to terms.md. Optionally add it to the arxiv_search_terms: list in sources.yaml so the script searches arXiv for it.

Reference-mode workflow (no script)

When the user mentions a term or asks about recent work without wanting a full sweep:

Open terms.md — does the term have a glossary entry? Read it
Open sources.yaml — identify the likely source(s) (e.g., scheming → Apollo; sandbagging → Anthropic alignment / METR; inoculation prompting → Truthful AI / Owain Evans)
Use WebSearch or WebFetch on the relevant source's blog / publications page
If a paper title is mentioned, look it up via arXiv API:
- curl -sL "http://export.arxiv.org/api/query?search_query=ti:%22<title%20words>%22&max_results=5" (Atom XML)

Sweep-mode workflow (script)

uv run ~/.claude/skills/sweep-ai-safety/sweep.py --output "$HOME/scratch/safety-digest-$(utc_date).md"
Review the digest. Flag any failed fetches — they indicate URLs that drifted
Update sources.yaml if URLs need correction; mark verified: true for ones that worked
Read the high-signal items (matched-term or known-author hits) in full via WebFetch

Failure modes & gotchas

Issue	Why	Fix
Most sources return 0 items	RSS URL drifted or site has no feed	Open the org's blog page, find the actual feed URL, update `sources.yaml`. If no feed exists, the org has to be checked manually via WebFetch
arXiv requests get throttled	arXiv rate-limits at ~5 req/3s; sticky penalty 30-60s if exceeded	Script already batches arXiv term searches. If still throttled, wait 60s
Same paper appears under multiple sources	A paper can be on arXiv + an org blog + alphaxiv	Script dedupes by arXiv ID and by normalized title (case + punctuation collapsed). Subtitles or site-specific suffixes will still slip through — flag duplicates manually
Term doesn't match because of variant spelling	"OOCR" vs "out-of-context reasoning" vs "out of context reasoning"	Add aliases to the term entry in `terms.md` and the regex in `sources.yaml`
Sandbox blocks external HTTP	Most non-allowlisted hosts return connection error from Claude Code's sandbox	Run with `dangerouslyDisableSandbox: true`, or run from a normal shell
Item is in the right time window but old content	Some blogs republish/redate posts	Cross-check the canonical URL date; trust arXiv `submittedDate` over blog dates

Verification policy (research integrity)

This skill surfaces candidates. Always verify before acting on a finding:

Don't cite a paper from the digest without WebFetching the actual source
Don't claim a term is "from paper X" without checking — the script's glossary is a starting point, not ground truth
If the digest says "no items from in last 7d", that means either nothing was published OR the feed isn't working — distinguish before relying on the absence

Adapting

Add a source: append to sources.yaml, set verified: false, run sweep, update if it works
Add a tracked term: append to terms.md with a one-line definition; optionally add to sources.yaml arxiv_search_terms: for arXiv inclusion
Different cadence: pass --since 14d / --since 30d; or schedule via /loop 14d /sweep-ai-safety (or as a routine via /schedule)
JSON output for programmatic use: pass --json (emits one item per line, NDJSON)
Conference papers: the weekly sweep relies on arXiv catching venue cross-posts — tag an item with its venue when the arXiv comment field names one (e.g. "Accepted at NeurIPS 2025"). A separate episodic --conference <venue> roundup mode (scraping accepted-paper lists plus safety workshops like SoLaR) is planned for when proceedings drop — NeurIPS/ICLR/ICML land ~3x/year, not weekly, so it doesn't belong in the weekly cadence.