save-to-academic-notion - SKILL.md Agent Skill

name: save-to-academic-notion
description: Save a URL/arXiv link into Selina's Notion Academic database as a new entry if missing, and populate metadata (title, authors, abstract). Supports arXiv papers (auto-fetch), non-arXiv URLs (web scraping fallback), and manual mode for agent-assisted saves. Use when the user says "save to notion", "add this paper", "save this arxiv", "put this link in my Academic database", or asks to create a new paper entry from a URL.

Save to Academic Notion (Selina)

Quick Usage (Automated)

Preferred method - use the Python script:

cd ~/.openclaw/workspace/skills/save-to-academic-notion
./save_paper.py <arxiv_id_or_url>

Examples:

# arXiv papers (auto-fetch metadata)
./save_paper.py 2501.12345
./save_paper.py https://arxiv.org/abs/2501.12345
./save_paper.py https://arxiv.org/pdf/2501.12345.pdf

# Non-arXiv URLs (web scraping fallback)
./save_paper.py https://openreview.net/forum?id=abc123

# Manual mode (when auto-extraction fails)
./save_paper.py --manual \
  --title "Paper Title" \
  --url "https://example.com/paper" \
  --authors "Alice; Bob; Charlie" \
  --abstract "This paper presents..."

# JSON output (for automation)
./save_paper.py 2501.12345 --json

What it does:

Auto-fetches metadata from arXiv API (title, authors, abstract, date)
For non-arXiv URLs: attempts web scraping fallback
Checks for duplicates (via Link property)
Creates page with proper properties and abstract as quote block
Uses "Authors txt" field (semicolon-separated text, not multi-select)
Returns Notion page URL

Output:

Fetching metadata for 2501.12345...
Creating Notion page for: Paper Title...
Authors: Author One, Author Two, ...
✓ Created: https://www.notion.so/...

If paper already exists:

✓ Paper already exists: https://www.notion.so/...

JSON output:

{
  "ok": true,
  "action": "created",  // or "exists"
  "page_id": "...",
  "page_url": "https://www.notion.so/...",
  "source": "arxiv"  // or "web_fetch" or "manual"
}

Error handling (non-arXiv URLs): If web scraping fails, returns structured error with suggestion:

{
  "ok": false,
  "error": "Could not extract metadata from URL",
  "suggestion": "Use browser tools to extract metadata, then use --manual mode"
}
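
A minimal sketch of how an agent might branch on the --json result, assuming the script prints one plain JSON object on stdout (the // annotations above are documentation only). The function name next_step is hypothetical and not part of save_paper.py.

```python
import json

def next_step(raw_output):
    """Summarize the script's --json output as a single status line."""
    result = json.loads(raw_output)
    if result.get("ok"):
        # Both "created" and "exists" carry the Notion page URL.
        return f"{result['action']}: {result['page_url']}"
    # Failure: surface the error together with the script's own suggestion.
    error = result.get("error", "unknown error")
    suggestion = result.get("suggestion", "no suggestion")
    return f"failed: {error} ({suggestion})"
```

A "failed" result is the signal to switch to the manual mode workflow described in the next section.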

Non-arXiv Paper Fallback

arXiv papers: Metadata is auto-fetched from the arXiv API, with a scrape of the arXiv abs page as fallback.

Non-arXiv URLs: Attempts web scraping via openclaw web_fetch

Success: Creates page with extracted metadata (may be incomplete)
Failure: Returns error with suggestion to use --manual mode

Manual mode workflow (for agent):

1. Try auto-save first: ./save_paper.py <url> --json
2. If ok: false:
   - Use browser tools to extract title, authors, abstract
   - Save with --manual mode:

./save_paper.py --manual \
  --title "Paper Title" \
  --url "https://..." \
  --authors "Author1; Author2; Author3" \
  --abstract "..." \
  --date "2026-03-10"
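
The manual save can also be assembled programmatically. The sketch below builds the argument list for the command shown above; build_manual_command is a hypothetical helper, and treating --abstract and --date as optional is an assumption based on the examples in this file.

```python
def build_manual_command(title, url, authors, abstract="", date=None):
    """Build the argv list for a --manual save (flags as shown in this file)."""
    # "Authors txt" expects one semicolon-separated string.
    author_text = "; ".join(a.strip() for a in authors if a.strip())
    cmd = ["./save_paper.py", "--manual",
           "--title", title,
           "--url", url,
           "--authors", author_text]
    if abstract:
        cmd += ["--abstract", abstract]
    if date:
        cmd += ["--date", date]  # e.g. "2026-03-10"
    return cmd
```

Passing the list to subprocess.run with cwd set to the skill directory avoids shell quoting problems with long abstracts.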

Supported URLs:

arXiv (always works): https://arxiv.org/abs/...
OpenReview, ACL Anthology, bioRxiv, journal sites (web scraping fallback)
Any URL (manual mode as last resort)

Targets

Academic paper database:
- database_id: d3e3be7f-c96a-45de-8e7d-3a78298f9ccd
- data_source_id (query): 73e9f7f8-c667-4279-a62f-2c16c1885d0f

Hard rules

Only write when Selina explicitly asks to save/add/log.
Deduplicate: do not create a new entry if it already exists.
Prefer using the database property Link as the canonical key.
Authors field: Uses "Authors txt" (rich text, semicolon-separated), not "Author" (multi-select)

Dedup workflow

Normalize the URL (trim, remove tracking params when safe).
Query the data source filtering:
- Link equals the normalized URL
- If arXiv: also try canonical https://arxiv.org/abs/<id> if the user provided a PDF link.
If found: return the existing page URL; optionally offer to update metadata if missing.
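
The dedup steps above can be sketched as follows. This is an illustration, not the script's actual code: the tracking-parameter list, the helper names, and the choice to drop the arXiv version suffix for the canonical link are all assumptions. The filter shape follows Notion's url property filter with an "or" compound.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the skill does not enumerate them.
TRACKING_KEYS = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)
ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")

def normalize_url(raw):
    """Trim whitespace, lowercase the host, and drop known tracking params."""
    parts = urlsplit(raw.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(query), parts.fragment))

def candidate_links(raw):
    """Return the normalized URL plus the canonical arXiv abs URL, if any."""
    url = normalize_url(raw)
    candidates = [url]
    host = urlsplit(url).netloc
    match = ARXIV_ID.search(url)
    if match and (host == "arxiv.org" or host.endswith(".arxiv.org")):
        canonical = f"https://arxiv.org/abs/{match.group(1)}"
        if canonical not in candidates:
            candidates.append(canonical)
    return candidates

def link_filter(candidates):
    """Build a query filter matching the Link property against any candidate."""
    clauses = [{"property": "Link", "url": {"equals": c}} for c in candidates]
    return clauses[0] if len(clauses) == 1 else {"or": clauses}
```

The resulting dict would go under the "filter" key of the data source query body.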

Create workflow

If not found:

Create a new page in the database:
- POST /v1/pages with parent: {"database_id": <database_id>}
- Set properties:
  - Name (title): best available title (from arXiv/metadata)
  - Link (url): normalized URL
  - Optionally: Publisher, Type, Publishing/Release Date, Discipline, Topics, Status
Populate authors:
- Updated (Mar 10, 2026): Uses Authors txt property (rich text, semicolon-separated)
- Format: "Author1; Author2; Author3"
- No longer uses: Author property (multi-select) — causes schema overflow with 1000+ unique authors
- Example: properties["Authors txt"] = {"rich_text": [{"text": {"content": "Alice; Bob; Charlie"}}]}
Populate abstract:
- There is no dedicated "Abstract" property.
- Put the abstract into the page body as blocks (preferred):
  - quote (common in Selina's pages) OR heading_2: "Abstract"
  - paragraph (or quote content): abstract text
- Optionally use TLDR for a short summary (not the full abstract), if Selina wants that later.
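
A minimal sketch of the request body for POST /v1/pages, using the property names listed above and a quote block for the abstract. build_page_payload and rich_text are hypothetical helpers; the chunking reflects Notion's 2000-character limit on a single rich text content string, which long abstracts can exceed.

```python
DATABASE_ID = "d3e3be7f-c96a-45de-8e7d-3a78298f9ccd"  # from Targets
MAX_TEXT = 2000  # Notion limit for one rich text content string

def rich_text(text):
    """Split text into rich text objects of at most MAX_TEXT characters."""
    return [{"type": "text", "text": {"content": text[i:i + MAX_TEXT]}}
            for i in range(0, len(text), MAX_TEXT)]

def build_page_payload(title, link, authors, abstract=""):
    """Assemble the create-page body: Name, Link, Authors txt, abstract quote."""
    payload = {
        "parent": {"database_id": DATABASE_ID},
        "properties": {
            "Name": {"title": rich_text(title)},
            "Link": {"url": link},
            # Semicolon-separated text, not the multi-select Author property.
            "Authors txt": {"rich_text": rich_text("; ".join(authors))},
        },
    }
    if abstract:
        # The abstract goes into the page body, not a property.
        payload["children"] = [{
            "object": "block",
            "type": "quote",
            "quote": {"rich_text": rich_text(abstract)},
        }]
    return payload
```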

Metadata extraction

If URL is arXiv

Accept input forms:
- https://arxiv.org/abs/<id>
- https://arxiv.org/pdf/<id>.pdf
- bare <id> like 2501.01234
Fetch metadata from one of:
1. arXiv API: http://export.arxiv.org/api/query?id_list=<id> (Atom XML)
2. Fallback: scrape the arXiv abs page.
Extract:
- title
- authors
- abstract
- published date (optional)
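
Parsing the Atom response might look like the sketch below, which uses only the standard library. parse_arxiv_atom is a hypothetical name; it assumes the XML text has already been fetched from the API URL above and reads the first entry's title, author names, summary, and published date.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def parse_arxiv_atom(xml_text):
    """Extract title, authors, abstract, and date from an arXiv Atom feed."""
    root = ET.fromstring(xml_text)
    entry = root.find("atom:entry", ATOM)
    if entry is None:
        return None  # no entry in the feed

    def text(tag):
        value = entry.findtext(f"atom:{tag}", default="", namespaces=ATOM)
        return " ".join(value.split())  # collapse line breaks and padding

    return {
        "title": text("title"),
        "authors": [" ".join(n.text.split())
                    for n in entry.findall("atom:author/atom:name", ATOM)
                    if n.text],
        "abstract": text("summary"),
        "published": text("published")[:10],  # keep YYYY-MM-DD only
    }
```

Whitespace is collapsed because arXiv wraps titles and summaries across lines in the feed.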

If URL is not arXiv

Use web_fetch to get the page and extract:
- title (best guess)
- author list if present
- abstract/summary if present
If not reliably available, create entry with just Name + Link and leave placeholders.
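
One way to do the best-guess extraction, assuming raw HTML is available (web_fetch may return processed text instead, in which case this does not apply). Many scholarly sites expose citation_title and citation_author meta tags; the sketch reads those first, then falls back to Open Graph tags and the <title> element. MetaExtractor and extract_metadata are hypothetical names, not part of the script.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta> name/property values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            content = attrs.get("content")
            if key and content:
                self.meta.setdefault(key, []).append(content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)

def extract_metadata(html):
    """Return best-guess title, authors, and abstract; empty when absent."""
    parser = MetaExtractor()
    parser.feed(html)
    meta = parser.meta

    def first(*keys):
        return next((meta[k][0] for k in keys if k in meta), "")

    page_title = " ".join("".join(parser.title_parts).split())
    return {
        "title": first("citation_title", "og:title") or page_title,
        "authors": meta.get("citation_author", []),
        "abstract": first("citation_abstract", "description", "og:description"),
    }
```

Empty authors or abstract correspond to the Name + Link only case described above.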

Suggested property mapping (when confident)

Publisher: set to bioRxiv, arXiv, NeurIPS, etc. only if clearly indicated.
Type: default to Academic Journal for papers; Blog Post for posts.
Status: default to Ready to Start.
Discipline: add tags like ML, MechInterp, Geometry, ScienceofDL if strongly implied.

UX

Before writing:

Confirm the normalized link + the guessed title.

After writing:

Return the created/updated Notion page URL.
Briefly list what fields were populated.