arxiv - SKILL.md Agent Skill

name: arxiv
description: Search and retrieve academic papers from arXiv using their free REST API. No API key needed. Search by keyword, author, category, or ID. Combine with web_extract or the ocr-and-documents skill to read full paper content.
version: 1.0.0
metadata:
  echo:
    tags: [Research, Arxiv, Papers, Academic, Science, API]

arXiv Research

Search and retrieve academic papers from arXiv via their free REST API. No API key, no dependencies — just curl.

Quick Reference

Action	Command
Search papers	`curl "https://export.arxiv.org/api/query?search_query=all:QUERY&max_results=5"`
Get specific paper	`curl "https://export.arxiv.org/api/query?id_list=2402.03300"`
Read abstract (web)	`web_extract(urls=["https://arxiv.org/abs/2402.03300"])`
Read full paper (PDF)	`web_extract(urls=["https://arxiv.org/pdf/2402.03300"])`

Searching Papers

The API returns Atom XML. Parse with grep/sed or pipe through python3 for clean output.

Basic search

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5"

Clean output (parse XML to readable format)

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for i, entry in enumerate(root.findall('a:entry', ns)):
    # Collapse hard-wrapped whitespace so titles and abstracts print on one line
    title = ' '.join(entry.find('a:title', ns).text.split())
    arxiv_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    published = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    summary = ' '.join(entry.find('a:summary', ns).text.split())[:200]
    cats = ', '.join(c.get('term') for c in entry.findall('a:category', ns))
    print(f'{i+1}. [{arxiv_id}] {title}')
    print(f'   Authors: {authors}')
    print(f'   Published: {published} | Categories: {cats}')
    print(f'   Abstract: {summary}...')
    print(f'   PDF: https://arxiv.org/pdf/{arxiv_id}')
    print()
"

Helper script

For repeated searches, use the bundled script:

python3 scripts/search_arxiv.py "GRPO reinforcement learning"
python3 scripts/search_arxiv.py --author "Yann LeCun" --max 5
python3 scripts/search_arxiv.py --category cs.AI --sort date --max 10
python3 scripts/search_arxiv.py --id 2402.03300

Search Query Syntax

Prefix	Searches	Example
`all:`	All fields	`all:transformer+attention`
`ti:`	Title	`ti:large+language+models`
`au:`	Author	`au:vaswani`
`abs:`	Abstract	`abs:reinforcement+learning`
`cat:`	Category	`cat:cs.AI`
`co:`	Comment	`co:accepted+NeurIPS`

Boolean operators

# AND (default when using +)
search_query=all:transformer+attention

# OR
search_query=all:GPT+OR+all:BERT

# AND NOT
search_query=all:language+model+ANDNOT+all:vision

# Exact phrase (encode the double quotes as %22 so they survive inside a quoted curl URL)
search_query=ti:%22chain+of+thought%22

# Combined
search_query=au:hinton+AND+cat:cs.LG
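
When queries are assembled in code rather than typed by hand, the encoding rules above are easy to get wrong. The following is a minimal sketch, assuming Python 3.9+ and only the standard library; `field` and `build_search_url` are hypothetical helper names, not part of the bundled script.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"  # endpoint used throughout this document


def field(prefix: str, text: str) -> str:
    """Return one fielded clause, quoting multi-word text as an exact phrase."""
    text = text.strip()
    if " " in text:
        text = f'"{text}"'
    return f"{prefix}:{text}"


def build_search_url(clauses: list[str], operator: str = "AND", max_results: int = 5) -> str:
    """Join clauses with one boolean operator and URL-encode the result."""
    if operator not in {"AND", "OR", "ANDNOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    query = f" {operator} ".join(clauses)
    params = {"search_query": query, "max_results": max_results}
    # urlencode turns spaces into '+' and double quotes into %22;
    # safe=':' keeps the field prefixes readable.
    return f"{ARXIV_API}?{urlencode(params, safe=':')}"


print(build_search_url([field("au", "hinton"), field("cat", "cs.LG")]))
print(build_search_url([field("ti", "chain of thought")]))
```

The helper applies a single operator across all clauses, which keeps it short; mixed expressions such as `A AND (B OR C)` would need explicit grouping.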

Sort and Pagination

Parameter	Options
`sortBy`	`relevance`, `lastUpdatedDate`, `submittedDate`
`sortOrder`	`ascending`, `descending`
`start`	Result offset (0-based)
`max_results`	Number of results (default 10; up to 30000 in total, fetched in slices of at most 2000 per request)

# Latest 10 papers in cs.AI
curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.AI&sortBy=submittedDate&sortOrder=descending&max_results=10"
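
To walk through a larger result set, `start` and `max_results` can be stepped together. This is a sketch under stated assumptions: `page_urls` is a hypothetical helper, and the page size of 100 is an arbitrary choice. It only generates the request URLs; fetching them is left to the caller.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def page_urls(search_query: str, total: int, page_size: int = 100):
    """Yield one request URL per page until `total` results are covered."""
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "sortBy": "submittedDate",
            "sortOrder": "descending",
            "start": start,  # 0-based offset of the first result on this page
            "max_results": min(page_size, total - start),  # last page may be short
        }
        yield f"{ARXIV_API}?{urlencode(params, safe=':')}"


for url in page_urls("cat:cs.AI", total=250):
    print(url)
```

Pause about 3 seconds between the resulting requests to stay within the arXiv rate limit.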

Fetching Specific Papers

# By arXiv ID
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300"

# Multiple papers
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300,2401.12345,2403.00001"

BibTeX Generation

After fetching metadata for a paper, generate a BibTeX entry:

{% raw %}

curl -s "https://export.arxiv.org/api/query?id_list=1706.03762" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
root = ET.parse(sys.stdin).getroot()
entry = root.find('a:entry', ns)
if entry is None: sys.exit('Paper not found')
title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
authors = ' and '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
year = entry.find('a:published', ns).text[:4]
raw_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
cat = entry.find('arxiv:primary_category', ns)
primary = cat.get('term') if cat is not None else 'cs.LG'
last_name = entry.find('a:author', ns).find('a:name', ns).text.split()[-1]
print(f'@article{{{last_name}{year}_{raw_id.replace(\".\", \"\")},')
print(f'  title     = {{{title}}},')
print(f'  author    = {{{authors}}},')
print(f'  year      = {{{year}}},')
print(f'  eprint    = {{{raw_id}}},')
print(f'  archivePrefix = {{arXiv}},')
print(f'  primaryClass  = {{{primary}}},')
print(f'  url       = {{https://arxiv.org/abs/{raw_id}}}')
print('}')
"

{% endraw %}

Reading Paper Content

After finding a paper, read it:

# Abstract page (fast, metadata + abstract)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper (PDF → markdown via Firecrawl)
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

Common Categories

Category	Field
`cs.AI`	Artificial Intelligence
`cs.CL`	Computation and Language (NLP)
`cs.CV`	Computer Vision
`cs.LG`	Machine Learning
`cs.CR`	Cryptography and Security
`stat.ML`	Machine Learning (Statistics)
`math.OC`	Optimization and Control
`physics.comp-ph`	Computational Physics

Full list: https://arxiv.org/category_taxonomy

Semantic Scholar Integration

For citation data and related papers (JSON, no auth needed for basic use):

# Citation count
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300?fields=citationCount,influentialCitationCount"

# References
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/references?fields=title,citationCount&limit=20"

# Author search
curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=Yann+LeCun"
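
The JSON responses are easy to post-process in Python. The sketch below builds the paper URL shown above and summarizes a response body. `paper_url` and `summarize_citations` are hypothetical names, and the sample body is made up for illustration, so real counts will differ.

```python
import json

S2_API = "https://api.semanticscholar.org/graph/v1"


def paper_url(arxiv_id: str, fields: list[str]) -> str:
    """Build the paper lookup URL for an arXiv ID."""
    return f"{S2_API}/paper/arXiv:{arxiv_id}?fields={','.join(fields)}"


def summarize_citations(body: str) -> str:
    """Turn a JSON response body into a one-line summary."""
    data = json.loads(body)
    total = data.get("citationCount") or 0  # treat missing or null as 0
    influential = data.get("influentialCitationCount") or 0
    return f"{total} citations ({influential} influential)"


# Made-up response body, for illustration only.
sample = '{"paperId": "abc123", "citationCount": 120, "influentialCitationCount": 15}'

print(paper_url("2402.03300", ["citationCount", "influentialCitationCount"]))
print(summarize_citations(sample))
```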

Typical Research Workflow

1. Search: curl the arXiv API or use scripts/search_arxiv.py
2. Assess impact: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID?fields=citationCount,influentialCitationCount"
3. Read abstract: web_extract(urls=["https://arxiv.org/abs/ID"])
4. Read full paper: web_extract(urls=["https://arxiv.org/pdf/ID"])
5. Find related work: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID/references?fields=title,citationCount&limit=20"
6. Get recommendations: POST to Semantic Scholar recommendations endpoint
7. Track authors: curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=NAME"
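
The per-paper steps of this workflow all derive from one arXiv ID, so the URLs can be generated together. A minimal sketch, assuming Python 3.9+; `workflow_urls` is a hypothetical helper and covers only the steps whose URLs are spelled out in this document.

```python
def workflow_urls(arxiv_id: str) -> dict[str, str]:
    """Map each per-paper workflow step to the URL it needs."""
    s2 = f"https://api.semanticscholar.org/graph/v1/paper/arXiv:{arxiv_id}"
    return {
        "metadata": f"https://export.arxiv.org/api/query?id_list={arxiv_id}",
        "impact": f"{s2}?fields=citationCount,influentialCitationCount",
        "abstract": f"https://arxiv.org/abs/{arxiv_id}",
        "full_paper": f"https://arxiv.org/pdf/{arxiv_id}",
        "related": f"{s2}/references?fields=title,citationCount&limit=20",
    }


for step, url in workflow_urls("2402.03300").items():
    print(f"{step}: {url}")
```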

Rate Limits

API	Rate	Auth
arXiv	~1 req / 3 seconds	None needed
Semantic Scholar	Shared pool when unauthenticated (expect HTTP 429 under load); about 1 req / second with an API key	None needed (API key optional)
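
One way to respect these limits in a script is to space requests with a small throttle. The class below is a sketch, not part of the skill: `Throttle` is a hypothetical name, and the clock and sleep functions are injectable only so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first

    def wait(self) -> float:
        """Sleep if needed and return how long the caller was delayed."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay


arxiv_throttle = Throttle(3.0)  # roughly one arXiv request every 3 seconds
```

Call `arxiv_throttle.wait()` immediately before each arXiv request; the first call returns at once and later calls block only for the remaining interval.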

Notes

arXiv returns Atom XML — use the helper script or parsing snippet for clean output
Semantic Scholar returns JSON — pipe through python3 -m json.tool for readability
arXiv IDs: old format (hep-th/0601001) vs new (2402.03300)
PDF: https://arxiv.org/pdf/{id} — Abstract: https://arxiv.org/abs/{id}
HTML (when available): https://arxiv.org/html/{id}

ID Versioning

arxiv.org/abs/1706.03762 always resolves to the latest version
arxiv.org/abs/1706.03762v1 points to a specific immutable version
When generating citations, preserve the version suffix you actually read to prevent citation drift (a later version may substantially change content)
The API <id> field returns the versioned URL (e.g., http://arxiv.org/abs/1706.03762v7)
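
Separating the base ID from the version suffix makes it easier to record exactly which version was read. The regular expression below is an approximation that covers the two ID formats mentioned in this document; `split_arxiv_id` is a hypothetical helper name.

```python
import re

# New-style IDs look like 1706.03762; old-style IDs look like hep-th/0601001.
_ID_RE = re.compile(
    r"^(?P<base>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})"
    r"(?:v(?P<version>\d+))?$"
)


def split_arxiv_id(value: str):
    """Return (base_id, version or None) from a bare ID or an /abs/ URL."""
    candidate = value.strip().split("/abs/")[-1]
    match = _ID_RE.match(candidate)
    if match is None:
        raise ValueError(f"not an arXiv ID: {value!r}")
    version = match.group("version")
    return match.group("base"), int(version) if version else None


print(split_arxiv_id("http://arxiv.org/abs/1706.03762v7"))  # ('1706.03762', 7)
print(split_arxiv_id("1706.03762"))                         # ('1706.03762', None)
print(split_arxiv_id("hep-th/0601001v2"))                   # ('hep-th/0601001', 2)
```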

Withdrawn Papers

Papers can be withdrawn after submission. When this happens:

The <summary> field contains a withdrawal notice (look for "withdrawn" or "retracted")
Metadata fields may be incomplete
Always check the summary before treating a result as a valid paper
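
The summary check can be folded into the parsing step. This sketch filters entries using the two keywords suggested above; `looks_withdrawn` is a hypothetical helper, and the feed is hand-written for illustration rather than real arXiv data.

```python
import xml.etree.ElementTree as ET

ATOM = {"a": "http://www.w3.org/2005/Atom"}
WITHDRAWAL_MARKERS = ("withdrawn", "retracted")  # keywords suggested above


def looks_withdrawn(entry: ET.Element) -> bool:
    """Return True if the entry's summary mentions a withdrawal keyword."""
    summary = entry.findtext("a:summary", default="", namespaces=ATOM)
    text = summary.lower()
    return any(marker in text for marker in WITHDRAWAL_MARKERS)


# Hand-written feed for illustration only; not real arXiv data.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Paper A</title><summary>We study sparse attention.</summary></entry>
  <entry><title>Paper B</title><summary>This paper has been withdrawn by the author.</summary></entry>
</feed>"""

root = ET.fromstring(SAMPLE_FEED)
valid = [
    entry.findtext("a:title", namespaces=ATOM)
    for entry in root.findall("a:entry", ATOM)
    if not looks_withdrawn(entry)
]
print(valid)  # ['Paper A']
```

A keyword match is only a heuristic: a paper that discusses retractions as its subject would also be flagged, so treat a hit as a prompt to read the summary rather than as proof.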