name: arxiv
description: Search and retrieve academic papers from arXiv using their free REST API. No API key needed. Search by keyword, author, category, or ID. Combine with web_extract or the ocr-and-documents skill to read full paper content.
version: 1.0.0
metadata:
  echo:
    tags: [Research, Arxiv, Papers, Academic, Science, API]
arXiv Research
Search and retrieve academic papers from arXiv via their free REST API. No API key, no dependencies — just curl.
Quick Reference
| Action | Command |
|---|---|
| Search papers | curl "https://export.arxiv.org/api/query?search_query=all:QUERY&max_results=5" |
| Get specific paper | curl "https://export.arxiv.org/api/query?id_list=2402.03300" |
| Read abstract (web) | web_extract(urls=["https://arxiv.org/abs/2402.03300"]) |
| Read full paper (PDF) | web_extract(urls=["https://arxiv.org/pdf/2402.03300"]) |
Searching Papers
The API returns Atom XML. Parse with grep/sed or pipe through python3 for clean output.
Basic search
curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5"
Clean output (parse XML to readable format)
curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for i, entry in enumerate(root.findall('a:entry', ns)):
    title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
    arxiv_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    published = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    summary = entry.find('a:summary', ns).text.strip()[:200]
    cats = ', '.join(c.get('term') for c in entry.findall('a:category', ns))
    print(f'{i+1}. [{arxiv_id}] {title}')
    print(f'   Authors: {authors}')
    print(f'   Published: {published} | Categories: {cats}')
    print(f'   Abstract: {summary}...')
    print(f'   PDF: https://arxiv.org/pdf/{arxiv_id}')
    print()
"
Helper script
For repeated searches, use the bundled script:
python3 scripts/search_arxiv.py "GRPO reinforcement learning"
python3 scripts/search_arxiv.py --author "Yann LeCun" --max 5
python3 scripts/search_arxiv.py --category cs.AI --sort date --max 10
python3 scripts/search_arxiv.py --id 2402.03300
Search Query Syntax
| Prefix | Searches | Example |
|---|---|---|
| all: | All fields | all:transformer+attention |
| ti: | Title | ti:large+language+models |
| au: | Author | au:vaswani |
| abs: | Abstract | abs:reinforcement+learning |
| cat: | Category | cat:cs.AI |
| co: | Comment | co:accepted+NeurIPS |
Boolean operators
# AND (default when using +)
search_query=all:transformer+attention
# OR
search_query=all:GPT+OR+all:BERT
# AND NOT
search_query=all:language+model+ANDNOT+all:vision
# Exact phrase (URL-encode the double quotes as %22 so they survive the shell and the URL)
search_query=ti:%22chain+of+thought%22
# Combined
search_query=au:hinton+AND+cat:cs.LG
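Hand-writing `+` and `%22` gets error-prone once queries are generated in code. The sketch below builds the same kind of query URL with the standard library. The function name `build_search_url` and its parameters are illustrative only and are not part of the arXiv API or the bundled script.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_search_url(clauses, operator="AND", max_results=5, start=0):
    """Join (prefix, text) clauses with one Boolean operator into a query URL.

    Hypothetical helper: `clauses` is a list such as [("au", "hinton")].
    """
    parts = []
    for prefix, text in clauses:
        # Wrap multi-word values in quotes so they are sent as an exact phrase.
        value = f'"{text}"' if " " in text else text
        parts.append(f"{prefix}:{value}")
    query = f" {operator} ".join(parts)
    # urlencode turns spaces into '+', quotes into %22 and ':' into %3A.
    params = {"search_query": query, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


print(build_search_url([("au", "hinton"), ("cat", "cs.LG")]))
print(build_search_url([("ti", "chain of thought")], max_results=1))
```

The encoded colon (`%3A`) is decoded by the server back to `:`, so the request is equivalent to the hand-written forms above.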
Sort and Pagination
| Parameter | Options |
|---|---|
| sortBy | relevance, lastUpdatedDate, submittedDate |
| sortOrder | ascending, descending |
| start | Result offset (0-based) |
| max_results | Number of results (default 10, max 30000) |
# Latest 10 papers in cs.AI
curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.AI&sortBy=submittedDate&sortOrder=descending&max_results=10"
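To walk through a large result set, `start` and `max_results` can be stepped together. The following is a minimal sketch of that paging arithmetic; `page_urls` and the page size of 100 are assumptions chosen for illustration, and the generator only builds URLs without sending any requests.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def page_urls(search_query, total, page_size=100):
    """Yield one query URL per page until `total` results are covered."""
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "sortBy": "submittedDate",
            "sortOrder": "descending",
            "start": start,  # 0-based offset of the first result on this page
            "max_results": min(page_size, total - start),  # last page may be short
        }
        yield f"{ARXIV_API}?{urlencode(params)}"


for url in page_urls("cat:cs.AI", total=250):
    print(url)
```

Each URL should be fetched with a pause between requests, in line with the rate limits listed later in this document.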
Fetching Specific Papers
# By arXiv ID
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300"
# Multiple papers
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300,2401.12345,2403.00001"
BibTeX Generation
After fetching metadata for a paper, generate a BibTeX entry:
{% raw %}
curl -s "https://export.arxiv.org/api/query?id_list=1706.03762" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
root = ET.parse(sys.stdin).getroot()
entry = root.find('a:entry', ns)
if entry is None: sys.exit('Paper not found')
title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
authors = ' and '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
year = entry.find('a:published', ns).text[:4]
raw_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
cat = entry.find('arxiv:primary_category', ns)
primary = cat.get('term') if cat is not None else 'cs.LG'
last_name = entry.find('a:author', ns).find('a:name', ns).text.split()[-1]
print(f'@article{{{last_name}{year}_{raw_id.replace(\".\", \"\")},')
print(f' title = {{{title}}},')
print(f' author = {{{authors}}},')
print(f' year = {{{year}}},')
print(f' eprint = {{{raw_id}}},')
print(f' archivePrefix = {{arXiv}},')
print(f' primaryClass = {{{primary}}},')
print(f' url = {{https://arxiv.org/abs/{raw_id}}}')
print('}')
"
{% endraw %}
Reading Paper Content
After finding a paper, read it:
# Abstract page (fast, metadata + abstract)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])
# Full paper (PDF → markdown via Firecrawl)
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
Common Categories
| Category | Field |
|---|---|
| cs.AI | Artificial Intelligence |
| cs.CL | Computation and Language (NLP) |
| cs.CV | Computer Vision |
| cs.LG | Machine Learning |
| cs.CR | Cryptography and Security |
| stat.ML | Machine Learning (Statistics) |
| math.OC | Optimization and Control |
| physics.comp-ph | Computational Physics |
Full list: https://arxiv.org/category_taxonomy
Semantic Scholar Integration
For citation data and related papers (JSON, no auth needed for basic use):
# Citation count
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300?fields=citationCount,influentialCitationCount"
# References
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/references?fields=title,citationCount&limit=20"
# Author search
curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=Yann+LeCun"
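Once the references JSON is in hand, ranking it by citation count is a quick way to spot the key prior work. The sketch below assumes the references endpoint wraps each item as `{"citedPaper": {...}}` inside a top-level `data` list; verify that shape against a live response before relying on it. The helper name `top_references` and the sample payload are made up for illustration and are not real citation data.

```python
import json


def top_references(payload, n=3):
    """Return (title, citationCount) pairs for the most-cited references."""
    # Assumed response shape: {"data": [{"citedPaper": {...}}, ...]}
    papers = [item.get("citedPaper") or {} for item in payload.get("data", [])]
    # Treat a missing or null citationCount as zero when ranking.
    ranked = sorted(papers, key=lambda p: p.get("citationCount") or 0, reverse=True)
    return [(p.get("title"), p.get("citationCount") or 0) for p in ranked[:n]]


# Made-up payload in the assumed shape, standing in for the curl output.
sample = json.loads("""
{"data": [
  {"citedPaper": {"title": "Paper A", "citationCount": 12}},
  {"citedPaper": {"title": "Paper B", "citationCount": 340}},
  {"citedPaper": {"title": "Paper C", "citationCount": null}}
]}
""")

print(top_references(sample, n=2))
```

In practice the payload would come from `json.load(sys.stdin)` with the `curl` output piped in.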
Typical Research Workflow
- Search: curl the arXiv API or use scripts/search_arxiv.py
- Assess impact: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID?fields=citationCount,influentialCitationCount"
- Read abstract: web_extract(urls=["https://arxiv.org/abs/ID"])
- Read full paper: web_extract(urls=["https://arxiv.org/pdf/ID"])
- Find related work: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID/references?fields=title,citationCount&limit=20"
- Get recommendations: POST to Semantic Scholar recommendations endpoint
- Track authors: curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=NAME"
Rate Limits
| API | Rate | Auth |
|---|---|---|
| arXiv | ~1 req / 3 seconds | None needed |
| Semantic Scholar | 1 req / second | None (100/sec with API key) |
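When a script issues several requests in a row, a small throttle keeps it inside these limits. This is one possible sketch, not a prescribed client. The `Throttle` class is hypothetical, and the intervals of 3 and 1 seconds are taken from the table above.

```python
import time


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None    # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


arxiv_throttle = Throttle(3.0)             # ~1 request every 3 seconds
semantic_scholar_throttle = Throttle(1.0)  # 1 request per second without a key
```

Call `arxiv_throttle.wait()` immediately before each arXiv request; the first call returns at once and later calls sleep only for the time still owed.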
Notes
- arXiv returns Atom XML; use the helper script or parsing snippet for clean output
- Semantic Scholar returns JSON; pipe through python3 -m json.tool for readability
- arXiv IDs: old format (hep-th/0601001) vs new (2402.03300)
- PDF: https://arxiv.org/pdf/{id}; abstract: https://arxiv.org/abs/{id}
- HTML (when available): https://arxiv.org/html/{id}
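The two ID formats and the URL patterns in the notes above can be checked and expanded programmatically. The regular expressions below are a simplified approximation of the two formats, not an official grammar, and the function names `id_style` and `paper_urls` are illustrative.

```python
import re

# New style: YYMM.NNNN or YYMM.NNNNN, optionally followed by a version suffix.
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old style: archive (optionally with a subject class) / 7 digits, e.g. hep-th/0601001.
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def id_style(arxiv_id):
    """Return 'new', 'old', or None if the string matches neither pattern."""
    if NEW_STYLE.match(arxiv_id):
        return "new"
    if OLD_STYLE.match(arxiv_id):
        return "old"
    return None


def paper_urls(arxiv_id):
    """Build the abstract, PDF, and HTML URLs for an ID (HTML may not exist)."""
    return {kind: f"https://arxiv.org/{kind}/{arxiv_id}" for kind in ("abs", "pdf", "html")}


print(id_style("2402.03300"), id_style("hep-th/0601001"))
print(paper_urls("2402.03300")["pdf"])
```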
ID Versioning
- arxiv.org/abs/1706.03762 always resolves to the latest version
- arxiv.org/abs/1706.03762v1 points to a specific immutable version
- When generating citations, preserve the version suffix you actually read to prevent citation drift (a later version may substantially change content)
- The API <id> field returns the versioned URL (e.g., http://arxiv.org/abs/1706.03762v7)
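To keep the version that was actually read, it helps to separate the base ID from the suffix in the `<id>` value. This is a minimal sketch; `split_version` is a hypothetical helper that accepts either a bare ID or the full `/abs/` URL.

```python
import re

# Lazy base, then an optional trailing "v<digits>" version suffix.
_VERSION_RE = re.compile(r"^(?P<base>.+?)(?:v(?P<version>\d+))?$")


def split_version(id_or_url):
    """Return (base_id, version), where version is an int or None."""
    # Drop everything up to and including "/abs/" if a URL was passed in.
    arxiv_id = id_or_url.strip().split("/abs/")[-1]
    match = _VERSION_RE.match(arxiv_id)
    version = match.group("version")
    return match.group("base"), (int(version) if version else None)


base, version = split_version("http://arxiv.org/abs/1706.03762v7")
print(base, version)
```

Storing both parts lets a citation use the versioned ID while lookups and deduplication use the base ID.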
Withdrawn Papers
Papers can be withdrawn after submission. When this happens:
- The <summary> field contains a withdrawal notice (look for "withdrawn" or "retracted")
- Metadata fields may be incomplete
- Always check the summary before treating a result as a valid paper
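The summary check described above can be applied as a filter while parsing the feed. The sketch below uses plain keyword matching, which is an assumption rather than an official withdrawal flag. The helper names and the sample feed are invented for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = {"a": "http://www.w3.org/2005/Atom"}
WITHDRAWAL_MARKERS = ("withdrawn", "retracted")


def looks_withdrawn(entry):
    """Heuristic: True if the entry's summary mentions a withdrawal marker."""
    summary = entry.findtext("a:summary", default="", namespaces=ATOM)
    text = summary.lower()
    return any(marker in text for marker in WITHDRAWAL_MARKERS)


def valid_entries(atom_xml):
    """Parse an Atom feed string and drop entries that look withdrawn."""
    root = ET.fromstring(atom_xml)
    return [e for e in root.findall("a:entry", ATOM) if not looks_withdrawn(e)]


# Invented two-entry feed standing in for a real API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Kept</title><summary>We study X.</summary></entry>
  <entry><title>Dropped</title><summary>This paper has been withdrawn by the author.</summary></entry>
</feed>"""

titles = [e.findtext("a:title", namespaces=ATOM) for e in valid_entries(SAMPLE_FEED)]
print(titles)
```

Because this is keyword matching, a paper whose abstract merely discusses retractions would also be flagged, so flagged entries are worth a manual look before discarding them.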