arxiv

name: arxiv
description: arXiv preprint server: keyword search the official API and download papers as PDF / HTML / source tarball. Use whenever the user mentions an arXiv ID (e.g. 2106.04559) or wants preprints on a topic in CS / physics / math / quantitative biology. For biomedical preprints specifically, prefer the biorxiv skill.
license: Unknown (arXiv content is under various licenses; see paper metadata)
metadata:
    skill-author: VenusFactory2

Overview

Two complementary tools: the existing query_arxiv (text search; returns a JSON list inline) and the new download_arxiv_paper_by_id (fetches the paper itself to disk as a PDF, HTML, or source tarball).

Project Tools (VenusFactory2)

Tool	Args	Returns	Description
query_arxiv	`query` (text), `max_results` (default 5, max 50), `max_content_length` (default 10000)	JSON list of paper records (title, abstract, arxiv_id, authors, etc.) inline	Search-only; small JSON returned to agent context.
download_arxiv_paper_by_id	`arxiv_id` (e.g. `2106.04559`, `2106.04559v2`, `hep-th/9510017`), `out_dir`, `format` (`pdf` | `html` | `source`; default `pdf`), `timeout` (default 60 s)	rich JSON envelope; file at `file_info.file_path` (`arxiv_<id>.pdf` etc.)	Download the actual paper.

When to Use Each

Goal	Tool
"Find papers about X"	`query_arxiv`
User provided an arXiv ID and wants the PDF	`download_arxiv_paper_by_id`
Build a literature review pipeline (search → read PDFs)	`query_arxiv` → `download_arxiv_paper_by_id` for each result
Get the LaTeX source / figures for a paper	`download_arxiv_paper_by_id` with `format=source` (tar.gz)

ID Formats Accepted

Modern: 2106.04559, 2106.04559v2 (with version)
Old-style: hep-th/9510017, cond-mat/0411174
The tool strips the arxiv: / arXiv: prefix if present.
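
The accepted ID shapes above can be sketched as a small validator. This is only an illustration: `normalize_arxiv_id` and both regular expressions are hypothetical and are not taken from the tool's source, which may accept or reject inputs differently.

```python
import re

# Hypothetical patterns, written from the ID shapes listed above.
# Modern IDs: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
_MODERN = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style IDs: archive (optionally with a subject class) / 7 digits.
_OLD_STYLE = re.compile(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def normalize_arxiv_id(raw: str) -> str:
    """Strip an optional 'arxiv:' / 'arXiv:' prefix and check the ID shape."""
    candidate = raw.strip()
    if candidate.lower().startswith("arxiv:"):
        candidate = candidate[len("arxiv:"):].strip()
    if _MODERN.match(candidate) or _OLD_STYLE.match(candidate):
        return candidate
    raise ValueError(f"not a recognised arXiv ID: {raw!r}")
```

Validating before calling the download tool avoids spending a request (and a rate-limit slot) on an ID that cannot resolve.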

Format Details

`format`	URL pattern	When to use
`pdf` (default)	`https://arxiv.org/pdf/<id>.pdf`	Read the paper
`html`	`https://arxiv.org/html/<id>`	Modern papers only (post-2023); returns 404 for older papers
`source`	`https://export.arxiv.org/e-print/<id>`	LaTeX source as `.tar.gz` (for reproducing figures, extracting data, etc.)
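
As a rough sketch of how the table maps a `format` value to a URL, the patterns can be held in a dictionary. The URL templates are copied from the table; `URL_PATTERNS` and `build_url` are hypothetical names for illustration, not the tool's actual internals.

```python
# URL templates copied from the table above; the names are illustrative only.
URL_PATTERNS = {
    "pdf": "https://arxiv.org/pdf/{id}.pdf",
    "html": "https://arxiv.org/html/{id}",
    "source": "https://export.arxiv.org/e-print/{id}",
}


def build_url(arxiv_id: str, fmt: str = "pdf") -> str:
    """Return the download URL for an arXiv ID in the requested format."""
    try:
        template = URL_PATTERNS[fmt]
    except KeyError:
        allowed = ", ".join(sorted(URL_PATTERNS))
        raise ValueError(f"unknown format {fmt!r}; expected one of: {allowed}") from None
    return template.format(id=arxiv_id)
```

For example, `build_url("2106.04559v2", "html")` yields `https://arxiv.org/html/2106.04559v2`.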

Rate Limiting

arXiv asks clients to leave about 3 seconds between requests. The tool does not enforce this internally, so add the delay yourself when calling it in a loop.
Bulk downloads should use the arXiv full-text bulk dataset on AWS S3, not this tool.
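
One way to add that spacing is a small wrapper around whatever function invokes the download tool. This is a minimal sketch: `download_spaced` is a hypothetical helper, and `download` stands in for your own call to download_arxiv_paper_by_id, whose exact invocation depends on the agent runtime.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def download_spaced(
    arxiv_ids: Iterable[str],
    download: Callable[[str], T],
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Call `download` once per ID, pausing `delay_s` seconds between calls."""
    results: List[T] = []
    for index, arxiv_id in enumerate(arxiv_ids):
        if index > 0:
            # Pause before every request except the first.
            sleep(delay_s)
        results.append(download(arxiv_id))
    return results
```

The `sleep` parameter is injectable only so the pacing can be tested without real waiting; in normal use the default `time.sleep` applies.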

Common Mistakes

HTML format on an old paper: 404. Fall back to pdf.
PDF response is HTML: arXiv occasionally serves an error page with 200 OK; the tool checks for the %PDF magic bytes and returns DownloadError if the response is not a real PDF.
No abstract from download_arxiv_paper_by_id: this tool only downloads; for abstract / metadata, use query_arxiv (or fetch the abstract page via WebFetch).
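
The first two mistakes can be guarded against on the caller's side. The sketch below assumes a hypothetical `download(arxiv_id, fmt)` callable that raises an exception on failure; how the real tool surfaces DownloadError (exception or field in the JSON envelope) is not specified here, so adapt the error handling accordingly.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the %PDF magic bytes."""
    return data.startswith(b"%PDF")


def download_with_fallback(
    arxiv_id: str,
    download: Callable[[str, str], T],
    formats: Sequence[str] = ("html", "pdf"),
) -> T:
    """Try each format in order and return the first successful result."""
    last_error = None
    for fmt in formats:
        try:
            return download(arxiv_id, fmt)
        except Exception as exc:  # broad on purpose: the real error type is an assumption
            last_error = exc
    raise RuntimeError(f"all formats failed for {arxiv_id}") from last_error
```

With the default order, a 404 on `html` for an older paper falls through to `pdf` automatically.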