mosaic - SKILL.md Agent Skill

name: mosaic description: > Expert knowledge of MOSAIC (Multi-source Scientific Article Indexer and Collector) — a CLI tool for searching, downloading, and managing scientific papers from 21 sources with a single command. Use this skill whenever the user asks about: building a bibliography programmatically, searching for papers across multiple sources, downloading OA PDFs, formatting citation strings (BibTeX/APA/ MLA/Chicago), exporting to BibTeX/Zotero/Obsidian, interpreting mosaic --json output in AI agent or CI workflows, RAG over a paper library, semantic search over a local paper library, finding similar papers, analysing citation networks, comparing papers across structured dimensions, or any task that involves mosaic search/get/cite/similar/ask/chat/index/network/compare/skill commands. When in doubt, trigger this skill — it is better to consult it unnecessarily than to miss it.

MOSAIC Expert Knowledge

MOSAIC fans out paper searches across 21 scientific sources, deduplicates results by DOI, caches them in a local SQLite database, and can download OA PDFs. It provides structured JSON output for AI agent and CI workflows.

CLI Commands

mosaic search "query"           # search all enabled sources
mosaic get <doi>                # fetch metadata + download PDF by DOI
mosaic cite <doi>               # format citation string (BibTeX/APA/MLA/Chicago/…)
mosaic similar <doi|arxiv_id>   # find related papers via OpenAlex + Semantic Scholar
mosaic network                  # explore citation network, identify hubs and clusters
mosaic compare                  # structured comparison table across cached papers (LLM or metadata)
mosaic index                    # build/update vector index for RAG
mosaic ask "question"           # RAG Q&A over cached papers
mosaic chat                     # interactive multi-turn RAG session
mosaic config --show            # view or edit configuration
mosaic cache list               # inspect local SQLite cache
mosaic cache stats              # cache statistics
mosaic notebook create "topic"  # create a Google NotebookLM notebook
mosaic auth login elsevier      # browser session for authenticated PDF access
mosaic skill install            # install this Claude Code skill to the current project
mosaic skill install --global   # install to ~/.claude/skills/ (available in all projects)
mosaic skill show               # print skill content to stdout

JSON Output (scripting / AI agents)

Add --json to search or similar for machine-readable stdout. All rich table output is suppressed; results are written to stdout as a single JSON object. Papers are still saved to the local cache so subsequent --cached queries work immediately.

mosaic search "attention mechanism" --max 20 --oa-only --json
mosaic similar 10.48550/arXiv.1706.03762 --max 15 --json

JSON schema — search

{
  "status": "ok",
  "query": "attention mechanism",
  "count": 3,
  "papers": [
    {
      "title": "Attention Is All You Need",
      "authors": ["Vaswani, Ashish", "Shazeer, Noam"],
      "year": 2017,
      "doi": "10.48550/arXiv.1706.03762",
      "arxiv_id": "1706.03762",
      "pii": null,
      "abstract": "The dominant sequence transduction models...",
      "journal": null,
      "volume": null,
      "issue": null,
      "pages": null,
      "pdf_url": "https://arxiv.org/pdf/1706.03762",
      "source": "arxiv",
      "is_open_access": true,
      "url": "https://arxiv.org/abs/1706.03762",
      "citation_count": 50000,
      "relevance_score": null,
      "uid": "10.48550/arxiv.1706.03762"
    }
  ],
  "errors": []
}

status is "ok" (errors are non-fatal warnings from individual sources, not fatal failures). uid is the deduplication key used by the cache: prefers DOI → arxiv_id → pii → title slug. Fields are always present; unavailable values are null.

JSON schema — similar

Same as above but with an extra "seed" key:

{
  "status": "ok",
  "seed": "Attention Is All You Need",
  "query": "10.48550/arXiv.1706.03762",
  "count": 10,
  "papers": [...],
  "errors": []
}

Exit code is 0 on success, 1 on fatal failure (bad identifier, no results).

Agent scripting — bash

result=$(mosaic search "transformer architecture" --max 30 --oa-only --json)
count=$(echo "$result" | jq '.count')
dois=$(echo "$result" | jq -r '.papers[].doi | select(. != null)')
pdfs=$(echo "$result" | jq -r '.papers[] | select(.pdf_url != null) | .doi')
echo "Found $count papers, $(echo "$pdfs" | wc -l) with PDF"

Agent scripting — Python

import json, subprocess

def mosaic_json(args: list[str]) -> dict:
    r = subprocess.run(["mosaic"] + args, capture_output=True, text=True, check=False)
    if r.returncode != 0 and not r.stdout.strip():
        raise RuntimeError(f"mosaic failed: {r.stderr}")
    return json.loads(r.stdout)

# Search and parse
data = mosaic_json(["search", "FDTD high-order", "--max", "25", "--json"])
papers = data["papers"]
oa_papers = [p for p in papers if p["is_open_access"]]
print(f"Found {data['count']} papers, {len(oa_papers)} open-access")

# Find similar to the most-cited result
top = max(papers, key=lambda p: p["citation_count"] or 0)
if top["doi"]:
    related = mosaic_json(["similar", top["doi"], "--max", "10", "--json"])

search Command

mosaic search "query" [OPTIONS]

Option	Default	Description
`--max`, `-n`	10	Max results per source
`--source`, `-s`	(all)	Limit to one source shorthand (see table below)
`--oa-only`	off	Open-access papers only
`--pdf-only`	off	Papers with downloadable PDF only
`--year`, `-y`	—	Year filter: `"2020"`, `"2020-2024"`, or `"2020,2022,2024"`
`--author`, `-a`	—	Author name filter (repeatable)
`--journal`, `-j`	—	Journal name filter (substring match)
`--field`, `-f`	`all`	Scope query to `"title"`, `"abstract"`, or `"all"`
`--raw-query`	—	Send query directly to source API, bypass field transforms
`--sort`	—	Sort order: `"citations"`, `"year"`, or `"relevance"`
`--download`, `-d`	off	Download available PDFs after search
`--output`, `-o`	—	Save results to file (`.md`, `.csv`, `.json`, `.bib`, `.ris`); repeatable
`--cached`	off	Search only the local cache — no network requests
`--semantic`	off	Search local vector index by meaning (requires `mosaic index` + embedding model); shows Sim. column
`--downloaded-only`	off	Restrict to papers with a locally downloaded PDF (only with `--cached` or `--semantic`)
`--prefer-cache`	off	Prefer richer cached records over freshly fetched data
`--stats`	off	Print per-source counts and deduplication stats
`--zotero`	off	Export results to Zotero
`--zotero-collection`	—	Zotero collection name (created if missing)
`--obsidian`	off	Export results as notes to an Obsidian vault
`--json`	off	Emit structured JSON to stdout (suppresses table output)

Source Shorthands

Shorthand	Source	Coverage	Auth
`arxiv`	arXiv	Physics, CS, Math, Biology	None
`ss`	Semantic Scholar	214 M papers, all disciplines	Optional key
`sd`	ScienceDirect	Elsevier journals & books	API key or browser
`sp`	Springer (browser)	Springer, Nature (browser)	`[browser]` extra
`springer`	Springer API	OA Springer/Nature articles	Free API key
`doaj`	DOAJ	8 M+ fully OA articles	None
`epmc`	Europe PMC	45 M biomedical papers	None
`oa`	OpenAlex	250 M+ works	None
`base`	BASE	300 M+ from 10k+ repos	None
`core`	CORE	200 M+ OA full-text	Free API key
`ads`	NASA ADS	Astronomy & astrophysics	Free API token
`ieee`	IEEE Xplore	5 M+ IEEE papers	Free API key
`zenodo`	Zenodo	3 M+ OA research outputs	None
`crossref`	Crossref	150 M+ DOI registry	None
`dblp`	DBLP	6 M+ CS publications	None
`hal`	HAL	1.5 M+ French academic OA	None
`pubmed`	PubMed	35 M+ biomedical citations	Optional key
`pmc`	PubMed Central	5 M+ free full-text biomedical	Optional key
`rxiv`	bioRxiv/medRxiv	Life science preprints	None
`pedro`	PEDro	Physiotherapy evidence	Fair-use ack
`scopus`	Scopus	90 M+ Elsevier citations	API key or browser

get Command

mosaic get <doi>                # single DOI — fetch metadata + download PDF
mosaic get --from refs.bib      # bulk-download from BibTeX file
mosaic get --from library.csv   # bulk-download from CSV file (must have 'doi' column)

Options: --oa-only, --download-dir, --zotero, --zotero-collection, --obsidian.

cite Command

Format and print a citation string for a paper by DOI. Checks the local cache first; falls back to Crossref on a cache miss. BibTeX is rendered locally; all other styles use Crossref content negotiation (network required).

mosaic cite <doi>                        # BibTeX (default) — no network if cached
mosaic cite <doi> --style apa            # APA via doi.org content negotiation
mosaic cite <doi> --style mla
mosaic cite <doi> --style chicago
mosaic cite <doi> --style harvard
mosaic cite <doi> --style vancouver
mosaic cite <doi> --style apa --copy     # copy to clipboard (pbcopy/xclip/clip fallback)

Option	Default	Description
`--style`, `-s`	`bibtex`	Citation style; tab-completes: `bibtex apa mla chicago harvard vancouver`
`--copy`, `-c`	off	Copy result to clipboard

similar Command

mosaic similar 10.48550/arXiv.1706.03762   # by DOI
mosaic similar arxiv:1706.03762             # by arXiv ID
mosaic similar <doi> --max 20 --sort citations --json

Uses OpenAlex related_works (always) and Semantic Scholar recommendations (when API key is configured). Options are the same as search minus --source and --year.

Export Formats

Extension	Format
`.bib`	BibTeX
`.ris`	RIS (Mendeley, Endnote, Reference Manager)
`.csv`	CSV table
`.json`	JSON array of paper objects
`.md` / `.markdown`	Markdown table

# Save to multiple formats in one command
mosaic search "deep learning" --output refs.bib --output summary.md

network Command

Explore the local citation graph built by mosaic index --enrich-citations.

mosaic network [OPTIONS]

Option	Default	Description
`--query`, `-q`	—	Seed graph from cached papers matching this query (BFS subgraph)
`--depth`	2	Citation hops to follow from seed papers
`--min-connections`	1	Exclude papers with fewer edges than this
`--cluster`	off	Group papers into topic clusters (Louvain if `networkx` installed, else connected components)
`--output`, `-o`	—	Write graph to file: `.json` (D3/Gephi node-link), `.gv` (Graphviz DOT), `.md` (Mermaid)
`--top`	5	Most-connected papers to show per cluster in terminal output

Requires citation edges — run mosaic index --enrich-citations first. Louvain clustering requires networkx: pipx inject mosaic-search networkx.

# Most-connected papers in the full graph
mosaic network --top 10

# Topic subgraph with community clusters
mosaic network --query "transformer attention" --depth 2 --cluster --top 5

# Export for downstream tools
mosaic network --output graph.json   # D3.js / Gephi / NetworkX
mosaic network --output graph.gv     # Graphviz: dot -Tpng graph.gv -o graph.png
mosaic network --output graph.md     # Mermaid diagram for README / Obsidian

# Combine: topic subgraph → cluster report → save Mermaid
mosaic network --query "diffusion models" --cluster --top 5 --output diffusion.md

JSON node-link schema

{
  "nodes": [
    {
      "id": "doi:10.48550/arxiv.1706.03762",
      "title": "Attention Is All You Need",
      "year": 2017,
      "authors": "Vaswani et al.",
      "citation_count": 85000,
      "cluster": 0
    }
  ],
  "links": [
    { "source": "doi:10.48550/...", "target": "doi:10.18653/..." }
  ]
}

cluster is null when --cluster is not used.

compare Command

Generate a structured comparison table across cached papers. With a configured LLM, extracts dimensions from each paper's title + abstract. Without one, populates only metadata fields and prints a notice — never fails silently.

mosaic compare [OPTIONS]

Option	Default	Description
`--query`, `-q`	—	Filter papers from cache by title/abstract
`--from`	—	Load papers from a `.bib` or `.csv` file
`--max`, `-n`	20	Maximum number of papers to compare
`--dimensions`	`method,dataset,metric,result`	Comma-separated comparison axes
`--output`, `-o`	—	Write table to file: `.md`, `.csv`, `.json`
`--sort`	—	Pre-sort papers: `citations` (most cited first) or `year` (newest first)

# Compare top-cited cached papers on a topic (LLM fills in method/dataset/metric/result)
mosaic compare --query "diffusion models" --sort citations -n 15

# Save as Markdown
mosaic compare --query "transformer attention" --output comparison.md

# Custom dimensions from a BibTeX file
mosaic compare --from refs.bib --dimensions "method,dataset,BLEU,limitations"

# Export as CSV for Excel / Google Sheets
mosaic compare --query "GNN" -n 20 --output gnn-comparison.csv

# Export as JSON for scripting
mosaic compare --query "protein folding" --output folding.json

Metadata-only dimensions (no LLM needed): year, source, journal, doi, authors, citations. All other dimension names require an LLM and return – without one.

LLM setup (same config as RAG):

mosaic config --llm-provider openai --llm-api-key YOUR_KEY
# or Anthropic:
mosaic config --llm-provider anthropic --llm-api-key YOUR_KEY
# or local Ollama:
mosaic config --llm-provider openai --llm-base-url http://localhost:11434/v1 --llm-api-key ollama

RAG Commands

# 1. Build/update the vector index (incremental — already-indexed papers are skipped)
mosaic index

# 2. Semantic search — retrieve by meaning, no LLM needed at query time
mosaic search "methods that learn without labels" --semantic          # ranked paper list + Sim. column
mosaic search "attention mechanism" --semantic --downloaded-only      # only papers on disk
mosaic search "diffusion model" --semantic -n 20 --sort citations     # sort by citations after retrieval

# 3. Single-shot analysis (LLM required)
mosaic ask "What FDTD schemes achieve high-order accuracy in time?" --mode synthesis
mosaic ask "What open problems remain in discontinuous Galerkin methods?" --mode gaps
mosaic ask "Compare DDPM, DDIM, and score SDE" --mode compare --output report.md
mosaic ask "Extract all methods with accuracy claims" --mode extract

# 4. Interactive session
mosaic chat

--semantic: embeds the query and retrieves top-k papers from the vector index. Shows a Sim. column (0–1). No LLM needed at query time. Requires mosaic index + embedding model.

Modes for mosaic ask: synthesis (state of the art), gaps (open problems), compare (side-by-side methods), extract (structured per-paper data extraction).

Requires sqlite-vec (pipx inject mosaic-search sqlite-vec) and a configured embedding model

LLM. See mosaic config --embedding-model ... / --llm-provider ....

Configuration

# View full config (TOML-formatted)
mosaic config --show

# Essential setup
mosaic config --unpaywall-email you@example.com   # enables Unpaywall PDF fallback

# API keys
mosaic config --elsevier-key YOUR_KEY             # ScienceDirect
mosaic config --ss-key YOUR_KEY                   # Semantic Scholar
mosaic config --springer-key YOUR_KEY             # Springer API
mosaic config --ads-key YOUR_KEY                  # NASA ADS
mosaic config --ieee-key YOUR_KEY                 # IEEE Xplore

# LLM (for RAG and relevance ranking)
mosaic config \
  --llm-provider openai \
  --llm-api-key YOUR_KEY \
  --llm-model gpt-4o-mini

# Ollama (local LLM — no data leaves your machine)
mosaic config \
  --embedding-model snowflake-arctic-embed2 \
  --embedding-base-url http://localhost:11434/v1 \
  --embedding-api-key ollama \
  --llm-provider openai \
  --llm-base-url http://localhost:11434/v1 \
  --llm-api-key ollama \
  --llm-model llama3.2

# Enable/disable sources
mosaic config --enable-source scopus
mosaic config --disable-source pedro

# Download location
mosaic config --download-dir ~/papers/

Config file: ~/.config/mosaic/config.toml Cache DB: ~/.local/share/mosaic/cache.db Default downloads: ~/mosaic-papers/

AI Agent Workflow: Building a Bibliography

import json, subprocess
from pathlib import Path

def mosaic(args: list[str]) -> dict:
    r = subprocess.run(["mosaic"] + args, capture_output=True, text=True, check=False)
    if r.returncode != 0 and not r.stdout.strip():
        raise RuntimeError(r.stderr)
    return json.loads(r.stdout)

# --- Step 1: Search multiple related queries ---
all_papers: list[dict] = []
queries = [
    "transformer self-attention",
    "BERT language model pre-training",
    "GPT autoregressive language model",
]
for q in queries:
    data = mosaic(["search", q, "--max", "15", "--oa-only", "--json"])
    all_papers.extend(data["papers"])

# --- Step 2: Deduplicate by uid (DOI / arXiv ID) ---
seen: set[str] = set()
unique: list[dict] = []
for p in all_papers:
    if p["uid"] not in seen:
        seen.add(p["uid"])
        unique.append(p)

# --- Step 3: Expand with similar papers for the top-cited seed ---
most_cited = max(unique, key=lambda p: p["citation_count"] or 0)
if most_cited.get("doi"):
    related = mosaic(["similar", most_cited["doi"], "--max", "10", "--json"])
    for p in related["papers"]:
        if p["uid"] not in seen:
            seen.add(p["uid"])
            unique.append(p)

# --- Step 4: Export the cached results to BibTeX ---
# (mosaic cache already has all papers from steps 1-3)
subprocess.run(["mosaic", "search", queries[0], "--cached", "--output", "bibliography.bib"])

# --- Step 5: Download all OA PDFs ---
for p in unique:
    if p["pdf_url"] and p.get("doi"):
        subprocess.run(["mosaic", "get", p["doi"]])

# --- Step 6: Index, enrich citations, and ask ---
subprocess.run(["mosaic", "index", "--enrich-citations"])
subprocess.run(["mosaic", "ask", "Summarise the evolution of attention mechanisms",
                "--mode", "synthesis", "--output", "synthesis.md"])

# --- Step 7: Explore the citation network ---
subprocess.run(["mosaic", "network", "--query", "attention mechanism",
                "--cluster", "--top", "5", "--output", "network.md"])

# --- Step 8: Compare methods across top-cited papers ---
subprocess.run(["mosaic", "compare", "--query", "attention mechanism",
                "--sort", "citations", "-n", "20", "--output", "comparison.md"])

Zotero Integration

# Push search results to a Zotero collection (Zotero must be running)
mosaic search "deep learning" --max 20 --zotero --zotero-collection "Deep Learning"

# Push + download PDFs
mosaic search "protein folding" --oa-only --download --zotero --zotero-collection "Bioinformatics"

# Bulk-download an existing .bib file and send to Zotero
mosaic get --from refs.bib --zotero --zotero-collection "Imported"

# Web API (no Zotero app needed)
mosaic config --zotero-key YOUR_WEB_API_KEY
mosaic search "FDTD" --zotero

Obsidian Integration

mosaic config --obsidian-vault ~/Notes
mosaic search "quantum computing" --obsidian --obsidian-folder "Papers/Quantum"

Each note gets YAML frontmatter, >[!abstract] callout, metadata table, and [[wikilinks]].

Skill Installation

# Install to current project's .claude/skills/mosaic/ — enables /mosaic in this project
mosaic skill install

# Install globally to ~/.claude/skills/mosaic/ — enables /mosaic in all projects
mosaic skill install --global

# Inspect the bundled skill content
mosaic skill show

After installation, restart Claude Code or open a new session. The /mosaic slash command will be available in that project's Claude Code context.

Full Documentation

https://szaghi.github.io/mosaic/